Fix searching in RTL PDFs
Abandoned, Public

Authored by ngraham on Feb 4 2018, 3:25 PM.

Details

Reviewers
ltoscano
Group Reviewers
Okular
Summary

BUG: 207748

Since Arabic search does not work properly in all but PDF backends, this is a quick attempt to fix the problem. I assumed that the text in an Okular document is in logical order (which is a bug in itself), so mirroring the search text makes the search function work again.
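As a rough illustration of the mirroring idea (not the actual Okular code, which is C++), here is a minimal Python sketch. It assumes the page text is stored in the reverse of the order in which the user types the query; the names `mirror` and `find_mirrored` are hypothetical and exist only for this example. It also ignores combining marks, which a plain character reversal would misplace.

```python
def mirror(text: str) -> str:
    """Reverse the characters of a string (naive: ignores combining marks)."""
    return text[::-1]


def find_mirrored(page_text: str, query: str) -> int:
    """Search for the mirrored query; return the match index or -1."""
    return page_text.find(mirror(query))


# Assumption for this sketch: the stored page text is the typed text reversed.
typed_text = "שלום עולם"
page_text = mirror(typed_text)

query = "עולם"
plain_hit = page_text.find(query)              # -1: the query as typed is not found
mirrored_hit = find_mirrored(page_text, query) # 0: the mirrored query is found
```

The point is only that reversing the query compensates for the reversed storage order; it does nothing to fix the stored text itself.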

The limitation:

  • You cannot search for Arabic and English text together.
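A small Python sketch of why mixed queries fail under whole-query mirroring. It assumes that on the page only the RTL run is stored reversed while the Latin run keeps its normal order; the exact layout depends on the bidirectional algorithm, so treat the sample `page_text` as a hypothetical case. The helper `mirror` is likewise made up for the example.

```python
def mirror(text: str) -> str:
    """Reverse the characters of a string (naive: ignores combining marks)."""
    return text[::-1]


# Hypothetical stored text: Latin run in normal order, Hebrew run reversed.
page_text = "PDF " + mirror("שלום")

mixed_query = "שלום PDF"

# Mirroring the whole query also reverses "PDF" into "FDP", so nothing matches.
mixed_found = mirror(mixed_query) in page_text   # False

# Each script still matches when searched separately.
latin_found = "PDF" in page_text                 # True
rtl_found = mirror("שלום") in page_text          # True
```

Handling mixed queries would need the query split into directional runs, which is beyond what this patch attempts.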

Future work:
We need to check that the text generated by Poppler is placed in visual order, so that when we copy it and paste it into a text editor it is still readable.

Test Plan

Migrated this patch from https://git.reviewboard.kde.org/r/125442/ since it had whitespace errors and the submitter disappeared.

  • Okular compiles and all tests pass (except for parttest, which was already failing in master)
  • I don't have any RTL PDFs, nor can I read or write any RTL language, so I was unable to test the functionality. However, people on the Review Board page said it worked, and the diff is the same.

Diff Detail

Repository
R223 Okular
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
ngraham created this revision. Feb 4 2018, 3:25 PM
Restricted Application added a project: Okular. Feb 4 2018, 3:25 PM
ngraham requested review of this revision. Feb 4 2018, 3:25 PM
ltoscano resigned from this revision. Feb 4 2018, 3:39 PM
ltoscano added a subscriber: ltoscano.

Please replace "migrated from..." with the proper content from the old Review Board patch, and resubmit it using the original author.
The note about "this was in reviewboard" should not be in the final commit message, but the original content should be.

ngraham updated this revision to Diff 26510. Feb 4 2018, 3:40 PM

Update author

ngraham edited the summary of this revision. Feb 4 2018, 3:44 PM
ngraham edited the test plan for this revision.

I tested Okular with the patch, using two PDF files in Hebrew. I attached them so others can test. One was downloaded using Wikipedia's Download-as-PDF option. The other came from a random search result when looking for Hebrew PDFs.


The results are as follows:

  1. Okular was able to find the text I was searching for (success).
  2. However, it searches each line from left to right rather than from right to left (the reading and writing direction). When the text occurs more than once on the same line, it finds the last occurrence first and the first occurrence last. I'm attaching a GIF to illustrate this.

  3. I think the problem arises because Okular treats the whole text as if it were typed backwards. For example, text copied from Okular is pasted backwards, whereas copying the same text from Firefox (used as a PDF reader) gives the correct result. I'm attaching another GIF to illustrate this.
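The ordering effect in the second point can be sketched in Python. This assumes, for illustration only, that a line is stored as its typed text reversed and that the search scans it from index 0 upward; `mirror` and `find_all` are hypothetical helpers, not Okular functions.

```python
def mirror(text: str) -> str:
    """Reverse the characters of a string (naive: ignores combining marks)."""
    return text[::-1]


def find_all(text: str, needle: str) -> list[int]:
    """Return the start index of every occurrence, scanning left to right."""
    hits = []
    start = text.find(needle)
    while start != -1:
        hits.append(start)
        start = text.find(needle, start + 1)
    return hits


typed_line = "אב גד אב"           # the word "אב" appears first and last
stored_line = mirror(typed_line)  # assumed storage order for this sketch

# A left-to-right scan of the stored line reports the leftmost match first.
scan_order = find_all(stored_line, mirror("אב"))   # [0, 6]

# In an RTL line the rightmost match is read first, so reading order is the
# scan order reversed.
reading_order = sorted(scan_order, reverse=True)   # [6, 0]
```

Under these assumptions, stepping through matches in descending index order within a line would follow the reading direction.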

So with regard to usability, the current patch is better than nothing. It enables searching for text written in an RTL language and should be adopted.

In general, Okular might need some improvements with regard to RTL languages (Hebrew, Arabic, Persian, Yiddish). According to Wikipedia (https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers), there are more than 550 million speakers of those languages.

ngraham abandoned this revision. Feb 23 2018, 5:01 PM

Thanks for the test! However, the original author of this patch recently reappeared here on Phabricator and submitted a better one: D10455: Add RTL support for search, copy & paste in pdf.

I'm closing this patch in favor of his. Would you mind testing that? Thanks again!

The problem is not only with search: copying text also produces mirrored text. It seems to me that Okular treats all text and words as LTR.

Restricted Application added a subscriber: okular-devel. Jul 29 2018, 10:23 AM
This comment was removed by userkde.

@ngraham Thank you for your work on this bug.

I would like to know whether there is any news on this bug. The problem is not only with search: copying RTL text also produces mirrored text. Maybe the problem is deeper than it seems. I think Okular treats all text and words as LTR, even RTL text.

Thanks! Just so you know, the work in this patch moved to D10455: Add RTL support for search, copy & paste in pdf.