Add RTL support for search, copy & paste in pdf
Needs RevisionPublic

Authored by fahadalsaidi on Feb 12 2018, 9:23 AM.

Details

Reviewers
aacid
ltoscano
Group Reviewers
Okular
Summary

Right now all text are sorted in LTR order in correctTextOrder().
Since poppler has fixed it since 0.40, I've just ported reorderTex()
from poppler to okular and it works. I am not sure what is unicodeTypeL()
& unicodeTypeR() mean in poppler context so I just use isRightToLeft() instead.

This patch intended to fix the following bugs:
BUG: 207748
BUG: 184399

Diff Detail

Repository
R223 Okular
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
fahadalsaidi requested review of this revision.Feb 12 2018, 9:23 AM
fahadalsaidi created this revision.
aacid added a comment.Feb 12 2018, 9:32 PM

Thanks for trying to fix this :)

Commented code should not be there.

this needs an autotest though to make sure we don't break it in the future

Are you sure this needs to be done at this level? i.e. have you tried this on xps/chm/djvu/dvi with RTL text to make sure you're not breaking that?

Also, in the poppler bug you mentioned this should happen in poppler, and now you're making a patch for okular, will you close the bug in poppler then? Or should this happen in poppler?

fahadalsaidi edited the summary of this revision. (Show Details)

Well, I was over-optimistic. This patch is related only to BUG: 207748 & BUG:184399. The other back-ends have many problem not related to this patch as following:

  • xps: okular crash when open arabic xps file and if it can open it, it renders it in wrong way.
  • dejuv: searching & copying Arabic text are fine since dejuv back-end has it own search function.
  • odt: there is broken in LTR or RTL. There is no text to search at all.
  • chm: Although okular can open it but searching is bad for LTR & RTL text, it is basically broken.

@aacid the poppler bug should be fixed, so anybody using QT interface can get the right order for RTL langs. Okular has its own way to sort words.

@aacid I want to write autotest but there is a bug #390357 in cmake test. I will try make autotest smiliar to test311232() but there is a problem, the pdf file doesn't load probably and it give me in an empty m_generator. please have a look on it.

fahadalsaidi edited the summary of this revision. (Show Details)

add autotest for searching in Arabic

fahadalsaidi retitled this revision from add rtl support to textpage to Add RTL support for search, copy & paste in pdf.Feb 13 2018, 10:10 AM
fahadalsaidi edited the summary of this revision. (Show Details)

@aacid any feedback?

I'm still unconvinced the text reversion is happening at the proper level.

Anyway meanwhile please change the test so that uses qtry_compare instead of the qtimer loop (i just fixed that for the other test)

And you have a memory leak in your function as reported by
valgrind --leak-check=full ./autotests/searchtest
please fix.

  • use QTRY_COMPARE

@aacid Thanks for the feedback.

I'm still unconvinced the text reversion is happening at the proper level.

I really a new to okular's code, so please guide me to the right direction.

And you have a memory leak in your function as reported by

valgrind shows that the leak happen in:

makeWordFromCharacters(QList<TinyTextEntity*> const&, int, int) (textpage.cpp:1209)

which needs to be fix but not in this patch since it is irrelevant to it.

aacid added a comment.Feb 19 2018, 8:26 AM

@aacid Thanks for the feedback.

And you have a memory leak in your function as reported by

valgrind shows that the leak happen in:

makeWordFromCharacters(QList<TinyTextEntity*> const&, int, int) (textpage.cpp:1209)

which needs to be fix but not in this patch since it is irrelevant to it.

No, the new code is causing that leak.

@fahadalsaidi, I recently migrated an older patch of yours from Reviewboard to Phabricator (D10298), but I just closed it in favor of this one. If D10298 is still relevant even with this patch, please commandeer and re-open it. Thanks!

I've tested this patch using 2 PDF files: One in Hebrew (it was downloaded using Wikipedia's Download-as-PDF option) the other in Arabic (it was supplied with the patch).

The results are as follows:

  1. Okular was able to find the text I was searching for (Success).
  1. But it is looking for the text in random areas. This is caused, probably, because Okular perceives each bulk of few words as located in different areas. And not as it should:

First line: first word to the right, second word to the right, third...
Second line: first word to the right, second word to the right, third...
Third line...
..
..
Last line: first word to the right, second word to the right, third..., last word to the right.

I'm attaching a gif to illustrate this (I'm looking for the first word in the text, a 3-letter word 'קוד' (which translates to code):

  1. Okular is unable to copy RTL text correctly. The order of the words gets mixed. This can be seen when trying to select the text - the way in which the text gets selected isn't intuitive, and can also be seen when pasting the selected text - in comparison to the original text. Here is a gif to illustrate this:

  1. The same problems occur when testing with the PDF in Arabic. Selecting the text is messy and searching for a letter in the text jumps between lines up and down.

Here is a gif to illustrate this (I'm looking for a common letter from the Arabic alphabet, the letter 'ا' (Alif, the first letter in the Arabic alphabet)):

To conclude:
In regards to searching:
unpatched okular: You need to type the text in reverse in order to search it.
okular + D10455: You can partially search by typing the text in the correct way, but the results will be scattered and not in the intuitive order of reading.
okualr + D10298: You can partially search by typing the text in the correct way, but the results will come up for each line, in the opposite order, but at least it will jump from the first line, to the second, to the third, and not sporadically like D10455.

In regards to copying text:
unpatched okular: Will copy the text in reverse (not just the order of the words, but also the order of the letters in a word), so ABC DEFG HIJK will paste as KJIH GFED CBA.
okular + D10455: Will copy parts of the text in the correct way.
So for example: ABC DEFG HIJK LMNO PQR STUVW XYZ
Will be pasted as: HIJK LMNO ABC PQR STUVW DEFG XYZ
okualr + D10298: Like unpatched okular

In terms of usability, the ability to search is important, so D10298 is better than D10455 (as it enables one to search in each line from the end of the line to the beginning of the line, and not sporadically.
In regards to copying text, they're both not so good.

@chfanzil , thanks for testing and sharing your findings. The patch doesn't touch search logic , it deals only with inner text layer in okular.

@aacid sorry to get to you late but Unfortunately I couldn't figure out how to fix the leak, my knowledge in C++ & programming is limited. I am just trying here. In other hand why not make ouklar deal with poppler directly to search & get text, the same way it does with dejuv backend? I know it is a lot of work but Here I am just asking.

What's our path forward here?

Restricted Application added a subscriber: okular-devel. · View Herald TranscriptMay 14 2018, 5:20 PM
aacid added a comment.May 14 2018, 5:59 PM

What's our path forward here?

Someone should address the memory leaks and someone should address the questions about the impact of this to other formats other than PDF.

@fahadalsaidi, are you able to work on those?

@fahadalsaidi, are you able to work on those?

Sorry, I don't have knowledge & time to fix the memory leaks. For the other question , I've already answered, here is it:

The other back-ends have many problem not related to this patch as following:
  • xps: okular crash when open arabic xps file and if it can open it, it renders it in wrong way.
  • dejuv: searching & copying Arabic text are fine since dejuv back-end has it own search function.
  • odt: there is broken in LTR or RTL. There is no text to search at all.
  • chm: Although okular can open it but searching is bad for LTR & RTL text, it is basically broken.
aacid requested changes to this revision.Feb 21 2020, 6:56 PM

Please move as a Merge Request in https://invent.kde.org/kde/okular

We have pre-commit CI and lots of checks including clazy and clang-tidy there so it's a much better place for doing the review/approval/merge of the code.

This revision now requires changes to proceed.Feb 21 2020, 6:56 PM
yaron added a subscriber: yaron.Dec 27 2020, 12:13 PM

I'm still experiencing this bug.

For example:
If I want to find the word Shalom I need to search for molahS in order to find a match.

Anything I can help with regarding this patch?

I read the reported test results as if this doesn’t work correctly. However, without this patch it doesn’t work at all, so that may already be an argument to accept this patch.

Text handling and search is IMHO broken in general, and needs to be redone. The current approach reorders all TextEntities, to give a nice left-to-right-top-to-bottom order. That is nice in general, but I think it should be done only on paragraph level, not on character level.

What needs to happen next for this patch, is probably to rebase it on master, and submit it on invent.kde.org. Then we can see together how to fix the mentioned memory leaks.

Hey, Any progress on this?
What needs to change in core/textpage.cpp in order for this to work correctly?

sander added a subscriber: sander.Mar 17 2022, 12:33 PM

Step 1 would be to open a merge request at https://invent.kde.org/graphics/okular/-/merge_requests . Can you do that?

Hey,
So I sent a merge request for this, but the issue is now that clazy formatting in the ci/cd of the build requires these too lines be changed to something more readable:

// output a left-to-right section
for (j = i; j < origTxtOrder.length() && ! origTxtOrder.at(j)->text().isRightToLeft(); ++j) ;

and

for(j = i; j < origTxtOrder.length() && (origTxtOrder.at(j)->text().isRightToLeft() || origTxtOrder.at(j)->text() == QLatin1String(" ")); ++j);

I can't test because when I run this on my ubuntu I get that backends are missing. So I can't really change this stuff.

Can anyone help me out to turn this in to a code block?
@fahadalsaidi perhaps? Its pretty close to get merged.