Search function fails to find phrases split over two lines
Needs ReviewPublic

Authored by joaonetto on Mar 13 2019, 4:15 AM.

Details

Reviewers
None
Group Reviewers
Okular
Summary

As reported by users, search would not find if the text was split across multiple lines, this fixes it, I added tests to prove it, there's the PDF in the bug report for test.

BUG: 376692

Test Plan

Open PDF provided by user.
Search for United States.
Find three different ocurrences.

Diff Detail

Repository
R223 Okular
Branch
arcpatch-D19717
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 11116
Build 11134: arc lint + arc unit
joaonetto created this revision.Mar 13 2019, 4:15 AM
Restricted Application added a project: Okular. · View Herald TranscriptMar 13 2019, 4:15 AM
Restricted Application added a subscriber: okular-devel. · View Herald Transcript
joaonetto requested review of this revision.Mar 13 2019, 4:15 AM

Thanks so much for this patch, and for including a test too! It works great for me. I'll let the Okular folks handle the rest of this show. :)

This comment was removed by joaonetto.
aacid added a subscriber: aacid.Mar 26 2019, 11:30 PM

I'm not very convinced with the tests.

You have \n but the rects are all one after the other, so the algorithm that does the line partition doesn't really see a line there, you should have them in different row rects so that they are actually different physical lines.

The "Across Lines not find" doesn't really test what you want, you "want" to test that it finds the last entry, but since the only thing you're checking is that it finds stuff and not in which rects it finds them, it may very well be finding "ab\na"

Same thing for Hyphen doesn't count

Am i making sense?

core/textpage.cpp
867

const

986

const

Are you okay with the method?

I'll take a deeper look at the tests ASAP.

Are you okay with the method?

I don't know much about the searching code, that's why i want nice tests that prove the new code works :)

joaonetto updated this revision to Diff 54905.EditedMar 26 2019, 11:57 PM
joaonetto marked 2 inline comments as done.
  • Well, now it proves it doesn't find anything, is this sufficient?

It shouldn't find anything, and it doesn't.
If not, I'll take a deeper look when I get home, this was just a top of my head solution.

joaonetto updated this revision to Diff 54906.Mar 27 2019, 12:02 AM

Doesn't find anything no matter what direction.

Ping on this.

There's a downside to this is that now if you actually put a newline character in the search it will fail where previously it worked.

On the other hand typing an actual newline character is kind of hard (i had to copy it from a newline in kate) so maybe we can just accept that noone really knew how to do that :D

Opinions?

core/textpage.cpp
865–867

strNotUsed seems like a weird variable name given you actually use it the two following names.

Would something like origStr or something make more sense?

866

given that you're modifying this line, make len const :)

Also are you totally sure this needs to be over strNotUsed and not str? Hmmm i guess it works both ways since str is the same length as strNotUsed

joaonetto marked an inline comment as done.Apr 22 2019, 11:46 AM

In my opinion is better to have it this way, I guess most users don't know about the newline.

Do you have an opinion in this @ngraham ?

core/textpage.cpp
865–867

Yes, I will make it origStr, it is better.

Unfortunately curEntity->text() needs a const variable.

I would put it in str, but it would mess up the stringLenghtAdaptedWithHyphen.

866

I need it to be strNotUsed because I'm making str const, and I'm replacing the '\n' as soon as I create it.

It would mess up the stringLenghtAdaptedWithHyphen because it would have the "-\n" text entities as " "

There's a downside to this is that now if you actually put a newline character in the search it will fail where previously it worked.

On the other hand typing an actual newline character is kind of hard (i had to copy it from a newline in kate) so maybe we can just accept that noone really knew how to do that :D

Opinions?

The use case to insert a newline could be:
Someone is testing a LaTeX document, and searches for a passage from the source code (maybe some verbatim code).
One would just copy the passage in Kate and paste it into the search box in Okular.

joaonetto marked 4 inline comments as done.EditedApr 22 2019, 1:16 PM

There's a downside to this is that now if you actually put a newline character in the search it will fail where previously it worked.

On the other hand typing an actual newline character is kind of hard (i had to copy it from a newline in kate) so maybe we can just accept that noone really knew how to do that :D

Opinions?

The use case to insert a newline could be:
Someone is testing a LaTeX document, and searches for a passage from the source code (maybe some verbatim code).
One would just copy the passage in Kate and paste it into the search box in Okular.

Hmmmmm, I can treat things differently if there's a new line in the search box.

Or maybe I should find a way to treat spaces the same as newline chars.

I'll have to take a look in the compare function to see how it would handle, maybe compare alphanumeric only?

If I can't do that, I'll have to treat them differently.

Also, isn't newline supposed to be used with regex search? It seems like a very specific case.

There's a downside to this is that now if you actually put a newline character in the search it will fail where previously it worked.

On the other hand typing an actual newline character is kind of hard (i had to copy it from a newline in kate) so maybe we can just accept that noone really knew how to do that :D

Opinions?

The use case to insert a newline could be:
Someone is testing a LaTeX document, and searches for a passage from the source code (maybe some verbatim code).
One would just copy the passage in Kate and paste it into the search box in Okular.

Hmmmmm, I can treat things differently if there's a new line in the search box.

Or maybe I should find a way to treat spaces the same as newline chars.

I'll have to take a look in the compare function to see how it would handle, maybe compare alphanumeric only?

If I can't do that, I'll have to treat them differently.

I think it would make sense to treat all whitespace equally. At least on paper, there are several possible methods to disperse text. (Different line spacing, different line lengths...) So it is difficult to predict whether I should/shouldn’t put a \n into the search box.

joaonetto updated this revision to Diff 56760.Apr 22 2019, 5:55 PM

Made the necessary changes to work with any type of space.

I think it would make sense to treat all whitespace equally. At least on paper, there are several possible methods to disperse text. (Different line spacing, different line lengths...) So it is difficult to predict whether I should/shouldn’t put a \n into the search box.

It should work now, can you test it?

It should work now, can you test it?

Me? Don’t have a wall plug here, so can’t compile now.

Additionally, I didn’t really look at your code. But how about QString::simplified() instead of QString::replace(QRegExp)?

It probably will work, I'll try it

Made the necessary changes to work with any type of space.

I don't think this is the way to go.

If i took the work to figure out how to type a newline in the search line i want a newline not a space to be what is searched for.

Hmmmmm. I can think of checking if the '\n' is in the search bar.

What if we limit the '\n' search to case sensitive? That can be a way, only search what is typed.

Right now I don't see exactly how we would do it, because there's no way we can identify which '\n' is typed by the user, and what '\n' is on the PDF side.

In my honest opinion we shouldn't support for '\n' search, unless it's something like regex search, but I see you disagree.

In my honest opinion we shouldn't support for '\n' search, unless it's something like regex search, but I see you disagree.

I'm not saying to support \n that is not a character, i'm speaking of the actual newline character, which already works and breaks with this patch. I think it's not great that we modify what the user choses to search, but on the other hand i see were already running normalization on it so what do i know 🤷

BTW QRegexp is "old", should be using QRegularExpression instead ideally