popplerextractor: don't try to guess the title if there isn't one.
ClosedPublic

Authored by flameeyes on Sep 27 2017, 11:34 AM.

Details

Summary

Guessing the title of a document from metadata is not a winning strategy. The textbox with the biggest font may not be the title at all, could be an author, or a barcode using barcode image fonts [1].

The added heuristics of expecting a space ("very unlikely" would be the case for most academic papers, but it is very possible for a document to just be called "Statement" or "Invoice"), and even worse rejecting any title that has the word "Microsoft" (to work around documents exported by Word) is very finnicky, as it would reject titles such as "Analysis of somethingsomething feature on Microsoft Windows", which is clearly a valid title.

Instead, don't try to be smart about the title extraction. Trust what the file says its title is, and if there is no title, live with it. This is simpler, faster and requires less code.

[1] For instance FiServ-generated creditcard statements. See https://blog.flameeyes.eu/2017/09/how-i-leaked-my-own-credit-card-number/ for details.

Diff Detail

Repository
R286 KFileMetaData
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
flameeyes created this revision.Sep 27 2017, 11:34 AM
Restricted Application added a project: Frameworks. · View Herald TranscriptSep 27 2017, 11:34 AM
Restricted Application added a subscriber: Frameworks. · View Herald Transcript
flameeyes edited the summary of this revision. (Show Details)Sep 27 2017, 11:37 AM
da awarded a token.Sep 28 2017, 4:49 PM
da rescinded a token.
da awarded a token.
ngraham edited reviewers, added: aacid; removed: Okular.
aacid resigned from this revision.Sep 29 2017, 8:58 PM

I'm not the one that wrote this code, i have zero input on whether this is a good idea or not or if it should be removed or not

All i can do is offer the 1688 pdf files i have lying around to test the difference before and after running some code.

mgallien accepted this revision.Nov 26 2017, 10:32 PM
mgallien added subscribers: vhanda, mgallien.

Sorry for me being late to review your work. I had not noticed it.
Thanks for your work.
I am all for this change.
The code you are removing is not currently covered by the automatic tests and is different from the same code in Okular (generator_pdf.cpp:705).
Even if the result of the current automatic test should not be affected by this change, could you please accept the offer from @aacid and test with some of his documents (or better with all) ?

I have another question also related to some bugfixes I have done in KFileMetaData. Do you have an idea how to trigger an update of the Baloo database since your changes may modify the title of documents people have in the Baloo database ?

@vhanda do you know if there is something to do to update Baloo database when metadata returned by KFileMetaData are changed even if the file itself did not change ?

This revision is now accepted and ready to land.Nov 26 2017, 10:32 PM

@vhanda do you know if there is something to do to update Baloo database when metadata returned by KFileMetaData are changed even if the file itself did not change ?

I'm starting to forget the code base and what all I implemented. But I don't think such a thing was implemented.

To give some history - Strigi used to do this guessing, and when I implemented KFileMetaData, I decided to do the same. I don't have an opinion on whether it is a good idea or not.

@vhanda Thanks a lot to have took time to answer. I really appreciate your help.
I took a long time to understand that fixing or modifying the extractors is not enough to get modified data when querying Baloo. I would like to fix that in the future.

Do you have any idea how to do that while still preserving the performance of Baloo and a good experience for the users ?

With regard to this diff, I hope that testing a lot of PDF could help deciding on it. I am still a bit on the side of having the same behavior in Okular and KFileMetaData.

About me it's a good feature after all. It should be simplified e.g. only when title is empty and parse only header of first page.

@flameeyes, @mgallien: Any reasons why this is not landed?

Restricted Application added a project: Baloo. · View Herald TranscriptMar 11 2018, 3:47 PM
bruns added a subscriber: bruns.Apr 15 2018, 11:12 PM

So... are we landing this?

Restricted Application added subscribers: Baloo, kde-frameworks-devel. · View Herald TranscriptMay 9 2018, 6:49 PM
bruns accepted this revision.May 9 2018, 10:01 PM

I think this is sensible, to much code for questionable results.

This revision was automatically updated to reflect the committed changes.
szafar added a subscriber: szafar.Aug 2 2018, 5:54 PM

Just wanted to chime in to say how my workflow depended on this bit of code.

I compiled the tests/dump.cpp tool and used it as a CLI PDF metadata dumper.

I then wrote this script which renames a PDF file according to its title.

I use this script to batch rename research papers I download and it works pretty well.

This removal doesn't really affect me and is probably for the better.

cfeck added a subscriber: cfeck.Aug 2 2018, 9:30 PM

I use this script to batch rename research papers I download and it works pretty well.

Wait, you don't like your PDF's named "ilm1-2007000278.pdf" or "robinson_tr156.pdf"?

Thanks for the script :)