popplerextractor: don't try to guess the title if there isn't one.
ClosedPublic
Actions

Authored by flameeyes on Sep 27 2017, 11:34 AM.

Details

Reviewers

mgallien
bruns
aacid

Group Reviewers

Frameworks

Commits

R286:1491c4a2d9fd: popplerextractor: don't try to guess the title if there isn't one.

Summary

Guessing the title of a document from metadata is not a winning strategy. The textbox with the biggest font may not be the title at all, could be an author, or a barcode using barcode image fonts [1].

The added heuristics of expecting a space ("very unlikely" would be the case for most academic papers, but it is very possible for a document to just be called "Statement" or "Invoice"), and even worse rejecting any title that has the word "Microsoft" (to work around documents exported by Word) is very finnicky, as it would reject titles such as "Analysis of somethingsomething feature on Microsoft Windows", which is clearly a valid title.

Instead, don't try to be smart about the title extraction. Trust what the file says its title is, and if there is no title, live with it. This is simpler, faster and requires less code.

[1] For instance FiServ-generated creditcard statements. See https://blog.flameeyes.eu/2017/09/how-i-leaked-my-own-credit-card-number/ for details.

Diff Detail

Repository

R286 KFileMetaData

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

flameeyes created this revision.Sep 27 2017, 11:34 AM

Restricted Application added a project: Frameworks. · View Herald TranscriptSep 27 2017, 11:34 AM

Restricted Application added a subscriber: Frameworks. · View Herald Transcript

flameeyes edited the summary of this revision. (Show Details)Sep 27 2017, 11:37 AM

da awarded a token.Sep 28 2017, 4:49 PM

da rescinded a token.

da awarded a token.

ngraham added reviewers: Frameworks, Okular.Sep 28 2017, 5:54 PM

ngraham edited reviewers, added: aacid; removed: Okular.

ngraham added a subscriber: ngraham.Sep 28 2017, 8:34 PM

I'm not the one that wrote this code, i have zero input on whether this is a good idea or not or if it should be removed or not

All i can do is offer the 1688 pdf files i have lying around to test the difference before and after running some code.

Sorry for me being late to review your work. I had not noticed it.
Thanks for your work.
I am all for this change.
The code you are removing is not currently covered by the automatic tests and is different from the same code in Okular (generator_pdf.cpp:705).
Even if the result of the current automatic test should not be affected by this change, could you please accept the offer from @aacid and test with some of his documents (or better with all) ?

I have another question also related to some bugfixes I have done in KFileMetaData. Do you have an idea how to trigger an update of the Baloo database since your changes may modify the title of documents people have in the Baloo database ?

@vhanda do you know if there is something to do to update Baloo database when metadata returned by KFileMetaData are changed even if the file itself did not change ?

This revision is now accepted and ready to land.Nov 26 2017, 10:32 PM

@vhanda do you know if there is something to do to update Baloo database when metadata returned by KFileMetaData are changed even if the file itself did not change ?

I'm starting to forget the code base and what all I implemented. But I don't think such a thing was implemented.

To give some history - Strigi used to do this guessing, and when I implemented KFileMetaData, I decided to do the same. I don't have an opinion on whether it is a good idea or not.

@vhanda Thanks a lot to have took time to answer. I really appreciate your help.
I took a long time to understand that fixing or modifying the extractors is not enough to get modified data when querying Baloo. I would like to fix that in the future.

Do you have any idea how to do that while still preserving the performance of Baloo and a good experience for the users ?

With regard to this diff, I hope that testing a lot of PDF could help deciding on it. I am still a bit on the side of having the same behavior in Okular and KFileMetaData.

About me it's a good feature after all. It should be simplified e.g. only when title is empty and parse only header of first page.

@flameeyes, @mgallien: Any reasons why this is not landed?

Restricted Application added a project: Baloo. · View Herald TranscriptMar 11 2018, 3:47 PM

bruns added a subscriber: bruns.Apr 15 2018, 11:12 PM

So... are we landing this?

Restricted Application added subscribers: Baloo, kde-frameworks-devel. · View Herald TranscriptMay 9 2018, 6:49 PM

I think this is sensible, to much code for questionable results.

Closed by commit R286:1491c4a2d9fd: popplerextractor: don't try to guess the title if there isn't one. (authored by flameeyes, committed by ngraham). · Explain WhyMay 10 2018, 3:03 AM

This revision was automatically updated to reflect the committed changes.

astippich mentioned this in D13685: allow metadata of tracks without album and tracknumber in database.Jun 24 2018, 8:53 AM

Just wanted to chime in to say how my workflow depended on this bit of code.

I compiled the tests/dump.cpp tool and used it as a CLI PDF metadata dumper.

I then wrote this script which renames a PDF file according to its title.

I use this script to batch rename research papers I download and it works pretty well.

This removal doesn't really affect me and is probably for the better.

In D8007#302464, @szafar wrote:

I use this script to batch rename research papers I download and it works pretty well.

Wait, you don't like your PDF's named "ilm1-2007000278.pdf" or "robinson_tr156.pdf"?

Thanks for the script :)

Revision Contents
Changeset List

			Path	Packages
M			src/extractors/popplerextractor.cpp (88 lines)

Diff	ID	Base	Description	Created	Lint	Unit
Base			Base
Diff 1	19974			Sep 27 2017, 11:33 AM	★	★
Diff 2	33921	97618af	R286:1491c4a2d9fd7adb075258d7eb65f4ae1d7169f4	May 10 2018, 3:03 AM	★	★

Diff 33921

View Options

src/extractors/popplerextractor.cpp