[KFileMetaData] Add extractor for DSC conforming (Encapsulated) Postscript
ClosedPublic
Actions

Authored by bruns on Oct 28 2018, 9:37 PM.

Details

Reviewers

astippich
ngraham
poboiko

Group Reviewers

Frameworks
Baloo

Commits

R286:205ed84ee213: [KFileMetaData] Add extractor for DSC conforming (Encapsulated) Postscript

Summary

Postscript files currently fall back to the plaintext extractor due to
mimetype inheritance. This adds lots of garbage to the index and misses
any useful data in the DSC comments.

Test Plan

make && ctest

Diff Detail

Repository

R286 KFileMetaData

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

bruns created this revision.Oct 28 2018, 9:37 PM

Restricted Application added projects: Frameworks, Baloo. · View Herald TranscriptOct 28 2018, 9:37 PM

Restricted Application added a subscriber: kde-frameworks-devel. · View Herald Transcript

bruns requested review of this revision.Oct 28 2018, 9:37 PM

Harbormaster completed remote builds in B4289: Diff 44388.Oct 28 2018, 9:37 PM

Replace QDir::Separator with "/"

Harbormaster completed remote builds in B4301: Diff 44408.Oct 29 2018, 9:11 AM

Ugh no manual parsing of PS files -- please use libspectre.

This revision now requires changes to proceed.Oct 29 2018, 9:17 AM

In D16498#350286, @pino wrote:

Ugh no manual parsing of PS files -- please use libspectre.

This is not Postscript parsing, but DSC parsing - read the specification to understand the difference!

http://www.lprng.com/RESOURCES/ADOBE/5001.DSC_Spec.pdf

@pino - please remove your change request, if you have not read the code at all ...

In D16498#350289, @bruns wrote:

In D16498#350286, @pino wrote:

Ugh no manual parsing of PS files -- please use libspectre.

This is not Postscript parsing, but DSC parsing - read the specification to understand the difference!

http://www.lprng.com/RESOURCES/ADOBE/5001.DSC_Spec.pdf

An EPS file is also a PostScript file, and indeed ghostscript opens it perfectly fine.
Because of the above, libspectre perfectly handles EPS files, and the API already provides all the information that the current DscExtractor provides as well.

In D16498#350325, @bruns wrote:

@pino - please remove your change request, if you have not read the code at all ...

Please tone down your attitude to something more respectful, thanks.

In D16498#350422, @pino wrote:

In D16498#350289, @bruns wrote:

In D16498#350286, @pino wrote:

Ugh no manual parsing of PS files -- please use libspectre.

This is not Postscript parsing, but DSC parsing - read the specification to understand the difference!

http://www.lprng.com/RESOURCES/ADOBE/5001.DSC_Spec.pdf

An EPS file is also a PostScript file, and indeed ghostscript opens it perfectly

also **

Because of the above, libspectre perfectly handles EPS files, and the API already provides all the information that the current DscExtractor provides as well.

Its an additional dependency (libpectre and libgs), and also imposes a security risk - see e.g. https://nvd.nist.gov/vuln/detail/CVE-2018-11645

In D16498#350325, @bruns wrote:

@pino - please remove your change request, if you have not read the code at all ...

Please tone down your attitude to something more respectful, thanks.

You started with "Ugh ..." - you comment lacks any respect ...

In D16498#350426, @bruns wrote:

In D16498#350422, @pino wrote:

In D16498#350289, @bruns wrote:

In D16498#350286, @pino wrote:

Ugh no manual parsing of PS files -- please use libspectre.

This is not Postscript parsing, but DSC parsing - read the specification to understand the difference!

http://www.lprng.com/RESOURCES/ADOBE/5001.DSC_Spec.pdf

An EPS file is also a PostScript file, and indeed ghostscript opens it perfectly

also **

Sure: that is a reason more to handle it like that.

Because of the above, libspectre perfectly handles EPS files, and the API already provides all the information that the current DscExtractor provides as well.

Its an additional dependency (libpectre and libgs),

libspectre is a C library, and uses only libgs
the rest of the ghostscript dependencies are already used in one way or another in an average KDE installation
okular & cantor already use libspectre for a very long time (okular a decade)

and also imposes a security risk - see e.g. https://nvd.nist.gov/vuln/detail/CVE-2018-11645

There are way worse CVEs in lower components of a modern Linux stack (say in the Linux kernel).
Also, according to that, should we drop the support in okular for PostScript files, and the support in cantor for EPS images?

In D16498#350325, @bruns wrote:

@pino - please remove your change request, if you have not read the code at all ...

Please tone down your attitude to something more respectful, thanks.

You started with "Ugh ..." - you comment lacks any respect ...

Certainly this was not the case, sorry if it was not intended. But even then, that was geared towards the code, and that does not remotely justify attacking the person with "you did not read the code" (which is wrong).

In D16498#350460, @pino wrote:

In D16498#350426, @bruns wrote:

In D16498#350422, @pino wrote:

In D16498#350289, @bruns wrote:

In D16498#350286, @pino wrote:

Ugh no manual parsing of PS files -- please use libspectre.

This is not Postscript parsing, but DSC parsing - read the specification to understand the difference!

http://www.lprng.com/RESOURCES/ADOBE/5001.DSC_Spec.pdf

An EPS file is also a PostScript file, and indeed ghostscript opens it perfectly

also **

Sure: that is a reason more to handle it like that.

Because of the above, libspectre perfectly handles EPS files, and the API already provides all the information that the current DscExtractor provides as well.

Its an additional dependency (libpectre and libgs),

libspectre is a C library, and uses only libgs

the rest of the ghostscript dependencies are already used in one way or another in an average KDE installation

okular & cantor already use libspectre for a very long time (okular a decade)

It may be an option to use libspectre as basis for an additional extractor, but this should not be the default, see below.

This is meant to be as simple as possible - the extractor itself is hardly more than 30 lines of code. There is definitely a use case for this extractor.

and also imposes a security risk - see e.g. https://nvd.nist.gov/vuln/detail/CVE-2018-11645

There are way worse CVEs in lower components of a modern Linux stack (say in the Linux kernel).
Also, according to that, should we drop the support in okular for PostScript files, and the support in cantor for EPS images?

There is a difference between opening a file consciously and letting it happen by chance. The extractor is run when a file is hovered by the mouse cursor or by baloo. It will be executed without the user being aware of it.

IMHO the ghostscript support should be disabled (runtime) by default in Okular, until it is run completely sandboxed.

In D16498#350325, @bruns wrote:

@pino - please remove your change request, if you have not read the code at all ...

Please tone down your attitude to something more respectful, thanks.

You started with "Ugh ..." - you comment lacks any respect ...

Certainly this was not the case, sorry if it was not intended. But even then, that was geared towards the code, and that does not remotely justify attacking the person with "you did not read the code" (which is wrong).

The code clearly states it targets (E)PS DSC. A full blown PS interpreter may be able to extract more info from the file, but not without the mentioned drawbacks. Blankly stating using libspectre is better and should be used, without weighting pros and cons, does not give the impression (to me) you have evaluated it carefully.

Apology accepted.

@pino

Please answer why you consider running a full blown postscript interpreter in an uncontrolled environment (no sandboxing, runs without user interaction) is better than 20 code lines of trivial text parsing.

In D16498#352314, @bruns wrote:

Please answer why you consider running a full blown postscript interpreter in an uncontrolled environment (no sandboxing, runs without user interaction) is better than 20 code lines of trivial text parsing.

Please be patient, I've got other things that currently take my time, and I need to properly get some data to satisfty your questions.

In the meanwhile, two suggestions:

please try to see also other people's point of point, and not just yours; if I suggested something, then it's because, according to my knowledge/experience (which includes also maintaining okular in the past, FYI), I deem it the optimal way
when asking for people's opinion, again, try to use a more friendly attitude; that will certainly help getting my attention faster, rather than feeling attacked because I'm expressing technical doubts on this solution

In D16498#352323, @pino wrote:

In D16498#352314, @bruns wrote:

Please answer why you consider running a full blown postscript interpreter in an uncontrolled environment (no sandboxing, runs without user interaction) is better than 20 code lines of trivial text parsing.

Please be patient, I've got other things that currently take my time, and I need to properly get some data to satisfty your questions.

Thats completely fine for me, if you need ore time, just say so.

In the meanwhile, two suggestions:

please try to see also other people's point of point, and not just yours; if I suggested something, then it's because, according to my knowledge/experience (which includes also maintaining okular in the past, FYI), I deem it the optimal way

when asking for people's opinion, again, try to use a more friendly attitude; that will certainly help getting my attention faster, rather than feeling attacked because I'm expressing technical doubts on this solution

Not answering can also be considered impolite. I am trying to solve a real problem here. Postscript files currently pose a significant problem for baloo. I listed technical reasons why libspectre/gs is not a good solution.

@pino - you have not answered for a week.

You have set this to "Needs Revision", which removes it from the "Needs Review" queue for everyone else.

If you wan't to see an extractor based on libspectre, thats fine, but then you have to write it.

bruns requested review of this revision.Nov 8 2018, 9:52 PM

bruns removed a reviewer: pino.

Apart from trivial comment, this looks fine. I've tested it on my setup (with bunch of (e)ps files), and randomly chosen files seems to be indexed nicely. It also reduced the size of the index by almost 50MB, because those are not indexed as plaintext anymore :)
Yet I would also vote for replacing it (eventually) with a full-featured extractor based on libspectre
(I'm not a security specialist in any way, but that CVE doesn't look too harmful, and from my point of view it's not worth to abandon full support of (E)PS because of it)

autotests/postscriptdscextractortest.cpp
28	This, and QDebug, seems to be not used anymore.

This revision is now accepted and ready to land.Nov 14 2018, 3:50 PM

Closed by commit R286:205ed84ee213: [KFileMetaData] Add extractor for DSC conforming (Encapsulated) Postscript (authored by bruns). · Explain WhyNov 15 2018, 12:09 AM

This revision was automatically updated to reflect the committed changes.

It seems like you've pushed something that was not intended to be pushed (XML extractor parts)

In D16498#359676, @poboiko wrote:

It seems like you've pushed something that was not intended to be pushed (XML extractor parts)

No. Phabricator is just stupid.

Revision Contents
Changeset List

		Path
M		autotests/CMakeLists.txt (14 lines)
A	M	autotests/postscriptdscextractortest.h (36 lines)
A	M	autotests/postscriptdscextractortest.cpp (73 lines)
A	M	autotests/samplefiles/test.eps (21 lines)
A	M	autotests/samplefiles/test.ps (19 lines)
M		src/extractors/CMakeLists.txt (18 lines)
A	M	src/extractors/postscriptdscextractor.h (45 lines)
A	M	src/extractors/postscriptdscextractor.cpp (121 lines)

Diff	ID	Base	Description	Created	Lint	Unit
Base			Base
Diff 1	44388	4582ba5		Oct 28 2018, 9:37 PM	★	★
Diff 2	44408	4582ba5	Replace QDir::Separator with "/"	Oct 29 2018, 9:11 AM	★	★
Diff 3	45492	7cc59eb	R286:205ed84ee21351554abc5bf0476c02e3987f0081	Nov 15 2018, 12:09 AM	★	★

Commit	Tree	Parents	Author	Summary	Date
35ac604641f3	9cb4e731019f	4582ba5f3d7a	Stefan Brüns	[KFileMetaData] Add extractor for DSC conforming (Encapsulated) Postscript (Show More…)	Oct 28 2018, 9:36 PM