[KFileMetaData] Add extractor for DSC conforming (Encapsulated) Postscript
ClosedPublic

Authored by bruns on Oct 28 2018, 9:37 PM.

Details

Summary

Postscript files currently fall back to the plaintext extractor due to
mimetype inheritance. This adds lots of garbage to the index and misses
any useful data in the DSC comments.

Test Plan

make && ctest

Diff Detail

Repository
R286 KFileMetaData
Branch
postscript_dsc
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 4301
Build 4319: arc lint + arc unit
bruns created this revision.Oct 28 2018, 9:37 PM
Restricted Application added projects: Frameworks, Baloo. · View Herald TranscriptOct 28 2018, 9:37 PM
Restricted Application added a subscriber: kde-frameworks-devel. · View Herald Transcript
bruns requested review of this revision.Oct 28 2018, 9:37 PM
bruns updated this revision to Diff 44408.Oct 29 2018, 9:11 AM

Replace QDir::Separator with "/"

pino requested changes to this revision.Oct 29 2018, 9:17 AM
pino added a subscriber: pino.

Ugh no manual parsing of PS files -- please use libspectre.

This revision now requires changes to proceed.Oct 29 2018, 9:17 AM
bruns added a comment.EditedOct 29 2018, 9:18 AM
In D16498#350286, @pino wrote:

Ugh no manual parsing of PS files -- please use libspectre.

This is not Postscript parsing, but DSC parsing - read the specification to understand the difference!

http://www.lprng.com/RESOURCES/ADOBE/5001.DSC_Spec.pdf

bruns added a comment.Oct 29 2018, 9:56 AM

@pino - please remove your change request, if you have not read the code at all ...

pino added a comment.Oct 29 2018, 10:59 AM
In D16498#350286, @pino wrote:

Ugh no manual parsing of PS files -- please use libspectre.

This is not Postscript parsing, but DSC parsing - read the specification to understand the difference!

http://www.lprng.com/RESOURCES/ADOBE/5001.DSC_Spec.pdf

An EPS file is also a PostScript file, and indeed ghostscript opens it perfectly fine.
Because of the above, libspectre perfectly handles EPS files, and the API already provides all the information that the current DscExtractor provides as well.

@pino - please remove your change request, if you have not read the code at all ...

Please tone down your attitude to something more respectful, thanks.

In D16498#350422, @pino wrote:
In D16498#350286, @pino wrote:

Ugh no manual parsing of PS files -- please use libspectre.

This is not Postscript parsing, but DSC parsing - read the specification to understand the difference!

http://www.lprng.com/RESOURCES/ADOBE/5001.DSC_Spec.pdf

An EPS file is also a PostScript file, and indeed ghostscript opens it perfectly

  • also **

Because of the above, libspectre perfectly handles EPS files, and the API already provides all the information that the current DscExtractor provides as well.

Its an additional dependency (libpectre and libgs), and also imposes a security risk - see e.g. https://nvd.nist.gov/vuln/detail/CVE-2018-11645

@pino - please remove your change request, if you have not read the code at all ...

Please tone down your attitude to something more respectful, thanks.

You started with "Ugh ..." - you comment lacks any respect ...

pino added a comment.Oct 29 2018, 12:02 PM
In D16498#350422, @pino wrote:
In D16498#350286, @pino wrote:

Ugh no manual parsing of PS files -- please use libspectre.

This is not Postscript parsing, but DSC parsing - read the specification to understand the difference!

http://www.lprng.com/RESOURCES/ADOBE/5001.DSC_Spec.pdf

An EPS file is also a PostScript file, and indeed ghostscript opens it perfectly

  • also **

Sure: that is a reason more to handle it like that.

Because of the above, libspectre perfectly handles EPS files, and the API already provides all the information that the current DscExtractor provides as well.

Its an additional dependency (libpectre and libgs),

  • libspectre is a C library, and uses only libgs
  • the rest of the ghostscript dependencies are already used in one way or another in an average KDE installation
  • okular & cantor already use libspectre for a very long time (okular a decade)

and also imposes a security risk - see e.g. https://nvd.nist.gov/vuln/detail/CVE-2018-11645

There are way worse CVEs in lower components of a modern Linux stack (say in the Linux kernel).
Also, according to that, should we drop the support in okular for PostScript files, and the support in cantor for EPS images?

@pino - please remove your change request, if you have not read the code at all ...

Please tone down your attitude to something more respectful, thanks.

You started with "Ugh ..." - you comment lacks any respect ...

Certainly this was not the case, sorry if it was not intended. But even then, that was geared towards the code, and that does not remotely justify attacking the person with "you did not read the code" (which is wrong).

bruns added a comment.Oct 29 2018, 1:46 PM
In D16498#350460, @pino wrote:
In D16498#350422, @pino wrote:
In D16498#350286, @pino wrote:

Ugh no manual parsing of PS files -- please use libspectre.

This is not Postscript parsing, but DSC parsing - read the specification to understand the difference!

http://www.lprng.com/RESOURCES/ADOBE/5001.DSC_Spec.pdf

An EPS file is also a PostScript file, and indeed ghostscript opens it perfectly

  • also **

Sure: that is a reason more to handle it like that.

Because of the above, libspectre perfectly handles EPS files, and the API already provides all the information that the current DscExtractor provides as well.

Its an additional dependency (libpectre and libgs),

  • libspectre is a C library, and uses only libgs
  • the rest of the ghostscript dependencies are already used in one way or another in an average KDE installation
  • okular & cantor already use libspectre for a very long time (okular a decade)

It may be an option to use libspectre as basis for an additional extractor, but this should not be the default, see below.

This is meant to be as simple as possible - the extractor itself is hardly more than 30 lines of code. There is definitely a use case for this extractor.

and also imposes a security risk - see e.g. https://nvd.nist.gov/vuln/detail/CVE-2018-11645

There are way worse CVEs in lower components of a modern Linux stack (say in the Linux kernel).
Also, according to that, should we drop the support in okular for PostScript files, and the support in cantor for EPS images?

There is a difference between opening a file consciously and letting it happen by chance. The extractor is run when a file is hovered by the mouse cursor or by baloo. It will be executed without the user being aware of it.

IMHO the ghostscript support should be disabled (runtime) by default in Okular, until it is run completely sandboxed.

@pino - please remove your change request, if you have not read the code at all ...

Please tone down your attitude to something more respectful, thanks.

You started with "Ugh ..." - you comment lacks any respect ...

Certainly this was not the case, sorry if it was not intended. But even then, that was geared towards the code, and that does not remotely justify attacking the person with "you did not read the code" (which is wrong).

The code clearly states it targets (E)PS DSC. A full blown PS interpreter may be able to extract more info from the file, but not without the mentioned drawbacks. Blankly stating using libspectre is better and should be used, without weighting pros and cons, does not give the impression (to me) you have evaluated it carefully.

Apology accepted.

bruns added a comment.Nov 1 2018, 6:18 PM

@pino

Please answer why you consider running a full blown postscript interpreter in an uncontrolled environment (no sandboxing, runs without user interaction) is better than 20 code lines of trivial text parsing.

pino added a comment.Nov 1 2018, 6:31 PM

Please answer why you consider running a full blown postscript interpreter in an uncontrolled environment (no sandboxing, runs without user interaction) is better than 20 code lines of trivial text parsing.

Please be patient, I've got other things that currently take my time, and I need to properly get some data to satisfty your questions.

In the meanwhile, two suggestions:

  • please try to see also other people's point of point, and not just yours; if I suggested something, then it's because, according to my knowledge/experience (which includes also maintaining okular in the past, FYI), I deem it the optimal way
  • when asking for people's opinion, again, try to use a more friendly attitude; that will certainly help getting my attention faster, rather than feeling attacked because I'm expressing technical doubts on this solution
bruns added a comment.Nov 1 2018, 6:37 PM
In D16498#352323, @pino wrote:

Please answer why you consider running a full blown postscript interpreter in an uncontrolled environment (no sandboxing, runs without user interaction) is better than 20 code lines of trivial text parsing.

Please be patient, I've got other things that currently take my time, and I need to properly get some data to satisfty your questions.

Thats completely fine for me, if you need ore time, just say so.

In the meanwhile, two suggestions:

  • please try to see also other people's point of point, and not just yours; if I suggested something, then it's because, according to my knowledge/experience (which includes also maintaining okular in the past, FYI), I deem it the optimal way
  • when asking for people's opinion, again, try to use a more friendly attitude; that will certainly help getting my attention faster, rather than feeling attacked because I'm expressing technical doubts on this solution

Not answering can also be considered impolite. I am trying to solve a real problem here. Postscript files currently pose a significant problem for baloo. I listed technical reasons why libspectre/gs is not a good solution.

bruns added a comment.Nov 8 2018, 9:52 PM

@pino - you have not answered for a week.

You have set this to "Needs Revision", which removes it from the "Needs Review" queue for everyone else.

If you wan't to see an extractor based on libspectre, thats fine, but then you have to write it.

bruns requested review of this revision.Nov 8 2018, 9:52 PM
bruns removed a reviewer: pino.
poboiko accepted this revision.Nov 14 2018, 3:50 PM

Apart from trivial comment, this looks fine. I've tested it on my setup (with bunch of (e)ps files), and randomly chosen files seems to be indexed nicely. It also reduced the size of the index by almost 50MB, because those are not indexed as plaintext anymore :)
Yet I would also vote for replacing it (eventually) with a full-featured extractor based on libspectre
(I'm not a security specialist in any way, but that CVE doesn't look too harmful, and from my point of view it's not worth to abandon full support of (E)PS because of it)

autotests/postscriptdscextractortest.cpp
27

This, and QDebug, seems to be not used anymore.

This revision is now accepted and ready to land.Nov 14 2018, 3:50 PM
This revision was automatically updated to reflect the committed changes.
poboiko added a comment.EditedNov 15 2018, 9:22 AM

It seems like you've pushed something that was not intended to be pushed (XML extractor parts)

bruns added a comment.Nov 15 2018, 2:48 PM

It seems like you've pushed something that was not intended to be pushed (XML extractor parts)

No. Phabricator is just stupid.