I noticed that "balooshow -x file.pdf" segfaults on some PDF files. The backtrace showed that the crash was caused by a term consisting of a single "X" (see line 201). Moreover, the file had a number of terms containing uppercase characters, which should never occur: all search terms are lowercase, and uppercase is reserved for metadata.
Further investigation showed that the text extracted from the PDF contained exotic Unicode characters (e.g. "𝐻𝑒𝑑𝑔𝑒"). Calling toLower() left that string unchanged, but the subsequent normalization turned it into "Hedge", so the term went into the DB with its uppercase character intact.
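The ordering problem can be reproduced outside Baloo. The sketch below is only an illustration in Python, not the Baloo code: the function names are hypothetical, the sample string is assumed to be "Hedge" written in mathematical italic letters, and NFKD is assumed as the normalization form.

```python
import unicodedata

# Assumed sample: "Hedge" spelled with Mathematical Italic letters
# (U+1D43B, U+1D452, U+1D451, U+1D454, U+1D452).
STYLED = "\U0001D43B\U0001D452\U0001D451\U0001D454\U0001D452"


def lower_then_normalize(text: str) -> str:
    """Problematic order (hypothetical helper): lowercase first, normalize second.

    The styled capital has no lowercase mapping, so lower() leaves it alone;
    normalization then decomposes it into a plain ASCII capital.
    """
    return unicodedata.normalize("NFKD", text.lower())


def normalize_then_lower(text: str) -> str:
    """Fixed order (hypothetical helper): normalize first, lowercase second."""
    return unicodedata.normalize("NFKD", text).lower()


if __name__ == "__main__":
    print(lower_then_normalize(STYLED))
    print(normalize_then_lower(STYLED))
```

With the normalization applied first, the compatibility decomposition produces plain letters before lowercasing runs, so no uppercase character can slip into a search term.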
Details
- Reviewers: pinakahuja, mwolff, vhanda
- Commits: R293:ca4028aed27b: Fixed normalization/toLower order
I've tested it on the affected file; "balooshow -x" no longer crashes and its output no longer contains uppercase terms.
An additional check for this problematic case could probably be added to the "balooctl checkDb" command.
I can prepare a separate patch, if necessary.
Diff Detail
- Repository: R293 Baloo
- Lint: Automatic diff as part of commit; lint not applicable.
- Unit: Automatic diff as part of commit; unit tests not applicable.
Comment Actions
This is awesome. Good work.
Ship it! (If you don't have commit access, please ask for it, you can add me as a reference)
(Optionally, one could even add a unit test)