Fix DB inconsistency due to some docterms appearing with uppercase symbols
Closed, Public

Authored by poboiko on Mar 9 2017, 11:30 PM.

Details

Summary

I noticed that on some PDF files, "balooshow -x file.pdf" segfaulted. The backtrace showed that it crashed because of a term consisting of a single "X" (see line 201). Moreover, the file actually had a number of terms containing uppercase characters, which should never occur: all search terms are lowercase, and uppercase is reserved for metadata.
Further investigation showed that the PDF file (after extraction) contained exotic Unicode symbols (e.g. "𝐻𝑒𝑑𝑔𝑒", written with mathematical italic letters). toLower() left that string unchanged, but normalization then turned it into "Hedge", and the term went into the DB with its uppercase character intact.
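
The ordering problem can be reproduced outside Baloo. The following is a minimal Python sketch of the Unicode behaviour only, not Baloo's code (which is C++ and uses QString). The function names are hypothetical, and NFKD is an assumption: this summary does not say which normalization form Baloo applies. The second function shows one possible way to avoid the problem, which is not necessarily what this patch does.

```python
import unicodedata

# "Hedge" written with Mathematical Italic letters:
# U+1D43B (H), U+1D452 (e), U+1D451 (d), U+1D454 (g), U+1D452 (e).
RAW = "\U0001D43B\U0001D452\U0001D451\U0001D454\U0001D452"


def lower_then_normalize(text: str) -> str:
    """Problematic order: lowercase first, then normalize (hypothetical helper)."""
    # Mathematical Italic letters have no case mapping, so lower() is a no-op here.
    lowered = text.lower()
    # Compatibility decomposition maps them to plain ASCII letters,
    # which reintroduces an uppercase "H".
    return unicodedata.normalize("NFKD", lowered)


def normalize_then_lower(text: str) -> str:
    """One possible fix: normalize first, then lowercase (hypothetical helper)."""
    return unicodedata.normalize("NFKD", text).lower()


if __name__ == "__main__":
    print(RAW.lower() == RAW)          # True: lowercasing changes nothing
    print(lower_then_normalize(RAW))   # Hedge
    print(normalize_then_lower(RAW))   # hedge
```

Lowercasing after normalization (or on both sides of it) ensures that any uppercase letters produced by the decomposition are folded before the term is stored.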

Test Plan

I've tested it on the affected file; "balooshow -x" no longer crashes and its output no longer contains uppercase terms.

An additional check for this problematic case could probably be added to the "balooctl checkDb" command.
I can prepare a separate patch if necessary.

Diff Detail

Repository
R293 Baloo
poboiko created this revision. Mar 9 2017, 11:30 PM
poboiko added reviewers: pinakahuja, vhanda.
vhanda accepted this revision. Mar 19 2017, 11:02 AM

This is awesome. Good work.

Ship it! (If you don't have commit access, please ask for it, you can add me as a reference)

(Optionally, one could even add a unit test)

This revision is now accepted and ready to land. Mar 19 2017, 11:02 AM
mwolff accepted this revision. Mar 19 2017, 12:59 PM
mwolff added a subscriber: mwolff.

Do you have commit rights? Otherwise one of us can commit this for you.

This revision was automatically updated to reflect the committed changes.