Fuzzy filename search for Baloo
Needs ReviewPublic

Authored by michaeleden on Jun 26 2018, 9:49 PM.

Details

Reviewers
vhanda
Group Reviewers
Baloo
Summary

This is a proof of concept fuzzy search (only for filenames) based on bigrams. It's still not integrated into baloo or optimized, but I'm creating a diff to ask if I'm on the right track here or not. What's missing:

  1. Integration into baloo as a search method
  2. Ranking of fuzzy results
  3. Journal of indexed files (to effeciently see if a file's been indexed yet)
  4. Hook into baloo_file to gradually index files

Let me know what else I need or if you'd rather have a different fuzzy matching algorithm.

Diff Detail

Repository
R293 Baloo
Branch
feature/fuzzy-search
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 405
Build 405: arc lint + arc unit
michaeleden created this revision.Jun 26 2018, 9:49 PM
Restricted Application added projects: Frameworks, Baloo. · View Herald TranscriptJun 26 2018, 9:49 PM
Restricted Application added subscribers: Baloo, kde-frameworks-devel. · View Herald Transcript
michaeleden requested review of this revision.Jun 26 2018, 9:49 PM
bruns added a subscriber: bruns.Jun 26 2018, 10:01 PM

I would very much prefer a test driven approach here:

  • Write down in english prosa what you want to match
  • Create some test cases
  • Add implementation classes

Up to this point, you don't need any real database, QMap or std::map will do.
Decoupling this from the database has the benefit it is much easier to understand and much easier to debug.

@bruns decoupling makes sense, but eventually there will be a need for a db right?
And I will definitely make those test cases, but I may ask for some help on how they're done in baloo.

bruns added a comment.EditedJun 26 2018, 10:22 PM

@bruns decoupling makes sense, but eventually there will be a need for a db right?
And I will definitely make those test cases, but I may ask for some help on how they're done in baloo.

Probably there will be a DB, but currently I am quite unsure what should be stored. Starting with the test cases will definitely help to answer this question.

For the test cases, baloo uses QTest. autotests/lib/advancedqueryparsertest.cpp may serve as a starting point.

michaeleden updated this revision to Diff 36811.EditedJun 28 2018, 7:24 AM

@bruns I added some tests and a class that implements the algorithm (without a db). Let me know what you think!

@bruns should I continue with this implementation and add dbs and such? And do the tests make sense?

bruns added a comment.Jul 14 2018, 3:52 PM

I think this design has several problems:

  1. adding or renaming documents becomes very write heavy
    • consider adding e.g. foobar.png - this will add 7 bigrams. Each bigram has an associated list of matching documents. The document list is sorted. This means we have 7 read-modify-write operations.
  1. Lots of additional data
    • to be fast, we have to keep this data in memory. We also have to fetch it from disk at startup.
  1. It ties the searching algorithm to the data structure
  1. The searching may become inefficient when the data set becomes large
    • The lookup of each bigram is fast
    • You have to lookup several bigrams when the search term becomes longer
    • You have to evaluate all result sets and combine them in some way
autotests/unit/engine/fuzzysearchtest.cpp
63

The test cases above are good in general, as these test a small aspect.
Improvement:
These are not different test cases, but iterations of the same test. This calls for data driven testing, see:
http://doc.qt.io/qt-5/qttestlib-tutorial2-example.html

69

I think this test should be written in a different way:

  • check if the map has the correct size
  • check if each entry has the correct docId
  • check the terms:
for (const auto& feat : { "no", "ot", "te", "es"} ) {
    QCOMPARE(exported[feat].size(), 1); // one document
    QCOMPARE(exported[feat][0].wid, 0); // first word
    QCOMPARE(exported[feat][0].len, 5); // strlen("notes")
}
...
QCOMPARE(exported["md"][0].wid, 3);
QCOMPARE(exported["md"][0].len, 2);
119

You should have a separate test case here, or even 2 - you should test if the merging works as expected:

  • for same feats from different documents ("notes")
  • for same feat from one documents ("not_es_", "wedn_es_day", i.e. "es")
129

If you create a list of the files, you can use a loop for the insertion here.

QList<QStringList> files = {
    {"notes", "april8", "2018", "md"},
    {"notes", "wednesday", "04092018", "md"},
...
    {"LMC2200"}
}
...
for (size_t i = 0; i < files.size(); i++) {
  db.unite(FuzzySearch::features(i, files[i]));
}
131

This function itself calls for a unit test - for a given feature, return the matching documents
Also, either capture the list, or return it (then, by value, not by reference), but don't do both. I strongly prefer return by value here.

143

If you create a QSet<QStringlist> foundFiles = files[result[0]];, you can compare it with QSet<QStringList>({{"notes", "wednesday", "04092018", "md"}});
This way, it becomes obvious what you expect as output here.
To make it more obvious what the QStringList refers to, you can use using FileNameTerms = QStringList;

src/engine/CMakeLists.txt
18

Please remove the db from the patch, this should be a separate patch, after the fuzzy matcher itself is paved out.

src/engine/database.cpp
25

later ...

108

later ...

src/engine/fuzzysearch.cpp
80

You are matching for document *and* term here, I don't think thats what you want.