Drop duplicate results from contact completion to return
more relevant results. This is still limited by the indexing
side as we are unable to deduplicate easily based on the email
address itself (or merge the results in some clever way).
Details
- Reviewers: dfaure
- Group Reviewers: KDE PIM
- Commits: R42:674dfa14dedc: Discard duplicate results during contact completion
Diff Detail
- Repository: R42 Akonadi Search
- Lint: Automatic diff as part of commit; lint not applicable.
- Unit: Automatic diff as part of commit; unit tests not applicable.
Works great. Because the number of matches is limited by m_limit, this actually returns more useful contacts than before. For the user, it's not just about deduplication (libkdepim deduplicates on top anyway).
One improvement would be to prefer matches with full name over matches without name.
I type "vkrau" and it says:
12:19:15.163 kmail2(16923/16923) org.kde.pim.akonadi_search_pim: processEnquire Match: "vkrause@kde.org" (50%), docid 7318456
12:19:15.163 kmail2(16923/16923) org.kde.pim.akonadi_search_pim: processEnquire Skipped duplicate match "vkrause@kde.org" (50%) docid 13800517
12:19:15.163 kmail2(16923/16923) org.kde.pim.akonadi_search_pim: processEnquire Match: "Volker Krause <vkrause@kde.org>" (47%), docid 1292769881
12:19:15.163 kmail2(16923/16923) org.kde.pim.akonadi_search_pim: processEnquire Skipped duplicate match "Volker Krause <vkrause@kde.org>" (47%) docid 3129445885
and I end up with just vkrause@kde.org in the completion, with no name. Ah, but this code returns both matches; it is libkdepim that deduplicates on top, and does so wrongly.
So indeed the question is whether this code should do full deduplication (like your TODO says), or if that part is for libkdepim (which should then be improved).
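To make the client-side option concrete, here is a rough sketch of the parsing a caller would need before it could key deduplication on the address alone. It is not libkdepim's actual code: `splitContact` is a hypothetical helper, it uses only the standard library rather than Qt types, and it deliberately ignores quoted display names, comments, and groups.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical helper, not part of libkdepim or Akonadi Search.
// Splits "Name <address>" into (name, address); a bare address yields an empty name.
// Deliberately simplistic: no quoted names, comments, or groups.
std::pair<std::string, std::string> splitContact(const std::string &contact)
{
    // Strips leading and trailing spaces and tabs.
    const auto trim = [](const std::string &s) {
        const auto first = s.find_first_not_of(" \t");
        if (first == std::string::npos) {
            return std::string();
        }
        const auto last = s.find_last_not_of(" \t");
        return s.substr(first, last - first + 1);
    };

    const auto open = contact.rfind('<');
    const auto close = contact.rfind('>');
    if (open == std::string::npos || close == std::string::npos || close < open) {
        // No angle brackets: treat the whole string as the address.
        return {std::string(), trim(contact)};
    }
    return {trim(contact.substr(0, open)),
            trim(contact.substr(open + 1, close - open - 1))};
}
```

With the matches from the log above, both "vkrause@kde.org" and "Volker Krause <vkrause@kde.org>" would map to the same address, so a client could treat them as one contact and keep the variant that carries a name.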
This also makes me wonder if the limit here is too low. I never realized I wasn't getting all matches but just a subset.
Thanks!
I think any filtering/deduplication should happen in Akonadi Search here - since we are able to store structured data (e.g. split the name and the address into two different fields), Xapian can perform clever deduplication at query time, rather than client code (libkdepim) having to do expensive address parsing for each result.
We can even return the data structured, like a tuple (name, address, relevance) to make it easier for client code to aggregate the results.
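As a rough illustration of how structured results could be aggregated, the sketch below keeps one entry per address and prefers an entry that has a name over one that does not. Everything in it is an assumption made for illustration: `ContactMatch` and `aggregateMatches` are hypothetical names rather than existing Akonadi Search API, the standard library stands in for Qt types, and the tie-breaking rules (case-insensitive address key, highest relevance kept) are just one possible policy.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical structured result; not an existing Akonadi Search type.
struct ContactMatch {
    std::string name;     // may be empty when only the address is known
    std::string address;  // e-mail address, used as the deduplication key
    int relevance;        // match percentage reported by the search backend
};

// Lower-cases an address so that key comparisons ignore case (an assumed policy).
static std::string normalizedAddress(const std::string &address)
{
    std::string key = address;
    std::transform(key.begin(), key.end(), key.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return key;
}

// Keeps one entry per address. A named entry replaces an unnamed one (keeping the
// higher relevance of the two); otherwise the entry with higher relevance wins.
// The output preserves the order in which each address first appeared.
std::vector<ContactMatch> aggregateMatches(const std::vector<ContactMatch> &matches)
{
    std::vector<ContactMatch> result;
    std::unordered_map<std::string, std::size_t> indexByAddress;

    for (const ContactMatch &match : matches) {
        const std::string key = normalizedAddress(match.address);
        const auto it = indexByAddress.find(key);
        if (it == indexByAddress.end()) {
            indexByAddress.emplace(key, result.size());
            result.push_back(match);
            continue;
        }

        ContactMatch &kept = result[it->second];
        const bool gainsName = kept.name.empty() && !match.name.empty();
        const bool sameNameStatus = kept.name.empty() == match.name.empty();
        if (gainsName) {
            const int best = std::max(kept.relevance, match.relevance);
            kept = match;
            kept.relevance = best;
        } else if (sameNameStatus && match.relevance > kept.relevance) {
            kept = match;
        }
    }
    return result;
}
```

Fed the four matches from the log above as (name, address, relevance) tuples, this collapses them into a single entry for vkrause@kde.org that carries the name "Volker Krause", which is the behaviour requested earlier in this review. Because the name and address arrive as separate fields, no address parsing is needed on the client side.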
Also, the code should be made asynchronous so that we can query many more results and leave it up to the client to drop what it doesn't need.