Make Indexing Great Again
Open, Needs Triage, Public

Description

This is a long-term plan for improving indexing and searching in Kontact. The goal is to make the index faster and more reliable and to reduce the scope of what we index, while making sure we index it correctly and in a way that lets us give the most meaningful results for users' queries.


  • Phase 1: The Indexing Infrastructure

Indexing happens in resources and clients upon entry, and the resulting documents are sent to the server, which writes them to the Xapian store. This way, everything is indexed by the time it becomes visible to clients, and we can start relying on the index as an authoritative data source.

  • Merge our Xapian wrapper classes into the main library
  • Move indexing & search into Stores so that the code lives in the same place
  • Port Email/Contact/...Query classes to Akonadi::SearchQuery, which is passed to search stores that then translate it to a Xapian query and execute it
  • Adjust the Store API so it can return a Xapian::Document with the indexed data that can be serialized
  • Extend the protocol to allow sending serialized Xapian::Documents as part of ItemCreateJob/ItemModifyJob
  • Extend SearchManager in the Server to write the documents directly to the relevant Xapian stores
  • Extend Serializer plugins or the ItemCreateJob/ItemModifyJob to perform the indexing before sending the command
    • Open question: how to handle partial updates, like a flag change, when the client does not have the full payload?
  • Implement migration code to re-index everything
  • Get rid of the Indexing Agent
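
The query translation step in the list above (a generic search query handed to a store, which turns it into a backend query) could be sketched roughly as follows. This is a standalone illustration under stated assumptions, not Akonadi or Xapian code: `Term`, `Query` and `toBackendQuery` are hypothetical names, and a plain string stands in for the Xapian query object a real store would build.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for a search term and a query; not the Akonadi API.
struct Term {
    std::string field;  // e.g. "subject"
    std::string value;  // e.g. "invoice"
    bool negated;       // true if the term must NOT match
};

struct Query {
    std::vector<Term> terms;  // all terms are AND-ed together in this sketch
};

// Translate the abstract query into a backend-specific query string.
// A real store would construct a Xapian query object here instead of text.
std::string toBackendQuery(const Query &query)
{
    std::string out;
    for (const Term &term : query.terms) {
        assert(!term.field.empty());  // every term must name a field
        if (!out.empty()) {
            out += " AND ";
        }
        if (term.negated) {
            out += "NOT ";
        }
        out += term.field + ":" + term.value;
    }
    return out;
}
```

Keeping the translation inside the store means clients only ever see the generic query type, so the backend could change without touching them.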

  • Phase 2: The Indexed Content

Emails:

  • Index only a few main headers; ignore others such as "Received", "X-*", etc.
    • Subject
    • From
    • To
    • Cc
    • Date
    • Collection
  • Index attachment names
  • Parse emails properly: look for nested body parts (inline-forwarded emails) etc., and handle HTML emails
  • Index message flags
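
The header restriction above amounts to an allowlist. Below is a minimal sketch of that idea, assuming headers have already been parsed into name/value pairs; `indexableHeaders` and `lowered` are hypothetical helpers, and a `std::map` is a simplification, since real messages can repeat a header. "Collection" is not a mail header, so it is left out here.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <map>
#include <set>
#include <string>

// Lower-case a header name so that matching is case-insensitive.
std::string lowered(std::string name)
{
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return name;
}

// Keep only the allowlisted headers; everything else ("Received", "X-*", ...)
// is dropped before indexing. Keys of the result are lower-cased.
std::map<std::string, std::string> indexableHeaders(
    const std::map<std::string, std::string> &headers)
{
    static const std::set<std::string> allowed = {"subject", "from", "to", "cc", "date"};
    std::map<std::string, std::string> result;
    for (const auto &header : headers) {
        const std::string name = lowered(header.first);
        if (allowed.count(name) > 0) {
            result[name] = header.second;
        }
    }
    assert(result.size() <= allowed.size());  // at most one entry per allowed name
    return result;
}
```

An allowlist is used rather than a blocklist so that unknown or vendor-specific headers are ignored by default.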

Contacts

  • Name
  • Email(s)
  • Nick name
  • Birthday
  • Anniversary
  • UID???
  • Collection

Contact Groups

  • Name
  • Emails and names of members
  • Collections

Events

  • Summary
  • Description
  • Organizer
  • Location
  • Attendees
  • Collection
  • Calculate all occurrences of an event and store only the months in which it occurs and the range of years within which it occurs. This will allow us to modify KOrganizer to query only for events from the currently viewed month
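
One possible shape for the occurrence data described in the last item is a 12-bit month mask plus a year range. The sketch below is only an illustration under that assumption: `YearMonth`, `OccurrenceSummary`, `summarize` and `mayOccurIn` are hypothetical names, the occurrences are assumed to be already expanded into a finite list, and unbounded recurrences are not handled.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct YearMonth {
    int year;
    int month;  // 1..12
};

// Compact per-event summary that could be stored in the index.
struct OccurrenceSummary {
    std::uint16_t monthMask;  // bit (month - 1) is set if the event occurs in that month
    int firstYear;
    int lastYear;
};

OccurrenceSummary summarize(const std::vector<YearMonth> &occurrences)
{
    assert(!occurrences.empty());
    OccurrenceSummary s{0, occurrences.front().year, occurrences.front().year};
    for (const YearMonth &o : occurrences) {
        assert(o.month >= 1 && o.month <= 12);
        s.monthMask |= static_cast<std::uint16_t>(1u << (o.month - 1));
        if (o.year < s.firstYear) s.firstYear = o.year;
        if (o.year > s.lastYear) s.lastYear = o.year;
    }
    return s;
}

// Cheap pre-filter for "could this event appear in the viewed month?".
// It can return false positives (months and years are stored independently),
// so the caller still has to check the real occurrences of matching events.
bool mayOccurIn(const OccurrenceSummary &s, int year, int month)
{
    return year >= s.firstYear && year <= s.lastYear
        && (s.monthMask & (1u << (month - 1))) != 0;
}
```

Because months and years are stored separately, an event occurring in January 2017 and March 2018 would also match March 2017. That trade-off keeps the indexed data small while still ruling out most events for any given month.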

Notes

  • Summary
  • Body
  • Collection

  • Phase 3: Using the Index

Now that our infrastructure is fast, reliable and overall awesome, we can start making more use of it.

  • Replace ETMCalendar with a custom calendar that queries the index for events occurring within the currently displayed month and only then asks Akonadi to provide those events. This should drastically speed up loading of the calendar and reduce its memory footprint
    • Pre-fetch the previous and next month as an optimization to allow quick switching; drop months that are no longer visible to keep memory usage small
  • Replace/complement the Search dialog in KMail with a Search screen where one-time queries can be executed and that can display results in the main message list, maybe including some statistics and offering additional search/filter options. Look at what Thunderbird and Gmail offer here and how they do it.
  • TBD
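
The month pre-fetching idea from the calendar item above could be sketched as a three-month sliding window. This is a hedged illustration, not the planned ETMCalendar replacement: `MonthWindow`, `MonthKey` and `shifted` are hypothetical names, and the `loader` callback stands in for "query the index for the month, then fetch those events from Akonadi".

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using MonthKey = std::pair<int, int>;   // (year, month 1..12)
using Events = std::vector<std::string>;

// Move a month forwards or backwards, rolling over year boundaries.
MonthKey shifted(MonthKey month, int delta)
{
    const int index = month.first * 12 + (month.second - 1) + delta;
    return MonthKey(index / 12, index % 12 + 1);
}

class MonthWindow
{
public:
    explicit MonthWindow(std::function<Events(MonthKey)> loader)
        : m_loader(std::move(loader)) {}

    // Make `current` visible: keep or load previous, current and next month,
    // and drop every other month to keep memory usage small.
    void show(MonthKey current)
    {
        assert(current.second >= 1 && current.second <= 12);
        std::map<MonthKey, Events> kept;
        for (int delta = -1; delta <= 1; ++delta) {
            const MonthKey month = shifted(current, delta);
            auto it = m_months.find(month);
            if (it != m_months.end()) {
                kept[month] = std::move(it->second);  // already loaded, reuse
            } else {
                kept[month] = m_loader(month);        // not loaded yet, fetch
            }
        }
        m_months.swap(kept);  // months outside the window are discarded here
    }

    bool isLoaded(MonthKey month) const { return m_months.count(month) > 0; }

private:
    std::function<Events(MonthKey)> m_loader;
    std::map<MonthKey, Events> m_months;
};
```

Stepping to the next month then triggers only one new load, because two of the three months in the window are reused.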


Additional TODOs

  • Clean up and document Akonadi::ContactSearchTerm, Akonadi::EmailSearchTerm and Akonadi::IncidenceSearchTerm field enums
    • For example, Akonadi::EmailSearchTerm::Attachment says it searches inside of an attachment, but that's not implemented at all
  • Add Akonadi::SearchTerm::negated() that returns negated self, to allow constructs like searchQuery.addTerm(EmailSearchTerm(MessageStatus, MessageFlags::HasAttachment).negated())
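
The proposed negated() could behave roughly as in the sketch below: it returns a negated copy and leaves the original untouched, which is what makes it usable on a temporary inside an addTerm() call. `SearchTermSketch` is a hypothetical stand-in with string fields, not the real Akonadi::SearchTerm class.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical minimal term; the real Akonadi::SearchTerm carries more state.
class SearchTermSketch
{
public:
    SearchTermSketch(std::string key, std::string value)
        : m_key(std::move(key)), m_value(std::move(value)), m_negated(false) {}

    // Return a copy with the negation flipped; *this is not modified,
    // so the call can be chained on a temporary.
    SearchTermSketch negated() const
    {
        SearchTermSketch copy(*this);
        copy.m_negated = !m_negated;
        assert(copy.m_negated != m_negated);  // the copy always differs in negation
        return copy;
    }

    bool isNegated() const { return m_negated; }
    const std::string &key() const { return m_key; }
    const std::string &value() const { return m_value; }

private:
    std::string m_key;
    std::string m_value;
    bool m_negated;
};
```

Returning a copy rather than mutating in place also means that negating twice yields a term equivalent to the original.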
dvratil created this task. Sep 15 2017, 12:10 PM
dvratil updated the task description.
dvratil updated the task description. Oct 4 2017, 11:15 AM
dvratil updated the task description. Oct 4 2017, 9:57 PM
dvratil updated the task description. Feb 17 2018, 5:41 PM
dvratil updated the task description. Feb 17 2018, 5:43 PM

Considerations for re-indexing infrastructure:

  • Use the Akonadi Migration Agent?
  • Must be an interruptible background task
  • Version each store separately so we can re-index only events or only emails instead of having to re-index everything
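
Per-store versioning as suggested in the last point could be sketched as a comparison between the version each store is expected to have and the version recorded on disk. The function name, the store names and the integer version scheme are all assumptions made for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Return the names of the stores that must be re-indexed: those missing on
// disk or recorded with an older version than the code expects.
std::vector<std::string> storesNeedingReindex(
    const std::map<std::string, int> &expectedVersions,
    const std::map<std::string, int> &onDiskVersions)
{
    std::vector<std::string> stale;
    for (const auto &entry : expectedVersions) {
        assert(entry.second >= 0);  // versions are non-negative in this sketch
        const auto it = onDiskVersions.find(entry.first);
        if (it == onDiskVersions.end() || it->second < entry.second) {
            stale.push_back(entry.first);
        }
    }
    return stale;  // ordered by store name, since std::map iterates in key order
}
```

With this scheme, bumping only the expected version of the events store would leave the email and contact stores untouched during migration.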
dvratil updated the task description. Feb 17 2018, 9:02 PM
dvratil updated the task description. Feb 17 2018, 9:20 PM
pablos added a subscriber: pablos. Mar 21 2018, 8:35 PM

We should ensure indexing works well with an extremely large number of email messages. I had to disable Thunderbird's index database (they use SQLite 3, I think) because every time I received a new email, TB would stop for 1/2 - 1 while it managed the index database.

I *know* Martin tends to have zillions of email messages. We can bug him! :)

dvratil moved this task from Backlog to In Progress on the KDE PIM board. Aug 7 2018, 2:27 PM
dkurz added a subscriber: dkurz. Oct 31 2018, 5:30 PM

Dear Dan, it is great to see your progress on this major item for improving Akonadi. Thank you.