Overhaul Baloo database scheme
Open, Normal, Public

Description

Problem statement

While most aspects of the scheme are working well, there are a few parts which are problematic. As all of these require changes to the DB, support for all items should preferably be done with a single database version bump/migration.

document ID

The document ID is a 64 bit integer, used to represent each file/document/directory uniquely. Currently, it is created as a concatenation of the inode's st_dev and st_ino structure fields (man 7 inode, stat.st_dev and stat.st_ino). The lower 32 bits of each field are used and combined as the high and low halves.
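For illustration, a minimal sketch (in C++) of the composition described above; the helper name is made up for this sketch and is not Baloo's actual API:

#include <sys/stat.h>
#include <cstdint>

// Current scheme: lower 32 bits of the inode number in the high half,
// lower 32 bits of the device id in the low half.
uint64_t documentIdFromStat(const struct stat& st)
{
    return (uint64_t(st.st_ino & 0xFFFFFFFF) << 32)
         | uint64_t(st.st_dev & 0xFFFFFFFF);
}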

Problems with this scheme:

  • Contemporary file systems use 64 bit inode numbers (XFS, BTRFS, EXT4 (optionally))
  • st_dev is not stable. For disks, it uses the device's major/minor number, i.e. moving the disk to a different controller will change the device ID. For network FSs and disk images, it is not stable at all and depends on the mount order.
  • The inode bits are put in the high part (endian dependent? at least for LE). Document IDs are in some places delta-coded for storage efficiency (e.g. by the document terms DB), which does not work well, as the mostly static part (device ID) is in the low bits and the varying part (inode) in the high bits, i.e. each difference is at least 2^32.

position db

For each (search) term, not only the document ID is stored, but also the positions (the frequency / number of occurrences is also calculated, but not stored).

Example:
The quick brown fox jumps over the not so quick dog
'the' : [1, 7]
'quick' : [2, 10]
'brown': [3]
...

The DB entries are keyed by term; the value is composed as `[documentID : number of positions : [positions]]`, concatenated for all matching documents.

E.g. adding one document with only 5 terms requires rewriting 5 DB entries, each a costly read-decode-insert-encode-write cycle. Large documents with many terms can trigger a rewrite of a significant portion of the DB.
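To make the cost concrete, here is a rough C++ sketch of the layout described above and of the read-decode-insert-encode-write cycle; the types and names are assumptions for illustration, not Baloo's actual classes:

#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct PositionEntry {
    uint64_t docId;
    std::vector<uint32_t> positions; // stored on disk preceded by their count
};

// value stored under one term key: entries for *all* matching documents
using TermValue = std::vector<PositionEntry>;

void addDocument(std::map<std::string, TermValue>& positionDb,
                 uint64_t docId,
                 const std::map<std::string, std::vector<uint32_t>>& termPositions)
{
    for (const auto& [term, positions] : termPositions) {
        TermValue value = positionDb[term];   // read + decode the full per-term entry
        value.push_back({docId, positions});  // insert (kept sorted by docId in the real DB)
        positionDb[term] = std::move(value);  // encode + write the whole entry back
    }
}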

extractor tracking

Each time a new extractor is added or an existing one is enhanced (possibly extracting new data), the DB still only contains the old data (T8079).

Solution proposal

document ID

  • use one database file per device
    • store the db file either on the same device as the tracked files
    • or/and
    • store an identifier per tracked device, e.g. the filesystem UUID
  • only store the inode inside the DB (64 bit)

This scheme also allows storing indexes on removable media or inside encrypted containers.

position db

a) remove the position DB altogether
or
b) use the document ID as key, store terms and positions as value

Regarding (a), the position DB is hardly used. The match decision is already possible using the terms DB only. For filenames, "phrase" matches are still possible as filenames are stored verbatim.

(b) avoids the RMW cycles when documents are changed. The storage size stays the same. Matching is much simpler, as the first matching step of the andPositionIterator can be replaced by a single database lookup.
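A minimal sketch of variant (b), with names chosen for illustration only: one entry per document, so adding or updating a document touches exactly one value and no entries of unrelated documents.

#include <cstdint>
#include <map>
#include <string>
#include <vector>

using TermPositions = std::map<std::string, std::vector<uint32_t>>; // term -> positions
using PositionDb    = std::map<uint64_t, TermPositions>;            // docId -> terms + positions

// one write per document, no read-modify-write over other documents
void putDocument(PositionDb& db, uint64_t docId, TermPositions terms)
{
    db[docId] = std::move(terms);
}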

Extractors

  • assign a version number to each extractor
  • store the used version on first use
  • on version increase, reindex all affected files
  • update the stored version (see the sketch below)
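A hedged sketch of the version tracking proposed above (the function and container names are illustrative, not existing Baloo code):

#include <map>
#include <string>

// extractor name -> version stored in the DB on first use
using ExtractorVersions = std::map<std::string, int>;

// Returns true if the files handled by this extractor need reindexing,
// and records the new version.
bool needsReindex(ExtractorVersions& stored, const std::string& extractor, int currentVersion)
{
    int& storedVersion = stored[extractor]; // 0 on first use
    if (storedVersion >= currentVersion)
        return false;                       // data in the DB is up to date
    storedVersion = currentVersion;         // update the stored version
    return true;                            // caller schedules reindexing of affected files
}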
bruns created this task. Oct 6 2018, 6:13 PM
bruns triaged this task as Normal priority.
poboiko added a subscriber: poboiko. Oct 7 2018, 9:04 AM

I would like to comment on that and add my 2 cents:

  • after full reindexing (at least in my case) almost 30% of the index is freelist (difference between "Actual Size" and "Expected Size"), i.e. empty pages in DB that can be used later to add new entries.
> balooctl indexSize
Actual Size: 544,08 MiB
Expected Size: 384,92 MiB

           PostingDB:     105,41 MiB    27.385 %
          PositionDB:     165,20 MiB    42.919 %
            DocTerms:      49,71 MiB    12.916 %
    DocFilenameTerms:       1,17 MiB     0.303 %
       DocXattrTerms:       4,00 KiB     0.001 %
              IdTree:     196,00 KiB     0.050 %
          IdFileName:     932,00 KiB     0.236 %
             DocTime:     512,00 KiB     0.130 %
             DocData:       1,61 MiB     0.417 %
   ContentIndexingDB:            0 B     0.000 %
         FailedIdsDB:            0 B     0.000 %
             MTimeDB:     428,00 KiB     0.109 %

See relevant discussion ("Disk Reclamation") and the answer from LMDB developer:

Disk Reclamation - In our experience of 15 years of OpenLDAP deployments, databases never shrink. Any time you delete a record you will probably be adding a new one shortly thereafter.

This is not our case: after full reindexing, we will hardly add lots of new data. It would be nice to find a way to reclaim it, or to populate the DB in a way so this does not happen (AFAIK free pages appear mainly when you delete entries).

  • DocTermsDB stores DocID -> List of terms mapping, but I didn't find a place where it is used (apart from balooshow, which is hardly a reason to keep it).
  • PositionDB does take a significant part of the index, but phrase search is a nice feature; it would be nice to know whether users actually use it before removing it.

As for me, solution (b) here is better.
Maybe we can also make it optional (like the content indexing checkbox in settings)?

BTW, is there any reason why, e.g. inside PostingDB, Baloo stores key -> encoded list of values instead of using MDB_DUPSORT | MDB_DUPFIXED (which allows storing multiple values per key)?

Now in order to add a pair (document + term) we need to:

  1. Fetch list of all documents with term
  2. Insert an ID there in a sorted order
  3. Put it all back in DB

Because of that:

  • For a large index, that makes transactions unnecessarily large (might be related to the memory usage issues?)
  • I think that's also exactly what causes the large (~30% of the DB) freelist
  • On the other hand, with duplicated keys we might just mdb_put(term, docid) (or at some point check whether it's already there).

It seems like we also won't need the PostingCodec::decode() / encode() routines, and DBPostingIterator could become zero-copy (with mdb_cursor_get(MDB_GET_MULTIPLE / MDB_NEXT_MULTIPLE) instead of fetching & decoding a QByteArray into a PostingList).
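For reference, a sketch of what a DUPSORT/DUPFIXED PostingDB write could look like with the raw LMDB C API; the DB name is illustrative and error handling is omitted (normally the MDB_dbi would also be opened once and cached):

#include <lmdb.h>
#include <cstdint>
#include <cstring>

void addPosting(MDB_env* env, const char* term, uint64_t docId)
{
    MDB_txn* txn = nullptr;
    mdb_txn_begin(env, nullptr, 0, &txn);

    MDB_dbi dbi;
    // duplicate keys allowed, all data items have the same fixed size (8 bytes)
    mdb_dbi_open(txn, "postingdb", MDB_CREATE | MDB_DUPSORT | MDB_DUPFIXED, &dbi);

    MDB_val key  { std::strlen(term), const_cast<char*>(term) };
    MDB_val data { sizeof(docId), &docId };

    // one put per (term, docId) pair; LMDB keeps the duplicates sorted
    mdb_put(txn, dbi, &key, &data, 0);

    mdb_txn_commit(txn);
}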

Are there any drawbacks that I'm missing?

bruns added a comment. Oct 31 2018, 8:22 PM

Are there any drawbacks that I'm missing?

If the docs ids were implemented correctly (device ID in the high bits) the delta encoding would save some memory.

For DUPSORT, multiple values are stored as keys (value-less) of an internal per-key database. So when inserting a value, you are inserting a key in an internal db. Using DUPSORT just shifts some work from baloo to lmdb, but the new value still has to be inserted at the right position.

  • DocTermsDB stores DocID -> List of terms mapping, but I didn't find a place where it is used (apart from balooshow, which is hardly a reason to keep it).
  • PositionDB does take a significant part of the index, but phrase search is a nice feature; it would be nice to know whether users actually use it before removing it. As for me, solution (b) here is better. Maybe we can also make it optional (like the content indexing checkbox in settings)?

My plan is to consolidate both DBs into one, as the information stored in PositionDB is partly redundant (term -> document lookup) and badly structured (n data decodes instead of 1 when doing phrase search).

Note, phrase search is currently not implemented (at least not for "normal terms").

I'd like to revive this discussion.

If the docs ids were implemented correctly (device ID in the high bits) the delta encoding would save some memory.

Well, as far as I understood your proposal, with one DB per device we don't want to store the device ID at all and use those bits to support 64-bit inode numbers.

For DUPSORT, multiple values are stored as keys (value-less) of an internal per-key database. So when inserting a value, you are inserting a key in an internal db. Using DUPSORT just shifts some work from baloo to lmdb, but the new value still has to be inserted at the right position.

Yep, I know. I've thought about it a bit. The internal DB is still a B+ tree, so LMDB will take care of the sorting (not sure what you meant by "value still has to be inserted at the right position").

So I'd like to emphasize again the pros of this approach:

  1. We won't need an RMW cycle when entering a new (term+docId) pair into the DB. Just a single mdb_put (W) operation will do.
  2. Which means that we won't need the sortedIdInsert routine from WriteTransaction, and we can also drop PostingCodec. Maybe also PositionCodec (less code is better!)
  3. It looks like we won't need to store m_pendingOperations inside WriteTransaction --- we can just put data into the DB as soon as we get it. Which I think is nice: m_pendingOperations resides on the heap, and during indexing (when processing a large batch) it can cause the memory-related problems which users sometimes report.
  4. The size of the freelist might shrink -- but for now that's just speculation, better do some testing.
  • DocTermsDB stores DocID -> List of terms mapping, but I didn't find a place where it is used (apart from balooshow, which is hardly a reason to keep it).

I messed up a bit, DocTermsDB is actually used when replacing or removing a document. It helps to find which data to remove from other DBs, so it's not meaningless.

My plan is to consolidate both DBs into one, as the information stored in PositionDB is partly redundant (term -> document lookup) and badly structured (n data decodes instead of 1 when doing phrase search).

This would be nice code-wise, but I'm afraid it will affect performance.
After the initial AndPostingIterator match, we will need to fetch the whole document (sic!) from the DB to check if the terms are indeed in the right positions. Which can be pretty huge, unfortunately.
AFAIR Baloo promoted itself as an indexer focused on maximizing search performance (with the trade-off being lower indexing performance).

As for redundancy, we can (for example) change the structure to the following:

  1. PostingDB stores mapping term -> (docId, uniquePositionId)
  2. PositionDB stores mapping uniquePositionId -> [list of positions]

(uniquePositionId can be e.g. some hash of term+docId)

bruns added a comment. Apr 4 2019, 7:55 PM

I'd like to revive this discussion.

If the docs ids were implemented correctly (device ID in the high bits) the delta encoding would save some memory.

Well, as far as I understood your proposal, with one DB per device we don't want to store the device ID at all and use those bits to support 64-bit inode numbers.

And after that, the sorted IDs would still have small deltas, so 1...2 bytes per document instead of 8 bytes.
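A small C++ sketch of the delta encoding referred to here: with the varying part in the low bits, sorted document IDs have small gaps and each delta fits in one or two varint bytes. This is an illustration, not Baloo's actual PostingCodec:

#include <cstdint>
#include <vector>

std::vector<uint8_t> encodeSortedIds(const std::vector<uint64_t>& sortedIds)
{
    std::vector<uint8_t> out;
    uint64_t prev = 0;
    for (uint64_t id : sortedIds) {
        uint64_t delta = id - prev;   // small if the IDs are dense
        prev = id;
        while (delta >= 0x80) {       // LEB128-style varint
            out.push_back(uint8_t((delta & 0x7F) | 0x80));
            delta >>= 7;
        }
        out.push_back(uint8_t(delta));
    }
    return out;
}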

For DUPSORT, multiple values are stored as keys (value-less) of an internal per-key database. So when inserting a value, you are inserting a key in an internal db. Using DUPSORT just shifts some work from baloo to lmdb, but the new value still has to be inserted at the right position.

Yep, I know. I've thought about it a bit. The internal DB is still a B+ tree, so LMDB will take care of the sorting (not sure what you meant by "value still has to be inserted at the right position").

So I'd like to emphasize again the pros of this approach:

  1. We won't need an RMW cycle when entering a new (term+docId) pair into the DB. Just a single mdb_put (W) operation will do.

Just because you hide the RMW, it does not mean it does not happen. For LMDB, duplicate keys are just a plain array, see e.g. MDB_APPENDDUP in http://www.lmdb.tech/doc/group__mdb.html#ga4fa8573d9236d54687c61827ebf8cac0.

The data entries are sorted by LMDB as soon as you do a mdb_put, and this is of course a RMW cycle.

My plan is to consolidate both DBs into one, as the information stored in PositionDB is partly redundant (term -> document lookup) and badly structured (n data decodes instead of 1 when doing phrase search).

This would be nice code-wise, but I'm afraid it will affect performance.
After the initial AndPostingIterator match, we will need to fetch the whole document (sic!) from the DB to check if the terms are indeed in the right positions. Which can be pretty huge, unfortunately.
AFAIR Baloo promoted itself as an indexer focused on maximizing search performance (with the trade-off being lower indexing performance).

Currently, to do a phrase match:

  1. For each searched/ANDed term the positiondb entry has to be decoded, that is a value containing all documentids with all their positions. For common terms, this is a large dataset.
  2. Then, the intersection of all term/document pairs has to be done.
  3. For each term/document, the positions have to be intersected.

Steps 1 and 2 could be swapped, using the postingdb to optimize the positiondb decoding and allowing irrelevant documents to be pruned early. Still, it is a considerably complex operation.

I think you underestimate the effort of decoding the whole dataset behind each positiondb term.

As for redundancy, we can (for example) change the structure to the following:

  1. PostingDB stores mapping term -> (docId, uniquePositionId)
  2. PositionDB stores mapping uniquePositionId -> [list of positions] (uniquePositionId can be e.g. some hash of term+docId)

uniquepositionid would be comparatively large, as the number of possible terms is large. We would go from ~2...3 bytes for the docId alone to maybe 10...11 bytes (uniquepositionid would be more or less incompressible). You pay the uniquepositionid cost a second time when you use it as the positiondb key. While it is negligible when a term has many positions, it is a considerable cost for uncommon terms.

Think of how much data you need for storing the full text of a document in your scheme: all positions plus one uniquepositionid per unique term. Typically, every document has some frequent terms, but many terms only have 1 or 2 occurrences. You probably end up at 5...10 bytes per word.

Now, have a look at how well contemporary compression algorithms compress plain text: you end up at less than 1 byte per word. This means it is possible to hold 10 times as much full text cached in RAM.

Now you argue - "I have to fetch less data", but that is not actually true:

Case 1, rare terms:
Most documents are already pruned using the PostingDB alone. Your final step in the PositionDB becomes faster, but you already have wasted space and effort going through the PostingDB (which is now 2 or 3 times as large as when not storing the uniquepositionid).

Case 2, frequent terms:
You have to decode larger PostingDB values and store them. For each uniquepositionid tuple you have to fetch the positions, decode it and do the check. The uniquepositionid entries are spread over the DB. As the terms are frequent, the entries are comparatively large.

If at least one term is rare, the first step already prunes most of the documents. Only if all terms are frequent, a considerable amount of effort is spent for going through the term positions.

Just because you hide the RMW, it does not mean it does not happen. For LMDB, duplicate keys are just a plain array, see e.g. MDB_APPENDDUP in http://www.lmdb.tech/doc/group__mdb.html#ga4fa8573d9236d54687c61827ebf8cac0.

The data entries are sorted by LMDB as soon as you do a mdb_put, and this is of course a RMW cycle.

I thought the structure behind MDB_DUPSORT is a bit more clever than just a plain array, i.e. it's still a search tree.
And insertion should still cost much less compared to what is done now (fetch the whole list + insert + put it back).
Of course it still might do some RMW work (tree rebalancing, for example); my point is that it should be more efficient.

[...]
I think you underestimate the effort of decoding the whole dataset behind each positiondb term.

That is exactly the point. Right now a single PositionDB entry is most likely much larger than any single document. It costs a lot to load/decode it, it costs a lot to replace it, it eats a lot of memory; its design is quite suboptimal.

What I propose is that a single PositionDB entry will store the positions of a given term inside a given document. It would be a rather small list, and I think it is just the smallest possible chunk of data that is needed to perform matching.

uniquepositionid would be comparatively large, as the number of possible terms is large. We would go from ~2...3 bytes for the docId alone to maybe 10...11 bytes (uniquepositionid would be more or less incompressible). You pay the uniquepositionid cost a second time when you use it as the positiondb key. While it is negligible when a term has many positions, it is a considerable cost for uncommon terms.
[...]

That's a fair point. I guess we can add some heuristics for uncommon terms. For example, if the term appears only once or twice inside a document, we can store its positions directly in uniquePositionId, so less memory would be wasted.
For example, if uniquePositionId is 64 bit, a single bit can be used to tell whether we should go to PositionDB or not, and 3 bytes per position (2^24 = 16777216 should be enough) will give us enough room to store 2 positions. And no PositionDB entry. That should at least cover the rare terms problem.
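A sketch of that heuristic; the bit layout is an assumption for illustration, not an agreed-on format. Positions start at 1 (as in the example in the description), so a zero second slot means "no second position":

#include <cstdint>
#include <optional>
#include <vector>

constexpr uint64_t kExternalFlag = 1ULL << 63;     // set -> consult PositionDB
constexpr uint32_t kMaxInlinePos = (1u << 24) - 1; // 3 bytes per position

// Returns an inline uniquePositionId if the positions fit, otherwise nothing
// (in which case a real PositionDB entry is needed and the flag bit is set).
std::optional<uint64_t> packInline(const std::vector<uint32_t>& positions)
{
    if (positions.empty() || positions.size() > 2)
        return std::nullopt;
    for (uint32_t p : positions)
        if (p > kMaxInlinePos)
            return std::nullopt;

    uint64_t id = uint64_t(positions[0]);       // bits 0..23: first position
    if (positions.size() == 2)
        id |= uint64_t(positions[1]) << 24;     // bits 24..47: second position
    return id;                                  // flag bit stays clear
}

bool needsPositionDb(uint64_t uniquePositionId)
{
    return (uniquePositionId & kExternalFlag) != 0;
}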


BTW, I gathered some numbers (relevant for my computer - but I have nothing special, just a bunch of documents, books, photos...). The index is not that large though:

        PostingDB:     105,01 MiB    26.776 %
       PositionDB:     168,34 MiB    42.925 %
         DocTerms:      52,17 MiB    13.302 %
 DocFilenameTerms:       1,04 MiB     0.265 %
    DocXattrTerms:       4,00 KiB     0.001 %
           IdTree:     180,00 KiB     0.045 %
       IdFileName:     788,00 KiB     0.196 %
          DocTime:     472,00 KiB     0.118 %
          DocData:     564,00 KiB     0.140 %
ContentIndexingDB:            0 B     0.000 %
      FailedIdsDB:       4,00 KiB     0.001 %
          MTimeDB:     332,00 KiB     0.083 %

I've counted the number of unique (term+docId) pairs (using the PositionDB::toTestMap() routine) where the term appears {1,2,3,4,5} times in a document. That would correspond to the number of proposed PostingDB entries of size {1,2,3,4,5}.

3002357 - 1
581909 - 2
244146 - 3
145762 - 4
94318 - 5

At least in my case, the proposed heuristic will work quite well (1+2 cover ~85% of all entries!)

bruns added a comment. Apr 5 2019, 10:17 PM

Just because you hide the RMW, it does not mean it does not happen. For LMDB, duplicate keys are just a plain array, see e.g. MDB_APPENDDUP in http://www.lmdb.tech/doc/group__mdb.html#ga4fa8573d9236d54687c61827ebf8cac0.

The data entries are sorted by LMDB as soon as you do a mdb_put, and this is of course a RMW cycle.

I thought the structure behind MDB_DUPSORT is a bit more clever than just a plain array, i.e. it's still a search tree.
And insertion should still cost much less compared to what is done now (fetch the whole list + insert + put it back).
Of course it still might do some RMW work (tree rebalancing, for example); my point is that it should be more efficient.

You can look up an entry in a sorted array in log(N) time, it will not get any faster than that. But even a tree has to be written back to disk. When the tree has some holes, you only have to fetch a single page, insert at the right position, and write the page back. But the density goes down significantly - 8 bytes per document plus some holes, or you end up rebalancing all the time, which causes write amplification.

[...]
I think you underestimate the effort of decoding the whole dataset behind each positiondb term.

That is exactly the point. Right now a single PositionDB entry is most likely much larger than any single document. It costs a lot to load/decode it, it costs a lot to replace it, it eats a lot of memory; its design is quite suboptimal.

What I propose is that a single PositionDB entry will store the positions of a given term inside a given document. It would be a rather small list, and I think it is just the smallest possible chunk of data that is needed to perform matching.

Overhead! Every time you chunk the data into smaller blocks you have to add metadata.

uniquepositionid would be comparatively large, as the number of possible terms is large. We would go from ~2...3 bytes for the docId alone to maybe 10...11 bytes (uniquepositionid would be more or less incompressible). You pay the uniquepositionid cost a second time when you use it as the positiondb key. While it is negligible when a term has many positions, it is a considerable cost for uncommon terms.
[...]

That's a fair point. I guess we can add some heuristics for uncommon terms. For example, if the term appears only once or twice inside a document, we can store its positions directly in uniquePositionId, so less memory would be wasted.
For example, if uniquePositionId is 64 bit, a single bit can be used to tell whether we should go to PositionDB or not, and 3 bytes per position (2^24 = 16777216 should be enough) will give us enough room to store 2 positions. And no PositionDB entry. That should at least cover the rare terms problem.

I've counted the number of unique (term+docId) pairs (using the `PositionDB::toTestMap()` routine) where the term appears {1,2,3,4,5} times in a document. That would correspond to the number of proposed `PostingDB` entries of size {1,2,3,4,5}.

3002357 - 1
581909 - 2
244146 - 3
145762 - 4
94318 - 5

You are missing the entry for "0" - you still have to store the 64bit ID, even if a term is not in the positionDB. All property, filename, tag, ... terms are only in the PostingDB, not in the PositionDB. You are blowing up the PostingDB to get the PositionDB smaller.

I didn't realize LMDB does not modify pages in place; when we change one, it creates a new page, copies the data from the old page, modifies the copy, and puts the old page on the freelist.
In that case we'll most likely end up modifying a single page in both cases.

You are missing the entry for "0" - you still have to store the 64bit ID, even if a term is not in the positionDB. All property, filename, tag, ... terms are only in the PostingDB, not in the PositionDB.

I've checked it (sum of all postingDb[term].size() where term does not appear in positionDb), it appears to be quite negligible in my case (the number is 31223).

You are blowing up the PostingDB to get the PositionDB smaller

What I'm really trying to do here is:

  1. Get rid of the whole m_pendingOperations routine inside WriteTransaction. It stores almost a whole transaction on the heap. During content indexing, if we're dealing with a batch of 50 more or less large text documents, it can blow up. And I think that's exactly what happens in most of those "baloo eats too much memory" complaints. It doesn't make any sense with a memory-mapped DB - we still end up eating lots of memory that cannot be efficiently reclaimed by the OS. I'm trying to think of an efficient way of putting that data into the DB right after we get it from the extractor, and letting LMDB take care of it.
  2. Make phrase search only fetch the data that is actually needed to perform the search - and thus (allegedly) improve phrase matching performance. The current implementation is bad - it fetches all documents + all positions for a single term, which is most likely to be huge. Fetching a whole document can be pretty costly as well - just think of some common term and searching through a library with a bunch of 50k-100k-word books. That might be painful, maybe even almost as painful as just grep'ping through the document.

Unfortunately, it won't reduce the number of I/O operations per document during indexing.
That is a drawback of having a B+-tree-based index; we traded write efficiency for read efficiency...


Actually, if uniquePositionId were some known hash function of the (docId, term) pair, we wouldn't need to store it inside PostingDB at all. Probably we'll still have to store something there - just to make sure we can deal with hash collisions - but 1 byte might be sufficient. In that case PostingDB can still be delta-encoded efficiently, and the only overhead we'll have to pay is the uniquePositionId key (plus metadata) inside PositionDB.

hyc added a subscriber: hyc. Apr 10 2019, 2:32 PM

Fwiw, @poboiko is correct here.

Just because you hide the RMW, it does not mean it does not happen. For LMDB, duplicate keys are just a plain array, see e.g. MDB_APPENDDUP in http://www.lmdb.tech/doc/group__mdb.html#ga4fa8573d9236d54687c61827ebf8cac0

Nothing on that link implies use of an array anywhere.

The data entries are sorted by LMDB as soon as you do a mdb_put, and this is of course a RMW cycle.

I thought the structure behind MDB_DUPSORT is a bit more clever than just a plain array, i.e. it's still a search tree.
And insertion should still cost much less compared to what is done now (fetch the whole list + insert + put it back).
Of course it still might do some RMW work (tree rebalancing, for example); my point is that it should be more efficient.

Correct, duplicate values are stored in a sub-Btree under the key, and insertion into a tree is much more efficient than
insertion into a sorted array, particularly as the number of elements increases.

bruns added a comment. Apr 10 2019, 7:27 PM
In T9805#181517, @hyc wrote:

Fwiw, @poboiko is correct here.

Just because you hide the RMW, it does not mean it does not happen. For LMDB, duplicate keys are just a plain array, see e.g. MDB_APPENDDUP in http://www.lmdb.tech/doc/group__mdb.html#ga4fa8573d9236d54687c61827ebf8cac0

Nothing on that link implies use of an array anywhere.

MDB_GET_MULTIPLE returns "up to a page of duplicate data items from current cursor position", so apparently the values for DUPFIXED are packed, or is the page created during the mdb_cursor_get call?

Btw, how does one determine how many items in the page are actually valid? Same as for mdb_cursor_put(..., MDB_MULTIPLE), i.e. a pair of MDB_val for the data?

The data entries are sorted by LMDB as soon as you do a mdb_put, and this is of course a RMW cycle.

I thought the structure behind MDB_DUPSORT is a bit more clever than just a plain array, i.e. it's still a search tree.
And insertion should still cost much less compared to what is done now (fetch the whole list + insert + put it back).
Of course it still might do some RMW work (tree rebalancing, for example); my point is that it should be more efficient.

Correct, duplicate values are stored in a sub-Btree under the key, and insertion into a tree is much more efficient than
insertion into a sorted array, particularly as the number of elements increases.

As soon as your list of secondary keys extends beyond a page size, you can of course be somewhat more efficient, by inserting a new page in between, or rewriting just the affected page when deleting an item.

But please answer the following questions:

  1. How large is the overhead if most of the keys have just one, or a few secondary keys? (Primary key is a variable length blob, secondary key is a 64bit integer, no associated data).
  2. How many 64bit secondary values can be saved in one page?
  3. What happens if many keys are deleted from adjacent pages, are these leaf pages collapsed?
  4. How does the B-Tree compare when e.g. 5 random secondary keys are removed and 5 new ones are inserted? (5 times mdb_del followed by 5 times mdb_put).
hyc added a comment (edited). Apr 11 2019, 1:10 AM
In T9805#181560, @bruns wrote:
In T9805#181517, @hyc wrote:

Fwiw, @poboiko is correct here.

Just because you hide the RMW, it does not mean it does not happen. For LMDB, duplicate keys are just a plain array, see e.g. MDB_APPENDDUP in http://www.lmdb.tech/doc/group__mdb.html#ga4fa8573d9236d54687c61827ebf8cac0

Nothing on that link implies use of an array anywhere.

MDB_GET_MULTIPLE returns "up to a page of duplicate data items from current cursor position", so apparently the values for DUPFIXED are packed, or is the page created during the mdb_cursor_get call?

You're mixing very separate cases. DUPFIXED is a special case of DUPSORT and yes, for DUPFIXED all of the items are packed in an array on a page with no per-item overhead.

Btw, how does one determine how many items in the page are actually valid? Same as for mdb_cursor_put(..., MDB_MULTIPLE), i.e. a pair of MDB_val for the data?

The MDB_val for the data contains a size. Divide by the size of a single DUPFIXED item to get the number of items returned in that page.
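A sketch of that read path (error handling omitted, names illustrative), assuming the docIds are stored as fixed-size 64-bit duplicates:

#include <lmdb.h>
#include <cstdint>
#include <cstring>
#include <vector>

std::vector<uint64_t> docsForTerm(MDB_txn* txn, MDB_dbi dbi, const char* term)
{
    std::vector<uint64_t> result;

    MDB_cursor* cursor = nullptr;
    mdb_cursor_open(txn, dbi, &cursor);

    MDB_val key { std::strlen(term), const_cast<char*>(term) };
    MDB_val data {};

    if (mdb_cursor_get(cursor, &key, &data, MDB_SET) == MDB_SUCCESS &&
        mdb_cursor_get(cursor, &key, &data, MDB_GET_MULTIPLE) == MDB_SUCCESS) {
        do {
            // number of packed items = size of the returned chunk / item size
            const size_t count = data.mv_size / sizeof(uint64_t);
            const size_t oldSize = result.size();
            result.resize(oldSize + count);
            std::memcpy(result.data() + oldSize, data.mv_data, count * sizeof(uint64_t));
        } while (mdb_cursor_get(cursor, &key, &data, MDB_NEXT_MULTIPLE) == MDB_SUCCESS);
    }

    mdb_cursor_close(cursor);
    return result;
}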

The data entries are sorted by LMDB as soon as you do a mdb_put, and this is of course a RMW cycle.

I thought the structure behind MDB_DUPSORT is a bit more clever than just a plain array, i.e. it's still a search tree.
And insertion should still cost much less compared to what is done now (fetch the whole list + insert + put it back).
Of course it still might do some RMW work (tree rebalancing, for example); my point is that it should be more efficient.

Correct, duplicate values are stored in a sub-Btree under the key, and insertion into a tree is much more efficient than
insertion into a sorted array, particularly as the number of elements increases.

As soon as your list of secondary keys extends beyond a page size, you can of course be somewhat more efficient, by inserting a new page in between, or rewriting just the affected page when deleting an item.

But please answer the following questions:

  1. How large is the overhead if most of the keys have just one, or a few secondary keys? (Primary key is a variable length blob, secondary key is a 64bit integer, no associated data).

There are multiple answers to this question, depending on DUPFIXED config and data sizes.

If all of the secondary keys are the same size, then you can use DUPFIXED, which as already noted has no per-item overhead.
If you don't use DUPFIXED there are 8 bytes of metadata per item.

If the key has only 1 value, there is no difference from any non-DUPSORT record. If the key has a small number of values greater than 1, there is 48 bytes of overhead for the sub-btree header. If the key has a large number of values they simply get put onto their own page(s) as regular B+tree nodes.

  2. How many 64bit secondary values can be saved in one page?

Every page has a 16 byte header. On a 4096 byte page that leaves you with 4080 bytes for user data.

  3. What happens if many keys are deleted from adjacent pages, are these leaf pages collapsed?

Yes of course, this is fundamental to B+tree operation.

  4. How does the B-Tree compare when e.g. 5 random secondary keys are removed and 5 new ones are inserted? (5 times mdb_del followed by 5 times mdb_put).

I don't understand the question.

bruns added a comment. Apr 11 2019, 1:58 AM
In T9805#181576, @hyc wrote:
In T9805#181560, @bruns wrote:
In T9805#181517, @hyc wrote:

Fwiw, @poboiko is correct here.

Just because you hide the RMW, it does not mean it does not happen. For LMDB, duplicate keys are just a plain array, see e.g. MDB_APPENDDUP in http://www.lmdb.tech/doc/group__mdb.html#ga4fa8573d9236d54687c61827ebf8cac0

Nothing on that link implies use of an array anywhere.

MDB_GET_MULTIPLE returns "up to a page of duplicate data items from current cursor position", so apparently the values for DUPFIXED are packed, or is the page created during the mdb_cursor_get call?

You're mixing very separate cases. DUPFIXED is a special case of DUPSORT and yes, for DUPFIXED all of the items are packed in an array on a page with no per-item overhead.

For the use case, no. The proposal is to replace "Key + variable sized blob, no DUP" with "Key + fixed size secondary". So DUPFIXED applies here.

Btw, how does one determine how many items in the page are actually valid? Same as for mdb_cursor_put(..., MDB_MULTIPLE), i.e. a pair of MDB_val for the data?

The MDB_val for the data contains a size. Divide by the size of a single DUPFIXED item to get the number of items returned in that page.

Yes, makes sense. For put, the second MDB_val is required as the fixed val size is not known to LMDB until the first put?

As soon as your list of secondary keys extends beyond a page size, you can of course be somewhat more efficient, by inserting a new page in between, or rewriting just the affected page when deleting an item.

But please answer the following questions:

  1. How large is the overhead if most of the keys have just one, or a few secondary keys? (Primary key is a variable length blob, secondary key is a 64bit integer, no associated data).

If all of the secondary keys are the same size, then you can use DUPFIXED, which as already noted has no per-item overhead.
If you don't use DUPFIXED there's an 8 byte metadata per item.

How large is the overhead of a single entry? I.e. "Blob key + blob data" in a not DUPSORTed db, vs "Blob key + fixed data" in a DUPFIXED db? Is there a per key overhead to allow DUP*?

  2. How many 64bit secondary values can be saved in one page?

Every page has a 16 byte header. On a 4096 byte page that leaves you with 4080 bytes for user data.

  3. What happens if many keys are deleted from adjacent pages, are these leaf pages collapsed?

Yes of course, this is fundamental to B+tree operation.

  4. How does the B-Tree compare when e.g. 5 random secondary keys are removed and 5 new ones are inserted? (5 times mdb_del followed by 5 times mdb_put).

I don't understand the question.

The typical operation is to delete a few thousand entries (e.g. 2000 primary keys, each with 1...5 secondary keys) and then insert 1...5 new secondary keys into the same primary keys again. So it is actually an update of the key data (which requires sorted insert and eventually rebalancing). This happens in a single transaction.

Does it make a difference if the updates are sorted by the primary key or the secondary key in the update? I.e:

  • Variant A, on Key X, update data a to data m, update data b to data p, ..., on Key Y, update a, b, ..., Key Z, Key XX, Key XY, ...
  • Variant B, on Key X, update data a, Key Y update a, Key Z, XX, XY update a; Key X, Y, Z update b, Key X ...

When such a transaction happens, how many pages are in use? Are multiple updates affecting one page contained within that page, or does this require allocating new pages?

hyc added a comment. Apr 11 2019, 6:54 AM
In T9805#181577, @bruns wrote:
In T9805#181576, @hyc wrote:

The MDB_val for the data contains a size. Divide by the size of a single DUPFIXED item to get the number of items returned in that page.

Yes, makes sense. For put, the second MDB_val is required as the fixed val size is not known to LMDB until the first put?

Correct.

But please answer the following questions:

  1. How large is the overhead if most of the keys have just one, or a few secondary keys? (Primary key is a variable length blob, secondary key is a 64bit integer, no associated data).

If all of the secondary keys are the same size, then you can use DUPFIXED, which as already noted has no per-item overhead.
If you don't use DUPFIXED there's an 8 byte metadata per item.

How large is the overhead of a single entry? I.e. "Blob key + blob data" in a not DUPSORTed db, vs "Blob key + fixed data" in a DUPFIXED db?

I already answered this directly above your question.

Is there a per key overhead to allow DUP*?

No.

  4. How does the B-Tree compare when e.g. 5 random secondary keys are removed and 5 new ones are inserted? (5 times mdb_del followed by 5 times mdb_put).

I don't understand the question.

Typical operation is to delete a few thousand entries (e.g. 2000 primary keys each with 1..5 secondary keys) and insert again 1...5 new secondary keys into the same primary keys again. So an update of the key data actually (which requires sorted insert and rebalancing eventually). This happens in a single transaction.

Does it make a difference if the updates are sorted by the primary key or the secondary key in the update? I.e:

  • Variant A, on Key X, update data a to data m, update data b to data p, ..., on Key Y, update a, b, ..., Key Z, Key XX, Key XY, ...
  • Variant B, on Key X, update data a, Key Y update a, Key Z, XX, XY update a; Key X, Y, Z update b, Key X ...

You'd probably get better cache utilization with variant A since you don't need to re-seek to Key X each time you modify its secondaries.

When such a transaction happens, how many pages are in use? Are multiple updates affecting one page contained to this page, or does this require allocating new pages?

A clean page is copied and made dirty only once per transaction. The same dirty page is used for all modifications to items on that page.

Update.
I've spent some time implementing the structure I propose (the code is available in a private clone).
Most notable changes:

  1. PostingDB is switched to DUPSORT; its values are <docId + position list> (so no DUPFIXED here)
  2. However, DUPSORT values cannot be larger than MDB_MAXKEYSIZE (511 bytes by default, IIRC), so positions of larger size go to PositionDB. Keys in PositionDB are md5sum(docId+term) + delta, where delta is a single byte stored inside PostingDB to resolve hash collisions (see the sketch below).
  3. Code-wise, in order to support a structure where both DBs have to be updated simultaneously, I've merged them into a single class, PostingPositionDB, which takes care of both DBs.
  4. Got rid of pendingOperations inside WriteTransaction; all indexed data goes straight to the DB, without the fetch-everything + insert + write-everything operations. Actually, I merged WriteTransaction and Transaction, because now there is not much sense in separating them.
  5. Tried to adapt the tests as much as possible, added a few more tests, removed obsolete tests as well as some of the unused code, etc.

So data duplication is completely gone, and heap allocation during WriteTransaction is gone as well, as I wanted. The code is WIP, although it passes the tests and works on my machine pretty nicely.
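For illustration, a sketch of the PositionDB key derivation described in point 2, using Qt's QCryptographicHash since Baloo is Qt-based; the exact layout is an assumption, not necessarily what the WIP branch does:

#include <QByteArray>
#include <QCryptographicHash>
#include <cstdint>

QByteArray positionDbKey(uint64_t docId, const QByteArray& term, uint8_t delta)
{
    QCryptographicHash hash(QCryptographicHash::Md5);
    hash.addData(reinterpret_cast<const char*>(&docId), sizeof(docId));
    hash.addData(term);

    QByteArray key = hash.result();   // 16-byte md5sum(docId + term)
    key.append(char(delta));          // collision counter, the byte kept in PostingDB
    return key;
}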


Now for actual numbers.

First test: my machine

I've just run both implementations on my machine (~11K documents, mostly text).

Old DB

$ balooctl indexSize
File Size: 554,01 MiB
Used:      331,59 MiB

           PostingDB:     105,73 MiB    31.884 %
          PositionDB:     169,92 MiB    51.243 %
            DocTerms:      52,63 MiB    15.870 %
    DocFilenameTerms:       1,05 MiB     0.317 %
       DocXattrTerms:       4,00 KiB     0.001 %
              IdTree:     184,00 KiB     0.054 %
          IdFileName:     824,00 KiB     0.243 %
             DocTime:     456,00 KiB     0.134 %
             DocData:     532,00 KiB     0.157 %
   ContentIndexingDB:            0 B     0.000 %
         FailedIdsDB:            0 B     0.000 %
             MTimeDB:     328,00 KiB     0.097 %

mdb_stat:


Massif graph:

Massif data:

New DB

$ balooctl indexSize
File Size: 475,26 MiB
Used:      227,86 MiB

           PostingDB:     155,05 MiB    68.048 %
          PositionDB:      16,89 MiB     7.413 %
            DocTerms:      52,60 MiB    23.083 %
    DocFilenameTerms:       1,05 MiB     0.463 %
       DocXattrTerms:       4,00 KiB     0.002 %
              IdTree:     184,00 KiB     0.079 %
          IdFileName:     804,00 KiB     0.345 %
             DocTime:     456,00 KiB     0.195 %
             DocData:     532,00 KiB     0.228 %
   ContentIndexingDB:            0 B     0.000 %
         FailedIdsDB:            0 B     0.000 %
             MTimeDB:     336,00 KiB     0.144 %

mdb_stat:

Massif graph:

Massif data:

Results

  1. The DB size has decreased (both "used" and "file size"). Which is good, although it isn't a big surprise, since there is no more data duplication.
  2. The freelist size has increased as well (in the "new DB", it's almost 60% of the DB). Which bothers me quite a lot.
  3. Overall speed has increased almost 1.5x (the x-axis in the massif graphs is the number of operations).
  4. Memory consumption has decreased, although not dramatically. It is hard to interpret, since it oscillates quite a bit and the numbers depend on the moment when massif decided to take a snapshot.

Second test: "worst-case scenario"

Large plain-text files with junk are quite painful for baloo to work with.
So I've created 14 random junk files (using `for I in $(seq 1 14); do base64 /dev/urandom | head -c 10000000 | sed 's/[+/]/ /g' > file$I.txt; done`) and let baloo index them (plus ~100 small documents inside my home folder).

Old DB

$ balooctl indexSize
File Size: 672,83 MiB
Used:      665,63 MiB

           PostingDB:     269,63 MiB    40.507 %
          PositionDB:     303,02 MiB    45.524 %
            DocTerms:      92,92 MiB    13.960 %
    DocFilenameTerms:      20,00 KiB     0.003 %
       DocXattrTerms:            0 B     0.000 %
              IdTree:       4,00 KiB     0.001 %
          IdFileName:      16,00 KiB     0.002 %
             DocTime:      12,00 KiB     0.002 %
             DocData:       4,00 KiB     0.001 %
   ContentIndexingDB:            0 B     0.000 %
         FailedIdsDB:            0 B     0.000 %
             MTimeDB:       4,00 KiB     0.001 %

Massif graph:

Massif data:

New DB

$ balooctl indexSize 
File Size: 399,01 MiB
Used:      398,29 MiB

           PostingDB:     304,27 MiB    76.393 %
          PositionDB:       1,04 MiB     0.261 %
            DocTerms:      92,93 MiB    23.331 %
    DocFilenameTerms:      20,00 KiB     0.005 %
       DocXattrTerms:            0 B     0.000 %
              IdTree:       4,00 KiB     0.001 %
          IdFileName:      16,00 KiB     0.004 %
             DocTime:      12,00 KiB     0.003 %
             DocData:       4,00 KiB     0.001 %
   ContentIndexingDB:            0 B     0.000 %
         FailedIdsDB:            0 B     0.000 %
             MTimeDB:       4,00 KiB     0.001 %

Massif graph:

Massif data:

Results

  1. After 14 junk files, with the old DB my memory is completely gone (baloo_file_extractor ate ~2 GB, but together with massif it was ~4 GB). Considering that the batch size is now set to 40, it's quite easy to make baloo eat all available memory and die.
  2. Most of this memory is related to pendingOperations. So with the new DB, the only memory still in use is for LMDB itself, which is almost 5x less!
  3. And again, it's almost 1.5x faster (in operations).

So, with all that, I think this change is worth it.
What bothers me is the worse freelist usage, which is apparently due to fragmentation and free pages being scattered across the DB.

I've seen a presentation by Howard, where he showed improvements that will come with the LMDB 1.0 release, among which were freelist improvements and support for a larger MAXKEYSIZE. That might indeed solve the freelist issue, and with the second change we might even drop PositionDB completely and always store positions inside PostingDB. The DB format for the 1.0 release will be incompatible and require a migration anyway, so we can wait for it to come out, see if it resolves our issues, and make our changes to the DB as well.

On the other hand, there has been talk about the upcoming 1.0 release since ~2016, so it's hard to say when it will come out and whether we should wait...

hyc added a comment. May 25 2019, 4:41 PM

Had planned to release LMDB 1.0 this summer, but I've been a bit too busy with other things to get all the feature branches merged. Sadly, I would advise you not to hold your breath.

@bruns what do you think about it?

I don't think Baloo should store an index right on the filesystem it's going to index just to get rid of the device id. It seems tempting but I can think of multiple problems:

  1. People may not want to spend unnecessary write cycles on some types of media, especially not as a surprise
  2. The storage path is not well defined: you may lack write permissions at the root of the fs, and having to choose an alternative location in that case makes the path unpredictable
  3. I think at the very least it's difficult to implement locks across multiple databases should this be needed
  4. Querying will become more difficult as multiple indexes will have to be queried and results have to be merged
  5. Having multiple databases mapped into memory at the same time increases pressure on the kernel memory manager a lot, we probably do not want that for a desktop system

Thus, the device id should be stored along with the inode number.

As pointed out initially, st_dev is unstable, and the filesystem UUID was suggested instead. But at least for btrfs that's also not a good choice, because all subvolumes share the same UUID but do not share the inode namespace; btrfs has UUID_SUB instead. Such UUIDs can become quite long, so I'd propose creating a mapping table inside the Baloo DB which registers every new unique identifier with a monotonically increasing counter that can then be used as the device ID.
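A minimal sketch of such a mapping table (names illustrative; in practice the map would live in its own LMDB table):

#include <cstdint>
#include <map>
#include <string>

class DeviceIdRegistry {
public:
    // Returns the stable device id for a filesystem identifier
    // (UUID, or UUID_SUB for btrfs subvolumes), assigning the next
    // free counter value on first sight.
    uint32_t deviceIdFor(const std::string& fsIdentifier)
    {
        auto it = m_ids.find(fsIdentifier);
        if (it != m_ids.end())
            return it->second;
        const uint32_t id = m_nextId++;
        m_ids.emplace(fsIdentifier, id);
        return id;
    }

private:
    std::map<std::string, uint32_t> m_ids; // fs identifier -> device id
    uint32_t m_nextId = 1;                 // 0 reserved for "unknown"
};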

hyc added a comment. Oct 17 2019, 12:04 AM

hurikhan's suggestion makes good sense.