Investigate memory issues in Baloo
Open, Needs Triage, Public

Description

There are several "memory leak" bugs in Bugzilla, e.g. 359119, 371659, 380456, 384088, 386791, 394750, etc.

I've actually noted that it is somewhat reproducible on my machine: when I perform the initial indexing after `balooctl disable && balooctl enable`, the private memory does grow steadily (from ~10MB at the beginning up to ~300MB at the end). Not fatal - but that's just several GB of documents/books/other stuff. Other users can have it much worse.
Looking at the code and running it under valgrind gave no results - as far as I can see, there is no direct memory leak there; at least valgrind does not find any memory with no pointers left to it.

Most of the memory is on the heap, so I tried massif, which revealed that memory usage grows steadily inside LMDB, in mdb_page_malloc. It looks like these allocations are related to dirty pages (most of them come from mdb_page_touch calls), i.e. pages loaded into memory and not yet written to disk (?)
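To make that concrete, here is a minimal standalone sketch (not Baloo code; the `./testdb` path is a placeholder and error checking is omitted) showing the pattern massif points at: every page touched inside one write transaction is copied to the heap and only released around commit.

```cpp
// Minimal sketch, not Baloo code: heap usage from LMDB dirty pages grows with
// the size of a single write transaction. Error checking omitted for brevity;
// the "./testdb" directory is a placeholder and must exist beforehand.
#include <lmdb.h>
#include <string>

int main()
{
    MDB_env *env = nullptr;
    mdb_env_create(&env);
    mdb_env_set_mapsize(env, size_t(1) << 30);   // 1 GiB map, arbitrary example
    mdb_env_open(env, "./testdb", 0, 0664);

    MDB_txn *txn = nullptr;
    mdb_txn_begin(env, nullptr, 0, &txn);
    MDB_dbi dbi = 0;
    mdb_dbi_open(txn, nullptr, 0, &dbi);

    std::string value(1024, 'x');
    for (int i = 0; i < 100000; ++i) {
        std::string key = "key" + std::to_string(i);
        MDB_val k, v;
        k.mv_size = key.size();
        k.mv_data = const_cast<char *>(key.data());
        v.mv_size = value.size();
        v.mv_data = const_cast<char *>(value.data());
        // Each put touches pages; touched pages are copied to the heap
        // (mdb_page_touch -> mdb_page_malloc) and kept on the dirty list.
        mdb_put(txn, dbi, &k, &v, 0);
    }
    mdb_txn_commit(txn);   // dirty pages are written out here
    mdb_env_close(env);
    return 0;
}
```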

Following is some braindump, don't take it too seriously:

  • Why doesn't LMDB free those pages after mdb_txn_commit, when it should have written them to disk and no longer needs them? Did they end up on the mdb_env dirty list? Why? Maybe, as a quick-and-dirty hack, we could close & reopen the mdb_env? Or even restart the whole baloo_extractor_process after each batch (40 files)?
    • AFAIK, LMDB should start spilling them to disk if there are too many of them. I don't actually know the threshold, or whether it kicks in in our case (maybe I was just below that limit?)
    • The dirty list going insane can actually cause a crash, which also seems to be quite popular in Bugzilla - bug 389848. The assert that triggers the crash tells us that LMDB was unable to add a page to the dirty list; one of the possible reasons is the list being too large.
  • Using LMDB with the MDB_WRITEMAP flag can in principle avoid those mallocs, as it writes dirty pages directly into the mapped file (see the sketch after this list). There is a drawback: it allocates a DB file of the maximal mmap size (i.e. 256GB on a 64-bit system), which is kind of weird. It does use "sparse" files, which don't physically take that much space on disk, but not all OSes (e.g. OSX) support them.
  • Another (yet related) issue: a single 10MB plain-text file (which is the upper threshold, above which we ignore such files) - it can be generated using `base64 /dev/urandom | head -c 10000000 | sed 's/[+/]/ /g' > file.txt` - takes almost 300MB of memory to index. If there are 40 of those in a batch, baloo_file_extractor will go mad: the transaction is just way too large. Lots of txt files combined with that leak will simply kill a workstation. Possible workarounds:
    • Make the batch size smaller
    • Lower the threshold of plain-text files
    • Let baloo_file_extractor itself decide the size of a batch, so that the transaction does not become too large (probably the best solution here)
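To make the MDB_WRITEMAP bullet above concrete, here is a hedged sketch of how the environment would be opened with that flag; whether its trade-offs are acceptable for Baloo is exactly the open question. The path and map size are placeholder values.

```cpp
// Sketch only: opening an LMDB environment with MDB_WRITEMAP, so that dirty
// pages are written into the memory map instead of heap copies made via
// mdb_page_malloc. Path and map size are placeholder values.
#include <lmdb.h>

MDB_env *openWritemapEnv(const char *path)
{
    MDB_env *env = nullptr;
    if (mdb_env_create(&env) != MDB_SUCCESS) {
        return nullptr;
    }
    // With MDB_WRITEMAP the DB file is created at (up to) the full map size;
    // on filesystems with sparse-file support it only occupies what is used.
    mdb_env_set_mapsize(env, size_t(256) * 1024 * 1024 * 1024);   // 256 GiB
    if (mdb_env_open(env, path, MDB_WRITEMAP, 0664) != MDB_SUCCESS) {
        mdb_env_close(env);
        return nullptr;
    }
    return env;
}
```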
poboiko created this task.Oct 15 2018, 4:44 PM
bruns added a comment.Oct 15 2018, 7:14 PM

I think most of the mentioned "leaks" are really memory usage. If a large file is indexed, the contents (extracted plain text) are temporarily held in memory.

When the mmapped pages are written back to disk, these should be essentially free - still in memory, but no longer dirty. LMDB also needs some additional memory for its housekeeping (you don't want to rebuild e.g. the freelist from the on-disk image on every transaction). It also needs memory as scratch space for write transactions - this space is probably not given back to the system but kept around, waiting to be reused by the next write transaction.

baloo_extractor_process exits after being idle for some time, so this memory is also returned.

Finished transactions are automatically "spilled" to disk by the kernel, as the whole DB is MMAPed. LMDB never does this explicitly, the kernel does this automatically.

MDB_WRITEMAP is quite dangerous, as it exposes the DB memory as writable memory - any stray write can hit *anywhere* in the database.

Base64-encoded files should be omitted from the index altogether - both plain Base64-encoded files and `gpg2 -a ...` output are meaningless to index.

Base64 output is also extremely unfriendly to the DB, as it contains many random terms which fill it up. Base64-encoded data is not representative of natural-language text, which has a significantly smaller number of distinct terms.

Here is some test data for my (rather simple) setup.

Memory consumption statistics for a "first run":


The setup is a pretty simple one: about ~10000 documents, ~4GB in total.
As you can see, dirty memory (which is mostly heap) is ~280MB - almost the size of the whole index (without the freelist), i.e. ~70% of it. I would say that's too much memory just for book-keeping!

Here's another run, with massif on the same setup:


Memory usage is much smaller (~90MB in the end; I don't really know why), but it shows that this memory is mostly consumed by LMDB itself, and it grows steadily.

In T9873#164218, @bruns wrote:

I think most of the mentioned "leaks" are really memory usage. If a large file is indexed, the contents (extracted plain text) are temporarily held in memory.

Sure, the part of the "braindump" related to plain-text indexing is not a memory leak. The point is that memory usage there can still be pretty high, and it would be nice to find a way to restrict it somehow.
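One possible way to restrict it - sketched below with entirely hypothetical names, nothing like this exists in Baloo today - is to cap a batch by the amount of extracted text rather than by a fixed count of 40 files:

```cpp
// Hypothetical sketch: bound a batch by extracted-text size instead of a fixed
// 40-file count. extractPlainText() and commitBatch() are placeholders for
// whatever the extractor actually does; the 32 MiB budget is an arbitrary example.
#include <QByteArray>
#include <QList>
#include <QPair>
#include <QString>
#include <QStringList>

QByteArray extractPlainText(const QString &file);                  // placeholder
void commitBatch(const QList<QPair<QString, QByteArray>> &batch);  // placeholder

static constexpr qint64 kBatchByteBudget = 32 * 1024 * 1024;

void indexFiles(const QStringList &files)
{
    QList<QPair<QString, QByteArray>> batch;
    qint64 batchBytes = 0;

    for (const QString &file : files) {
        const QByteArray text = extractPlainText(file);
        batch.append(qMakePair(file, text));
        batchBytes += text.size();

        // Commit as soon as the accumulated text crosses the budget, so a few
        // huge plain-text files cannot blow up a single transaction.
        if (batchBytes >= kBatchByteBudget) {
            commitBatch(batch);
            batch.clear();
            batchBytes = 0;
        }
    }
    if (!batch.isEmpty()) {
        commitBatch(batch);
    }
}
```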

baloo_extractor_process exits after being idle for some time, so this memory is also returned.

I believe those leaks are mostly related to the "first run", when it has to work for quite a long time.

Finished transactions are automatically "spilled" to disk by the kernel, as the whole DB is MMAPed. LMDB never does this explicitly, the kernel does this automatically.

I don't know much about the internals of LMDB, but it looks like there are mdb_page_{malloc,new,touch,spill} functions, which actually do some of this. And, well, it eats quite a lot of memory (see above)!

MDB_WRITEMAP is quite dangerous, as it exposes the DB memory as writable memory - any stray write can hit *anywhere* in the database.

Cannot object to that. I don't like that idea either.

Base64-encoded files should be omitted from the index altogether - both plain Base64-encoded files and `gpg2 -a ...` output are meaningless to index.

True. But it would be nice if there were a reliable way to decide whether a plain-text file is meaningless or not.
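For what it's worth, one crude heuristic could look at term diversity and token length - natural text repeats short words, while base64-ish blobs consist of long, mostly unique tokens. The sketch below is purely hypothetical (the function name and thresholds are made up), and whether any such heuristic is reliable enough is exactly the question:

```cpp
// Hypothetical heuristic, not existing Baloo code: guess whether extracted
// plain text looks like natural language or like an opaque blob (base64,
// ASCII-armored gpg output, ...). Thresholds are arbitrary examples.
#include <QByteArray>
#include <QList>
#include <QSet>

bool looksLikeOpaqueBlob(const QByteArray &text)
{
    if (text.trimmed().isEmpty()) {
        return false;
    }

    const QList<QByteArray> words = text.simplified().split(' ');

    qint64 totalLength = 0;
    QSet<QByteArray> distinctWords;
    for (const QByteArray &word : words) {
        totalLength += word.size();
        distinctWords.insert(word);
    }

    const double avgWordLength = double(totalLength) / words.size();
    const double distinctRatio = double(distinctWords.size()) / words.size();

    // Natural language: short words with lots of repetition.
    // Base64-ish data: very long "words", almost every one of them unique.
    return avgWordLength > 20.0 || distinctRatio > 0.9;
}
```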

Base64 output is also extremely unfriendly to the DB, as it contains many random terms which fill it up. Base64-encoded data is not representative of natural-language text, which has a significantly smaller number of distinct terms.

Sure, that's why I tried it - I intentionally constructed the worst-case scenario. But stuff like that can still happen - for example, Baloo uses the PlainTextExtractor for EPS files (because image/eps inherits text/plain, since those are indeed text files), and if the EPS is quite large, we get the same result: lots of meaningless garbage inside the index, and quite high memory consumption.
(The latter is a bug, but there may be more cases like it - like the `gpg2 -a` output you mentioned - and the chances that we cover all of them are small; yet we don't want Baloo to kill the machine's memory either.)
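As a side note on the EPS case, the inheritance chain can be checked with QMimeDatabase, e.g. with a small standalone tool like the sketch below (how or whether Baloo should turn this into an exclusion rule is a separate question):

```cpp
// Sketch: show which mimetype a file resolves to and whether that type is
// declared as a subclass of text/plain - the reason an EPS file ends up in
// the plain-text indexing path on systems where its type inherits text/plain.
#include <QCoreApplication>
#include <QDebug>
#include <QMimeDatabase>
#include <QMimeType>

int main(int argc, char **argv)
{
    QCoreApplication app(argc, argv);
    if (argc < 2) {
        qWarning() << "usage:" << argv[0] << "<file>";
        return 1;
    }

    QMimeDatabase db;
    const QMimeType mime = db.mimeTypeForFile(QString::fromLocal8Bit(argv[1]));
    qDebug() << mime.name()
             << "inherits text/plain:" << mime.inherits(QStringLiteral("text/plain"));
    return 0;
}
```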