There are several "memory leak" bugs reported in Bugzilla, e.g. 359119, 371659, 380456, 384088, 386791, 394750, etc.
I've actually found that it is somewhat reproducible on my machine: when I perform the initial indexing after balooctl disable && balooctl enable, the private memory grows steadily (from ~10MB at the beginning up to ~300MB at the end). Not fatal - but that's for just several GB of documents/books/other stuff. Other users can have it much worse.
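For reference, this is roughly how I watch the growth. A minimal sketch assuming a Linux /proc filesystem; rss_kb is just a hypothetical helper name, not part of Baloo:

```shell
# Hypothetical helper: print the resident set size (in kB) of a process,
# read from its /proc status file.
rss_kb() {
  awk '/^VmRSS/ {print $2}' "/proc/$1/status"
}

# Example: sample this shell's own RSS. To watch the indexer instead,
# run rss_kb "$(pidof baloo_file_extractor)" in a loop while it indexes.
rss_kb $$
```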
Looking at the code and running it under valgrind gave no results - as far as I can see, there is no direct memory leak there; at least valgrind does not find any memory with no pointers to it.
Most of the memory is on the heap, so I tried massif, which revealed that memory usage grows steadily inside LMDB, in mdb_page_malloc. It looks like these allocations are related to dirty pages (most of them come from mdb_page_touch calls), i.e. pages loaded into memory and not yet written to disk (?)
What follows is a braindump; don't take it too seriously:
- Why doesn't LMDB free those pages after mdb_txn_commit, when it should have written them to disk and can get rid of them? We don't need them anymore. Did they go to the mdb_env dirty list? Why? Maybe, as a quick-and-dirty hack, we can close & reopen the mdb_env? Or even restart the whole baloo_file_extractor process after each batch (40 files)?
- AFAIK, LMDB should start spilling those pages to disk if there are too many of them. I don't actually know the threshold, nor whether the spilling kicks in in our case (maybe I was just below the limit?)
- The dirty list growing out of control can actually cause a crash, which also seems to be quite popular in Bugzilla - bug 389848. The assert that triggers the crash actually tells us that LMDB was unable to add a page to the dirty list; one possible reason is the list being too large.
- Opening LMDB with the MDB_WRITEMAP flag can in principle avoid those mallocs, as LMDB will then write pages directly to the memory-mapped file. There is a drawback: it allocates a DB file of the maximal mmap size (i.e. 256GB on a 64-bit system), which is kinda weird. It does use "sparse" files, which physically don't take that much space on disk, but not all OSes (e.g. OS X) support them.
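To illustrate the sparse-file point (a quick sketch for Linux with GNU coreutils; the 1GB figure is arbitrary, not LMDB's actual map size): the apparent size of a sparse file can far exceed the disk blocks actually allocated to it, which is why a huge MDB_WRITEMAP database is mostly harmless on filesystems that support sparseness.

```shell
# Create a 1GB sparse file: the logical size is 1GB, but almost no disk
# blocks are allocated, so it costs nearly nothing on ext4 and friends.
truncate -s 1G sparse.db
apparent=$(stat -c %s sparse.db)             # logical size in bytes
actual=$(( $(stat -c %b sparse.db) * 512 ))  # bytes actually allocated
echo "apparent=$apparent actual=$actual"
rm sparse.db
```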
- Another (yet related) issue: a single 10MB plain-text file - that's the upper threshold, above which we ignore the file; one can be generated using base64 /dev/urandom | head -c 10000000 | sed 's/[+/]/ /g' >file.txt - takes almost 300MB of memory to index. If there are 40 of those inside a batch, baloo_file_extractor will go mad: the transaction is just way too large. Lots of txt files combined with that leak will just kill a workstation. Possible workarounds:
- Make the batch size smaller
- Lower the size threshold for plain-text files
- Let baloo_file_extractor itself decide the size of a batch, so that the transaction does not get too large (probably the best solution here)
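The worst-case test file mentioned above, wrapped up so the result can be sanity-checked:

```shell
# Generate a ~10MB plain-text file of random "words": head -c cuts the
# base64 stream at exactly 10000000 bytes, and sed replaces '+' and '/'
# with spaces one-for-one, so it tokenizes into many distinct terms.
base64 /dev/urandom | head -c 10000000 | sed 's/[+/]/ /g' > file.txt
wc -c file.txt   # should report 10000000 bytes
```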