Change DocumentID
Open, Needs TriagePublic

Description

The problem:

frameworks/baloo/src/engine/idutils.h:106
inline quint64 statBufToId(const QT_STATBUF& stBuf)
{
    // We're loosing 32 bits of info, so this could potentially break
    // on file systems with really large inode and device ids
    return devIdAndInodeToId(static_cast<quint32>(stBuf.st_dev),
                             static_cast<quint32>(stBuf.st_ino));
}

This id is central and used to access info in several databases.
An id of the backend db looks like this

typedef struct MDB_val {
	size_t		 mv_size;	/**< size of the data item */
	void		*mv_data;	/**< address of the data item */
} MDB_val;

and is used e.g. like this

MDB_val key;
key.mv_size = sizeof(quint64);
key.mv_data = static_cast<void*>(&docId);

This needs to change see T7860.
Suggested approach:

  1. Aliases:
    • using DocId = quint64;
    • using Inode = quint32;
    • using DeviceId = quint32;
    • replace quint64 with DocId ~528 times
    • substitute quint32 with Inode or DeviceId
    • #include idutils.h where needed

      This diff will be huge, but easy to review

      These changes should be neutral. Are they really?
  2. create create class DocId
    • 2 members: DeviceId and Inode
    • multiple constructors and operator implementations
    • Remove typedef quint64 DocId;

      This should still be neutral. Is it really? Does this change affect performance?
  3. centralize reading and writing of ids in DocId class
    • Hopefully this will offer the oportunity to change e.g. Inode to 64bit and a better DeviceId while retaining compatibility to existing databases with 64bit DocIds. I have the feeling that is possible, but not idea how, yet. Theoretically DocIds with different sizes can coexist within a database. However, trouble comes with code like this:
MDB_val val;
int rc = mdb_get(m_txn, m_dbi, &key, &val);
QVector<quint64> list(val.mv_size / sizeof(quint64));
memcpy(list.data(), val.mv_data, val.mv_size);

Finally, in a distant future:

  • Change DocId creation to be really unique.

There's a dilemma lurking here. Diffs should be small and digestable but the whole change should be atomic.

adridg added a subscriber: adridg.Feb 23 2018, 1:53 PM

To assess the problem, you could static_assert that st_dev and st_ino are size 4 or less. That would tell you if any system right now is losing information. (Typo in the comment in statBufToId() btw).

Adding the aliases (it's just using DocId = quint32;) is straightforward and improves the readability of the code, IMO.

Then the real trick comes in expoanding DeviceId to refer correctly to external devices. As for step 2, before such an expansion, your class can have a single quint64 data member and an operator quint64 so that using it is transparent to existing code that expects the integer representation.

@adridg: To be sure: using DocId = quint32; is a typo? You meant using DocId = quint64; instead?

michaelh updated the task description. (Show Details)Feb 23 2018, 3:08 PM
michaelh updated the task description. (Show Details)

@adridg: To be sure: using DocId = quint32; is a typo? You meant using DocId = quint64; instead?

I think he was referring to the spelling of losing. It's written loosing in the comment.

bruns added a subscriber: bruns.Mar 31 2018, 12:47 AM

sizeof(st_ino) <= 4 is obviously a no-go, this would break at least on current Linux distributions immediately.

Using the inode as a unique id (which it is, per filesystem) has pros and cons:

  1. + renaming does not change the inode
  2. + hardlinks are only indexed once
  3. - AFAIK only one path per hardlink is shown
In T8054#135850, @bruns wrote:
  1. - AFAIK only one path per hardlink is shown
$ echo ottoanna>a.csv;  echo ottoanna>b.csv;                                                                                                                                                  
$ ln a.csv aa.csv                                                                                                                                                        
$ balooctl index *                                                                                                                                                       
Indexing /tmp/xxx/aa.csv                                                                                                                                                                         
Skipping: /tmp/xxx/a.csv Reason: Already indexed                                                                                                                                                 
Indexing /tmp/xxx/b.csv                                                                                                                                                                          
File(s) indexed                                                                                                                                                                                  
$ baloosearch ottoanna
/tmp/xxx/b.csv                                                                                                                                                                                   
/tmp/xxx/aa.csv
Elapsed: 0.482286 msecs

So the pros win by 67%.
If we were to move away from inode as (part of) an id we would need an extra database for the mapping. IMO it's better to use a separate database for the names of an inode. Maybe this could also be used to improve baloo's suboptimal handling of symbolic links.
(I very much like baloo's approach to use as much info as possible from filesystem and prefer to keep it that way.)