Change DocumentID
Open, Needs TriagePublic
Actions

Description

The problem:

frameworks/baloo/src/engine/idutils.h:106
inline quint64 statBufToId(const QT_STATBUF& stBuf)
{
    // We're loosing 32 bits of info, so this could potentially break
    // on file systems with really large inode and device ids
    return devIdAndInodeToId(static_cast<quint32>(stBuf.st_dev),
                             static_cast<quint32>(stBuf.st_ino));
}

This id is central and used to access info in several databases.
An id of the backend db looks like this

typedef struct MDB_val {
	size_t		 mv_size;	/**< size of the data item */
	void		*mv_data;	/**< address of the data item */
} MDB_val;

and is used e.g. like this

MDB_val key;
key.mv_size = sizeof(quint64);
key.mv_data = static_cast<void*>(&docId);

This needs to change see T7860.
Suggested approach:

Aliases:
- using DocId = quint64;
- using Inode = quint32;
- using DeviceId = quint32;
- replace quint64 with DocId ~528 times
- substitute quint32 with Inode or DeviceId
- #include idutils.h where needed
  
  This diff will be huge, but easy to review
  
  These changes should be neutral. Are they really?
create create class DocId
- 2 members: DeviceId and Inode
- multiple constructors and operator implementations
- Remove typedef quint64 DocId;
  
  This should still be neutral. Is it really? Does this change affect performance?
centralize reading and writing of ids in DocId class
- Hopefully this will offer the oportunity to change e.g. Inode to 64bit and a better DeviceId while retaining compatibility to existing databases with 64bit DocIds. I have the feeling that is possible, but not idea how, yet. Theoretically DocIds with different sizes can coexist within a database. However, trouble comes with code like this:

MDB_val val;
int rc = mdb_get(m_txn, m_dbi, &key, &val);
QVector<quint64> list(val.mv_size / sizeof(quint64));
memcpy(list.data(), val.mv_data, val.mv_size);

Finally, in a distant future:

Change DocId creation to be really unique.

There's a dilemma lurking here. Diffs should be small and digestable but the whole change should be atomic.

Related Objects

Mentioned In: T11861: Proposal: How to solve the 64-bit DocId conflict between ino and st_dev
D11038: balooctl: Add clean option to remove stale file index entries
D10807: fileindexerconfig: Exclude Vaults by default
D10829: [WIP] Use DocumentId class
D10826: [WIP] Introduce DocumentId class
Mentioned Here: T7860: Handling of removable drives and mounted shares

michaelh created this task.Feb 23 2018, 10:50 AM

michaelh claimed this task.Feb 23 2018, 11:52 AM

To assess the problem, you could static_assert that st_dev and st_ino are size 4 or less. That would tell you if any system right now is losing information. (Typo in the comment in statBufToId() btw).

Adding the aliases (it's just using DocId = quint32;) is straightforward and improves the readability of the code, IMO.

Then the real trick comes in expoanding DeviceId to refer correctly to external devices. As for step 2, before such an expansion, your class can have a single quint64 data member and an operator quint64 so that using it is transparent to existing code that expects the integer representation.

@adridg: To be sure: using DocId = quint32; is a typo? You meant using DocId = quint64; instead?

michaelh updated the task description. (Show Details)Feb 23 2018, 3:08 PM

michaelh updated the task description. (Show Details)

In T8054#129949, @michaelh wrote:

@adridg: To be sure: using DocId = quint32; is a typo? You meant using DocId = quint64; instead?

I think he was referring to the spelling of losing. It's written loosing in the comment.

michaelh mentioned this in D10826: [WIP] Introduce DocumentId class.Feb 25 2018, 11:11 AM

michaelh mentioned this in D10829: [WIP] Use DocumentId class.Feb 25 2018, 11:16 AM

michaelh mentioned this in D10807: fileindexerconfig: Exclude Vaults by default.Feb 25 2018, 2:40 PM

michaelh mentioned this in D11038: balooctl: Add clean option to remove stale file index entries.Mar 5 2018, 8:59 AM

sizeof(st_ino) <= 4 is obviously a no-go, this would break at least on current Linux distributions immediately.

Using the inode as a unique id (which it is, per filesystem) has pros and cons:

+ renaming does not change the inode
+ hardlinks are only indexed once
- AFAIK only one path per hardlink is shown

In T8054#135850, @bruns wrote:

- AFAIK only one path per hardlink is shown

$ echo ottoanna>a.csv;  echo ottoanna>b.csv;                                                                                                                                                  
$ ln a.csv aa.csv                                                                                                                                                        
$ balooctl index *                                                                                                                                                       
Indexing /tmp/xxx/aa.csv                                                                                                                                                                         
Skipping: /tmp/xxx/a.csv Reason: Already indexed                                                                                                                                                 
Indexing /tmp/xxx/b.csv                                                                                                                                                                          
File(s) indexed                                                                                                                                                                                  
$ baloosearch ottoanna
/tmp/xxx/b.csv                                                                                                                                                                                   
/tmp/xxx/aa.csv
Elapsed: 0.482286 msecs

So the pros win by 67%.
If we were to move away from inode as (part of) an id we would need an extra database for the mapping. IMO it's better to use a separate database for the names of an inode. Maybe this could also be used to improve baloo's suboptimal handling of symbolic links.
(I very much like baloo's approach to use as much info as possible from filesystem and prefer to keep it that way.)

hurikhan77 mentioned this in T11861: Proposal: How to solve the 64-bit DocId conflict between ino and st_dev.Oct 11 2019, 10:21 PM

Change DocumentIDOpen, Needs TriagePublicActions

Description

Related Objects

Change DocumentID
Open, Needs TriagePublic
Actions