Proposal: How to solve the 64-bit DocId conflict between ino and st_dev
Open, Needs Triage, Public

Description

A discussion with Martin Steigerwald produced a great idea, without him realizing it:

We could add another table to the database that maps FS UUIDs to a monotonically increasing counter, let's call it "FsUuidMappingDB". It will use an FS UUID as key, and the counter as value. Each time we detect a previously unknown UUID, we will increment the counter by one and write a new mapping.
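A minimal sketch of such a mapping, in Python purely for illustration. The class name `FsUuidMapping` and the method `id_for` are hypothetical, the dict is only a stand-in for a real database table, and the counter is assumed to start at 0 so that the first filesystem seen gets the value 1.

```python
class FsUuidMapping:
    """In-memory stand-in for the proposed "FsUuidMappingDB" table."""

    def __init__(self):
        self._counter = 0   # last value handed out (assumption: starts at 0)
        self._ids = {}      # FS UUID (str) -> counter value (int)

    def id_for(self, fs_uuid: str) -> int:
        """Return the stable counter value for fs_uuid, creating it if unknown."""
        if fs_uuid not in self._ids:
            # Previously unknown UUID: bump the counter and record the mapping.
            self._counter += 1
            self._ids[fs_uuid] = self._counter
        return self._ids[fs_uuid]
```

A known UUID always yields the same value, no matter how often or in which order filesystems are seen again.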

When we now scan for files, we can use the 64-bit inode number XOR'ed with the bit-reversed UUID-mapping value to create a DocID. I'm not sure how likely it is that this creates a collision at some point, but currently I think it would be much better than using only 32-bit values with unstable 32-bit st_dev numbers. Because the counter is bit-reversed before the XOR, it occupies the high bits of the ID, so each additional counter bit only cuts our inode namespace in half. On a typical system this means we would lose maybe 3-4 bits for the most important filesystems to be indexed. Even if someone swaps a lot of portable disks and those are indexed by baloo, it is unlikely to collide early, because the least-changing bits of the counter would only interfere with the oldest inode numbers, which may no longer be used by the system at all (file changes, rewrites, and package updates will have replaced all those inodes).
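The encoding can be sketched as follows, again in Python for illustration only. The function names `reverse_bits64` and `make_doc_id` are hypothetical, and `fs_id` is assumed to be the small counter value taken from the UUID mapping.

```python
MASK64 = (1 << 64) - 1

def reverse_bits64(value: int) -> int:
    """Reverse the bit order of a 64-bit unsigned integer."""
    value &= MASK64
    result = 0
    for _ in range(64):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

def make_doc_id(inode: int, fs_id: int) -> int:
    """XOR the inode number with the bit-reversed filesystem counter.

    Small counters end up in the highest bits, small inode numbers in the
    lowest bits, so the two only overlap once both grow large.
    """
    return (inode ^ reverse_bits64(fs_id)) & MASK64
```

Collisions remain possible as soon as inode numbers reach into the bits used by the counter: inode `(1 << 63) | 5` on filesystem 1 and inode `(1 << 62) | 5` on filesystem 2 both map to the DocID 5. The XOR result alone also does not tell you where the counter bits end and the inode bits begin.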

T9805 T8066 T8054

Update: Encoding a device ID as outlined above doesn't work well because many functions in Baloo currently expect that they can do a reverse mapping. Maybe it's better to expand the ID storage to 128 bits. The reverse lookup just compares the device ID to the mounted file systems, but st_dev may be unstable across reboots, and even between unmounts and remounts.
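For comparison, a 128-bit ID could keep both parts separate so the reverse mapping stays trivial. This is only a sketch of one possible layout (filesystem counter in the upper 64 bits, inode in the lower 64 bits); the layout and the names `pack_doc_id128` and `unpack_doc_id128` are assumptions, not something Baloo implements.

```python
MASK64 = (1 << 64) - 1

def pack_doc_id128(fs_id: int, inode: int) -> int:
    """Combine a 64-bit filesystem counter and a 64-bit inode into 128 bits."""
    if not (0 <= fs_id <= MASK64 and 0 <= inode <= MASK64):
        raise ValueError("fs_id and inode must each fit into 64 bits")
    return (fs_id << 64) | inode

def unpack_doc_id128(doc_id: int) -> tuple[int, int]:
    """Reverse mapping: recover (fs_id, inode) from a 128-bit ID."""
    return doc_id >> 64, doc_id & MASK64
```

Unlike the XOR scheme, no two distinct (fs_id, inode) pairs can produce the same ID here, at the cost of doubling the key size.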

It looks like the FS UUID can only be looked up by root (it involves libblkid reading the superblock directly). Does the kernel offer a syscall for that, and if not, why? Also, st_dev is not stable on btrfs (and many other FSes that use virtual devices).

The libc docs don't claim st_dev to be stable across reboots or crashes: https://www.gnu.org/software/libc/manual/html_node/Attribute-Meanings.html

It just claims to uniquely identify a file; that doesn't necessarily mean that it is always the same file. Thus, comparing the stat results of two files only lets you safely conclude that they are not the same file when st_dev/st_ino differ, and only within the same boot cycle.
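In code, the only comparison that is safe under this reading looks roughly like the sketch below. The helper name `same_file_now` is hypothetical; the point is that the result describes the current moment and must not be persisted across reboots or remounts.

```python
import os

def same_file_now(path_a: str, path_b: str) -> bool:
    """Compare (st_dev, st_ino) of two paths at this moment.

    The answer is only meaningful within the current boot cycle and while
    the filesystems stay mounted; stored pairs may later refer to other files.
    """
    a = os.stat(path_a)
    b = os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)
```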

Any ideas how to uniquely identify the file system across reboots? We could look up the mount point, but that doesn't really help when different devices become mounted at the same mount point.

hurikhan77 updated the task description. Oct 12 2019, 12:25 PM

Would it be possible to obtain the filesystem UUID through Solid / udisks? Hmmm, udisks goes by disk, not by filesystem. But, well, I have an idea, you can get UUIDs as regular user by looking in /dev/disk/by-uuid.
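A rough sketch of that lookup, in Python for illustration. The function name `uuids_by_device` is hypothetical; it only assumes that the directory contains symlinks named after the UUID and pointing at device nodes, as udev usually sets up on Linux.

```python
import os

def uuids_by_device(by_uuid_dir: str = "/dev/disk/by-uuid") -> dict[str, str]:
    """Map resolved device node paths to filesystem UUIDs.

    Needs no root privileges: it only reads symlink names and targets.
    Returns an empty dict if the directory is missing or unreadable.
    """
    mapping = {}
    try:
        names = os.listdir(by_uuid_dir)
    except OSError:
        return mapping
    for name in names:
        # Each entry is a symlink like <UUID> -> ../../sda1
        device = os.path.realpath(os.path.join(by_uuid_dir, name))
        mapping[device] = name
    return mapping
```

This yields one UUID per block device, so it still cannot tell btrfs subvolumes on the same device apart.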

We need a unique identifier per inode namespace, and for btrfs this is per subvolume and not per disk. Currently, baloo uses QStorageInfo to obtain what it needs. I'll look into Solid and udisks but I fear they will be per disk only, too.