baloodb: Add clean command
ClosedPublic

Authored by michaelh on Mar 27 2018, 3:46 PM.

Details

Summary

Remove stale entries from database

This is mostly based on D11038

Depends on D11745

Test Plan
$ baloodb clean  --mounted-only  
Removing stale database contents...
Ignoring device 2086  
Ignoring device 2098
Ignoring device 2069
Ignoring device 2064
Ignoring device 2066
Database has corrupted entries baloo may misbehave, please recreate the DB by running $ balooctl disable && balooctl enable
Removing: device: 2049 inode: 4861 url: /tmp/otto
...

Removing: device: 2053 inode: 39208001 url: //Little Witch Academia - 08.mkv
Removing: device: 2053 inode: 39215245 url: 
Removed 14965 items
Elapsed: 4.6921 secs

Second run:

$ baloodb clean  --mounted-only  
Removing stale database contents...
access filter QFlags(0x4)
Ignoring device 2086  
Ignoring device 2098
Ignoring device 2066
Ignoring device 2064
Ignoring device 2069
Removing: device: 22 inode: 225383 url: 
Removing: device: 2053 inode: 34612126 url: 
Removed 14943 items
Elapsed: 3.47939 secs

Diff Detail

Repository
R293 Baloo
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
michaelh created this revision.Mar 27 2018, 3:46 PM
Restricted Application added projects: Frameworks, Baloo. · View Herald TranscriptMar 27 2018, 3:46 PM
michaelh requested review of this revision.Mar 27 2018, 3:46 PM
michaelh edited the summary of this revision. (Show Details)
michaelh edited the summary of this revision. (Show Details)Mar 27 2018, 3:51 PM
michaelh updated this revision to Diff 30737.Mar 27 2018, 6:01 PM
  • Account for symbolic links
cfeck added a subscriber: cfeck.Mar 30 2018, 11:42 AM
bruns added a subscriber: bruns.Mar 31 2018, 3:01 AM
bruns added inline comments.
src/engine/databasesanitizer.cpp
113 ↗(On Diff #30737)

I think if you initialize fileinfo anyway, you should use fileinfo.exists() ...

316 ↗(On Diff #30737)

I think it is better to use
QT_FSTAT(info.symlink.toLocal8Bit().constData(), ...) here, avoids lots of calls to the database, and guarantees more consistent results - symlinkTarget() works on the filesystem, so should the lookup here

317 ↗(On Diff #30737)

missing check id != 0, or the equivalent if the above code is changed to QT_FSTAT

333 ↗(On Diff #30737)

why not just make the removeDocument transaction above depend on dryRun? removeDocument is expensive ...

src/tools/baloodb/main.cpp
251 ↗(On Diff #30737)

make it depend on dry-run?

michaelh added inline comments.Mar 31 2018, 11:08 AM
src/engine/databasesanitizer.cpp
333 ↗(On Diff #30737)

I did that before, but decided on --dry-run to simulate the process as close as possible.
Also, compared to the time it takes to collect the data, removing the docs is completely neglectible. Committing to the database takes some seconds (on ~30000 entries) again.

src/tools/baloodb/main.cpp
251 ↗(On Diff #30737)

Transaction complains (Q_ASSERT) when trying to remove a document on a readonly db or readonly transaction.
Unless I guard removeDocument with an if (which I prefer not to) both have to be in readwrite mode.

michaelh updated this revision to Diff 31033.Mar 31 2018, 12:18 PM
  • Apply suggested change
michaelh marked an inline comment as done.Mar 31 2018, 12:19 PM
michaelh added inline comments.
src/engine/databasesanitizer.cpp
316 ↗(On Diff #30737)

I've tried it. Sadly your suggestion does not work. With fi = filePathToStat(info.symlink.toLocal8Bit()) fi._st_dev is == 0 when the symlink target does not exist. Hence it does tell me why the link can't be followed. baloo's db otoh knows about this. As deviceIdFilter does never contain 0 symlinks would be removed when they should be ignored. With If(id != 0) truly "dead" symlinks will not be removed.
Maybe I didn't understand what you're suggesting?

filePathToStat returns QT_STATBUF. My guess is that is essentially the same as QT_FSTAT, at least man fstat.2 says so.

bruns added inline comments.Apr 2 2018, 9:44 PM
src/engine/databasesanitizer.cpp
316 ↗(On Diff #30737)

What do you consider a "truly dead" symlink?

If you really want to do it correctly, you have to walk the file system yourself, one symlink target path component at a time - each path component can be a symlink itself, or a mount point. You should check the device id for *every* path component.

michaelh planned changes to this revision.Apr 3 2018, 10:13 AM
michaelh added inline comments.
src/engine/databasesanitizer.cpp
316 ↗(On Diff #30737)

Damn, you're right! I was fooled by this:

$ balooctl index /mnt/otto/test.mp4
$ ln -s /mnt/otto/test.mp4 ~/Videos/

baloo now also indexes ~/Videos/test.mp4
Only in this case m_pimpl->m_transaction->documentId(info.symlink.toLocal8Bit()); will return an id != 0

I did not notice because every indexed symbolic link on my system also has its target indexed. I never saw id == 0

michaelh updated this revision to Diff 31208.Apr 3 2018, 12:05 PM
  • Ignore symbolic links

There seems to be no reliable way to map a symbolic link's target
to a device id. Instead of removing it print a parseable message
so users can remove it manually.

bruns added inline comments.Apr 4 2018, 11:29 PM
src/engine/databasesanitizer.cpp
333 ↗(On Diff #30737)

Fair reason, the comment below is also mood then ...

michaelh added inline comments.Apr 5 2018, 12:40 PM
src/engine/databasesanitizer.cpp
333 ↗(On Diff #30737)

8 meanings of MOOD acronym or abbreviation.
MOOD stands for

Magic of Ordinary Days
Meet only Original Designs
Michigan Out of Doors
Movie Organizer Online Database
Miata Owners on Delmarva
Methodology for Object Oriented Design
Metrics for Object Oriented Design
Museum of Outstanding Design

typo? :-)

bruns added inline comments.Apr 5 2018, 9:07 PM
src/engine/databasesanitizer.cpp
333 ↗(On Diff #30737)

s/mood/moot/

michaelh planned changes to this revision.Apr 12 2018, 3:29 PM
michaelh updated this revision to Diff 32105.Apr 14 2018, 11:13 AM
  • Adapt to upstream changes
bruns requested changes to this revision.Apr 15 2018, 2:33 AM
bruns added inline comments.
src/engine/experimental/databasesanitizer.cpp
165

Why do you remove the private: here?

224

Is this struct still used?

232

duplicates code from idutils.h

363

Restructure:
Iterate over listResult.first to populate with deviceIds
Iterate over result to set each entry to ignore/use.

380

remove sep, use `QStringLiteral(" device: %1") instead

381

dito sep, indent to align << with line above

This revision now requires changes to proceed.Apr 15 2018, 2:33 AM
michaelh updated this revision to Diff 32175.Apr 15 2018, 11:42 AM
  • Improve access filtering
michaelh added inline comments.
src/engine/experimental/databasesanitizer.cpp
165

340: m_pimpl->m_transaction->removeDocument(info.id);
and more

380

I'd like to keep it this way.

  1. For consistency with printList and printDevices
  2. I'm thinking in the opposite direction: Make sep global to easily change it to another value for all output or even add a field-separator getter/setter ('--separator') should the need arise. (in a separate patch, of course)
bruns added inline comments.Apr 15 2018, 9:31 PM
src/engine/experimental/databasesanitizer.cpp
360

do the calls to m_transaction using a wrapper, e.g. DatabaseSanitizer::removeDocument(info.id)

370

dito

src/tools/experimental/baloodb/main.cpp
110

remove empty line

michaelh updated this revision to Diff 32254.Apr 16 2018, 8:05 AM
  • Remove unrelated whitespace change
michaelh updated this revision to Diff 32255.Apr 16 2018, 8:27 AM
  • Wrap transaction operations
src/engine/experimental/databasesanitizer.cpp
360

I'm assuming you meant DatabaseSanitizerImpl::removeDocument(info.id). If you meant DatabaseSanitizer::removeDocument, please explain what the advantage is.

bruns added inline comments.Apr 16 2018, 4:06 PM
src/engine/experimental/databasesanitizer.cpp
187

Are you planning to use this one anywhere else?

Otherwise, inside deviceFilters(...):

auto isIgnored = [&accessFilter](const QStorageInfo& storageInfo) {
    if (storageInfo.isValid() && ...) {
        ...
    }
}
189

bool mounted = storageInfo.isValid()

191

This else is ok, although not necessary, it emphasizes symmetry (mounted, not mounted)

193

Remove the else, and add a newline here

    ...
}

if (storageInfo.fileSystemType() == ...) {
    ...
    return true;
}

return false;
210

->commit()

360

Right.

376

Nitpick - one more space (align " and summary)

bruns added inline comments.Apr 16 2018, 4:36 PM
src/engine/experimental/databasesanitizer.cpp
173

Pass in a List/Set of deviceIds, and return the same type.

380

My preference is:

  • Use one fixed format string for human readable output, which is translatable
  • When the need arises to e.g. have something machine readable, use a different format string.
michaelh updated this revision to Diff 32320.Apr 16 2018, 7:11 PM
michaelh marked 17 inline comments as done.
  • Apply suggested changes
michaelh added inline comments.Apr 16 2018, 7:13 PM
src/engine/experimental/databasesanitizer.cpp
173

Your suggestion in D12222 can be reused. I'll do it there.

187

Are you planning to use this one anywhere else?

Yes. For all output: D12222

bruns accepted this revision.Apr 17 2018, 11:47 PM
This revision is now accepted and ready to land.Apr 17 2018, 11:47 PM
This revision was automatically updated to reflect the committed changes.