Add metadata extraction plugin for ODF files
ClosedPublic

Authored by kossebau on Jun 20 2018, 3:56 PM.

Details

Summary

Allows to use metadata from ODF files for defining new filenames.

Test Plan

Using [odf*] tokens in the patterns for the new filename results in usage of
the related strings in the ODF file.

Diff Detail

Repository
R758 KRename: Bulk File Renaming
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
kossebau requested review of this revision.Jun 20 2018, 3:56 PM
kossebau created this revision.

Mainly inspired from the PDF variant.

I kept the QString("") as I was not sure if there are any checks which make a difference between isEmpty() and isNull(). So code is consistent for now, and whoever will clean up things can then create proper modern code.

vandenoever requested changes to this revision.Jun 20 2018, 5:33 PM

Nice addition! You might want to apply some of the suggested changes.

src/odfplugin.cpp
109

Check return value. false means that a parse error occurred.

src/odfplugin.h
32

This appears to be text from a template.

72

All of these can be const and static in the cpp file in an anonymous namespace. No need for them in each plugin instance.

This revision now requires changes to proceed.Jun 20 2018, 5:33 PM

@vandenoever Thanks for quick turn-around review. No comment on the ODF side of things? :) That's what I hoped for, if there was anything else that could be added (or is wrong). Besides making an ODF fellow aware of this new ODF related feature ;)

src/odfplugin.h
32

I blame the pdf plugin where I copied this from ;) Guess I can just remove the API dox completely. given the interface documents things.

72

static const strings is what I had. Until I realized that it seems standard to do case-insensitive token comparison in all plugins. To avoid doing lots of compare(token, Qt::CaseInsensitive) or doing manual lowercase string duplicates, I had the idea to have the strings initally set to their cased text, set up everything with that and then turn them to lowercase, for usage in the token matching.

And given there is only one plugin instance per process, in the end there should be only one QString instance per token any way.

I agree it's not common coding :) But should be less symbols in the end. Guess worth a code comment?

kossebau updated this revision to Diff 36399.Jun 20 2018, 6:28 PM

Update to Jos' first feedback

kossebau marked 2 inline comments as done.Jun 20 2018, 6:37 PM

Cool idea and patch!

And yes, I freely admit that there is some cruft in the codebase which will need cleaning up in the future.

CMakeLists.txt
48

Would be nice to add a PURPOSE to properties, especially because the connection KF5Archive <-> ODF might not be that obvious.

src/odfplugin.cpp
45

There's a typo in the string.

src/odfplugin.h
72

Hmm, there's token = filenameOrToken.toLower() though

kossebau updated this revision to Diff 36401.Jun 20 2018, 6:47 PM
kossebau marked 2 inline comments as done.

Update to Heiko's first feedback

kossebau added inline comments.Jun 20 2018, 6:50 PM
src/odfplugin.h
72

Yes, that is the counterpart of the case-insensitive comparison done. This ensures that both the token-tested-against and the referene token are normalized to lowercase, so a normal equals comparison can be done.

This part of the code is what I copied from the other plugins, so just done for consistency :) If you prefer, happy to change to something more future-proof.

72

"reference tokens"

Cool idea and patch!

Happy you like it,

And yes, I freely admit that there is some cruft in the codebase which will need cleaning up in the future.

I would have clang-tidy & Co handy to do some simple common modernization, if you like (done already a few time elsewhere this year). Just tell me "Go" and I would run nullptr & override at least and commit to master (after manual check) :)

vandenoever requested changes to this revision.Jun 20 2018, 7:32 PM

I added a comment that was apparently lost in my first review.

src/odfplugin.cpp
116

QDomELement::tagName is not namespace aware. So if the namespace prefix is different, this code does not work. How about this?

c++
QDomElement docElem = metaData.documentElement();
if (token == m_subjectToken) {
    return docElem.elementsByTagNameNS(dc, "subject").item(0).text();
} else if (token == m_titleToken) {
    return docElem.elementsByTagNameNS(dc, "title").item(0).text();
} etc.
This revision now requires changes to proceed.Jun 20 2018, 7:32 PM

Apart from that I do not see any bugs, just differences in taste which I commented on and you can take as you see fit.

kossebau added inline comments.Jun 20 2018, 7:44 PM
src/odfplugin.cpp
116

And there I feel getting slapped over rather thoughtlessly letting myself inspire from the code of KFileMetaData's ODF extractor plugin for some afternoon hack quickie, ignoreing all the webodf/calligra xml parsing experience (or rather showing that I lost it :( ).

Having a look later in the evening when enough concentration.

Don't feel bad. I'm not even sure what this patch will do. I guess KRename has some renaming rules where you can use extracted information.

I would have clang-tidy & Co handy to do some simple common modernization, if you like (done already a few time elsewhere this year). Just tell me "Go" and I would run nullptr & override at least and commit to master (after manual check) :)

Sure, go!

Don't feel bad.

Only as much as you uncovered the "& dirty" in this so far only called "quick patch". Your proposed algorithm (if tokenA return valueFromXPathA; if tokenB return valueFromXPathB;) is more sane than doing all the comparison against any possible token in each loop in iterating the DOM. I should have spent more though on this instead of mainly copying the basic logic from the kfilemetadata extractor, which has a different purpose (collecting data) so that iteration makes sense there.

Too eager for quickly reaching "look bros, it works", need to behave better :)

Though now I am puzzled why QDomDocument & friends seems to ignore all the namespaces for me, any calls with NS yield no results and also querying the nodes directly gives empty strings for localName() and namespaceURI()... something is fishy here, need to do some fully calm poking.

I'm not even sure what this patch will do. I guess KRename has some renaming rules where you can use extracted information.

Correct.

After having discovered this and played with it, also by this ODF support patch, one gets quite some ideas how this could be integrated more with the workspace filemetadata system, instead of duplicating the extraction of embedded metadata once more. Left for another life or person though :)

QDomDocument is lazy, you have to explicitly ask for namespace processing.

http://doc.qt.io/qt-5/qdomdocument.html#setContent

src/odfplugin.cpp
108

Say yes if you want namespaces.

bool success = metaData.setContent(static_cast<const KArchiveFile*>(metaXml)->data(), true);

QDomDocument is lazy, you have to explicitly ask for namespace processing.

http://doc.qt.io/qt-5/qdomdocument.html#setContent

Thanks, having slept well with refreshed brain I had found the enlightenment instantly this morning :)

Though had not yet updated this patch, as the current code with the elementsByTagNameNS() feels fragile to me, as it does not protect against possible custom extension of the scheme where people might reuse tag names. Never seen, but now that I started to do things more correctly with preparing for non-common namespace prefixes, I also am curious how to ensure getting the element from the correct DOM position :) Any hints? Have not yet looked further than the QDom* API, without any promising things seen.

So possibly would end up with the proposed heuristic approach, just still feeling dirty ;)

vandenoever added a comment.EditedJun 21 2018, 5:38 PM

If you want to be really correct, you have to remove foreign elements first or parse in a way that you ignore foreign elements.

To find the right place in the hierarchy, you can check that the parent is the document element.

QDomElement docElem = metaData.documentElement();
if (token == m_subjectToken) {
    auto n = docElem.elementsByTagNameNS(dc, "subject").item(0);
    if (n.parent() == docElem) {
        return n.text();
    }
} else if (token == m_titleToken) {
    auto n = docElem.elementsByTagNameNS(dc, "title").item(0);
    if (n.parent() == docElem) {
        return n.text();
    }
}
kossebau updated this revision to Diff 36525.Jun 22 2018, 3:54 PM

do more namespace-safe iterating through the metadata xml

Quite a shame these old KDE docs how to workaround Qt4 deficits seem still
valid:
https://techbase.kde.org/Development/Tutorials/QtDOM_Tutorial#Loading_XML_documents_with_namespaces_using_Qt

The new iteration code should be rather safe against custom scheme extension,
right?

Looks good. Certainly good enough. I think it's readable.

vandenoever accepted this revision.Jun 22 2018, 10:02 PM
This revision is now accepted and ready to land.Jun 22 2018, 10:02 PM

Thanks for review and nudging me into better code, Jos :)
@heikobecker Any more things you would like to see improved, or can you accept this latest patch version as well now?

heikobecker accepted this revision.Jun 24 2018, 8:00 PM

@heikobecker Thanks for review as well :)

This revision was automatically updated to reflect the committed changes.