Thumbnail text: use libmagic to detect encoding
Needs ReviewPublic

Authored by meven on May 3 2020, 10:24 AM.

Details

Reviewers
sitter
ngraham
Group Reviewers
Frameworks
Summary

libmagic is the libraray used in file utility to search for file encoding based on heuristics.

QTextCodec::codecForUtfText is limited to detecting UTF-8 BOM encoded files.
If this fails QTextCodec::codecForLocale is used to decode all text files, which can be quite incaccurate.

When QTextCodec::codecForUtfText did not find a valid UTF-* file, use libmagic to find a TextCodec.
It can better confirmm UTF-8 presence.
latin-1 is used as default 8-bit ascii codex when libmagic cannot find a precise result.

BUG: 316390
FIXED-IN: 20.08

Test Plan

In doplhin, see preview of different non utf-8 encoded text files.

Diff Detail

Repository
R320 KIO Extras
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 26261
Build 26279: arc lint + arc unit
meven created this revision.May 3 2020, 10:24 AM
Restricted Application added projects: Dolphin, Frameworks. · View Herald TranscriptMay 3 2020, 10:24 AM
Restricted Application added subscribers: kfm-devel, kde-frameworks-devel. · View Herald Transcript
meven requested review of this revision.May 3 2020, 10:24 AM
pino added a subscriber: pino.May 3 2020, 11:07 AM

why not KEncodingProber from the KCodecs framework, like also the commented bits hint about?

meven added a comment.May 4 2020, 8:30 AM
In D29381#662084, @pino wrote:

why not KEncodingProber from the KCodecs framework, like also the commented bits hint about?

I tested first KEncodingProber, it just returns bad data, returning "UTF-8" sometimes for what it is not, or "windows-something" for what is ISO-8859.

I have zero background knowledge here, but it really feels like there must be an existing solution to this problem other than libmagic. Like how does kate figure out the encoding of a text file.

cmake/Findlibmagic.cmake
1

bad copypaste

64

It's more idiomatic to define an imported target, see some of the newer finders in ECM.

I do also rather wonder if we couldn't just use FindPkgConfig's imported target, in theory pkgconfig also works on windows. I may well be ignorant of the finer points of windows support though.

thumbnail/textcreator.cpp
38

TBH, I would make libmagic required for building the thumbnail plugin. I can't see much of a rationale for why we'd want to support "broken"/insufficient encoding detection when there's code that makes it better.

65

* on wrong side of space

67

better name than m? (:

69

excess whitespace towards the end. I also wonder if qfile::encodename would be better

70

I guess you could just toUpper on a QBA instead of going through a temporary QString since ret is an encoding identifier ajnd would be always an ascii string.
Also, can be const it seems.

78

unkwnwn typo

meven added inline comments.May 5 2020, 7:56 AM
thumbnail/textcreator.cpp
38

Without libmagic, it is current state basically UTF-8 with bom detection otherwise local codec.

I did not test exhaustive encodings so I wanted to let the door open for users to not rely on libmagic.
libmagic works well from what I've tested but I could not be absolutely sure for the multiple encodings out there.
Hopefully libmagic does a better job detecting UTF-8 (which I saw) but for users not using much UTF-8...

And libmagic loads a 5M file storing its heuristics each time it loads ( /usr/share/misc/magic.mgc ).
It would be great to keep this in memory somewhere, maybe a static.

sitter added inline comments.May 5 2020, 2:07 PM
thumbnail/textcreator.cpp
38

Perhaps it'd make sense to refactor this a bit and construct some test cases around encoding detection so we get a sense of reliablity?

The way I am looking at this: either libmagic always does the best job at detecting encodings, at which point we'll want it as a required dep, or there's something better in which case we don't want libmagic at all and instead use the something better ;)

In the end the user isn't necessarily in charge of what a random file will be encoded with, so I don't think there's a point in letting the user (or the distro) build an inferior product by accidentally not including libmagic. The truth is neither we nor the user can with any certainty say what encodings the thumbnailer will encounter.

meven added a comment.May 6 2020, 7:02 AM
Perhaps it'd make sense to refactor this a bit and construct some test cases around encoding detection so we get a sense of reliablity?

The way I am looking at this: either libmagic always does the best job at detecting encodings, at which point we'll want it as a required dep, or there's something better in which case we don't want libmagic at all and instead use the something better ;)

In the end the user isn't necessarily in charge of what a random file will be encoded with, so I don't think there's a point in letting the user (or the distro) build an inferior product by accidentally not including libmagic. The truth is neither we nor the user can with any certainty say what encodings the thumbnailer will encounter.

Well I am doubting now libmagic is the way to go:
It is quite limited in encoding detection :
https://invent.kde.org/snippets/875
Compared to https://en.wikipedia.org/wiki/Character_encoding#Common_character_encodings

meven updated this revision to Diff 82251.May 8 2020, 10:22 AM

Make Thumbnail ioslave devicePixelRatio capable

meven updated this revision to Diff 83055.May 19 2020, 6:45 AM
meven marked 5 inline comments as done.

Use QByteArray, find typo, code style and naming

LGTM. Seeing as I don't have much background knowledge I'm not comfortable accepting though. I guess if nobody comes up with better options by next week feel free to land.

meven added a comment.May 19 2020, 4:44 PM

I am only half satisfied by the patch.
Mostly because of libmagic magic_load that loads a 5M file each time which is not needed to detect encoding.

I would add some tests before landing anyways.

Browsing the code it looks like it mmaps the file though? And when I add some strategic sleeping I can verify that file goes towards shared memory.

Oh, I suppose the trouble is that load gets called each ::create? Simply wrap it in a cpp class and static scope it to something maybe?

The finder by the way is still in disarray I've noticed. Please look at one of the more modern finds in ECM for reference. Also I think the find_library call is wrong, it should look for 'magic' not 'libmagic'

Each library name given to the NAMES option is first considered as a library file name and then considered with platform-specific prefixes (e.g. lib) and suffixes (e.g. .so).

i.e. either the NAME is libmagic.so (which is a file) or magic (which gets platform adjusted), but probably not libmagic as that is neither a file nor will the platform adjusted liblibmagic.so be one.