Detect mime type of local files based on their contents
Needs RevisionPublic

Authored by miklosm on Jun 3 2018, 7:51 PM.

Details

Reviewers
dfaure
broulik
Group Reviewers
Frameworks
Summary

KFileItem::determineMimeType() is an explicit order to thoroughly detect the file type, so the result of the previous detections must be ignored. The indentation should be fixed after the if block is removed, I just didn't want to complicate this patch.

I also removed an unneeded const_cast.

CCBUG: 220496

Diff Detail

Repository
R241 KIO
Lint
Lint Skipped
Unit
Unit Tests Skipped
miklosm created this revision.Jun 3 2018, 7:51 PM
Restricted Application added a project: Frameworks. · View Herald TranscriptJun 3 2018, 7:51 PM
Restricted Application added a subscriber: kde-frameworks-devel. · View Herald Transcript
miklosm requested review of this revision.Jun 3 2018, 7:51 PM
apol added a subscriber: apol.Jun 4 2018, 1:11 AM
apol added inline comments.
src/core/kfileitem.cpp
730

I'm not sure that makes sense. The fact that it's a local url doesn't mean it will be fast to read.
FUSE appears as local urls and it's slow.

ngraham added a subscriber: ngraham.Jun 4 2018, 5:13 AM
broulik added a subscriber: broulik.Jun 4 2018, 7:33 AM
broulik added inline comments.
src/core/kfileitem.cpp
730

That and also (broken) NFS mounts. We have that in quite a few places where seemingly innocent local files can freeze the entire UI, so I'd rather not introduce another place where this may happen.

miklosm added inline comments.Jun 4 2018, 9:26 PM
src/core/kfileitem.cpp
730

FUSE appears as local urls and it's slow.

People still use FUSE?? If the connection is lost, it sends all processes trying to access that volume into uninterruptable sleep.

seemingly innocent local files can freeze the entire UI

KFileItem::determineMimeType() promises proper mime type detection, and that can only be achieved by reading into the file. I agree that potentially long operations should be performed in a background thread, but the KFileItem API doesn't allow such a change, so it's the responsibility of the callers of this method.

BTW the currentMimeType() method shouldn't do any detection. It's very misleading, and when there are multiple candidates, choosing the first one on the list is not "accurate" or "real determination".

ngraham edited the summary of this revision. (Show Details)Sep 18 2018, 3:04 AM

@broulik, given your recent crusade against blocking calls, can you think of a better way to do this?

dfaure requested changes to this revision.Sep 18 2018, 7:02 AM

No, no. Too unreliable and against the MIME spec.

You're testing it for the ideal case, images, which have proper headers.
But there's no reliable "magic" (determination from content) to distinguish for example MSWord .doc vs Excel .xls, because it's all the same OLE storage format. Or many other cases like this.
This is why the MIME spec (which is implemented by mimeTypeForUrl) says: if the extension is known and matches a single mimetype, then that's the one.
This allows users to have control, rather than fuzzy algorithms.

Determination from content is only used when there is no extension, when multiple mimetypes are associated with the same extension (example: *.ogg can be audio or video), or when the extension is completely unknown.

If some users want to benefit from magic-mimetype-detection for their images, it's simple, they can just remove all extensions, KDE will take care of the rest.
But for all other cases, we want users to have control over the way their files are detected, and that's what extensions are for.

This revision now requires changes to proceed.Sep 18 2018, 7:02 AM

(BTW thanks for what you said about FUSE, I agree 100% and it makes me glad to see some people from the opposite camp, when so many people are trying to convince me that network mounts via FUSE is the solution to all problems on earth, see bug 75324)

currentMimeType() method shouldn't do any detection. It's very misleading

I disagree. This is what's called determination on demand, quite a classic pattern.

and when there are multiple candidates, choosing the first one on the list is not "accurate" or "real determination".

You skipped the fact that if there are multiple candidates, we first try to sort it out using content detection.
Only if that fails, will we then pick the first mimetype in the list, for lack of any other information. Again, see the MIME spec.

No, no. Too unreliable and against the MIME spec.

You're testing it for the ideal case, images, which have proper headers.
But there's no reliable "magic" (determination from content) to distinguish for example MSWord .doc vs Excel .xls, because it's all the same OLE storage format. Or many other cases like this.
This is why the MIME spec (which is implemented by mimeTypeForUrl) says: if the extension is known and matches a single mimetype, then that's the one.
This allows users to have control, rather than fuzzy algorithms.

Determination from content is only used when there is no extension, when multiple mimetypes are associated with the same extension (example: *.ogg can be audio or video), or when the extension is completely unknown.

I've seen web cms systems where all file downloads are named "download.php". Known extension. Matches a single mime type. Completely wrong.

The case of MSWord vs. Excel is easy: Libreoffice can open both.

If some users want to benefit from magic-mimetype-detection for their images, it's simple, they can just remove all extensions, KDE will take care of the rest.
But for all other cases, we want users to have control over the way their files are detected, and that's what extensions are for.

I accept your reasoning.

There may be an other solution to the problem I was trying to solve with this patch series: remove the jpeg thumbnailer, and let the generic image thumbnailer handle jpeg as well.

I agree, extensions are not reliable over HTTP, which is why mimeTypeForUrl doesn't use them for HTTP urls.
But here we're in KFileItem, so much more likely talking about local files or FTP/SFTP/FISH/SMB/etc. where the *.php issue doesn't happen.

The case of MSWord vs. Excel is easy: Libreoffice can open both.

But not e.g. calligra.
And it's just an example, there are a LOT more. Many mimetypes just don't have any magic at all, which would mean, with this patch, that they would never get detected.

Please note that any CMS that isn't serving files for download properly (ie. not using Content-Disposition and telling the client the name the file should be called) is broken and isn't something we can fix.

There may be an other solution to the problem I was trying to solve with this patch series: remove the jpeg thumbnailer, and let the generic image thumbnailer handle jpeg as well.

I've heard others suggesting the same, and I believe that this solution would be accepted. In fact @broulik may have already been working on it. You might want to coordinate with him.