Prevent misdetection of EOF on Linux
Needs ReviewPublic

Authored by Lekensteyn on Mar 8 2017, 10:51 AM.

Details

Reviewers
ossi
hindenburg
Group Reviewers
Konsole
Summary

On Linux, even if slave activity is detected on a master fd, FIONREAD
can still return zero available bytes in the buffer. Previously this
would prevent further signals from being emitted even if new IO arrived
(since the readNotifier is disabled) which in its turn prevented the
Konsole output from being updated.

A race condition in the kernel is suspected, the first known broken
kernel is v4.1.10-89-g5eb491ba5d06.

BUG: 372991

Test Plan

Tested in conjunction with konsole v16.12.1-60-g151215a9 with this debug patch applied:

diff --git a/src/kptydevice.cpp b/src/kptydevice.cpp
index 92b443b..4dfcd02 100644
--- a/src/kptydevice.cpp
+++ b/src/kptydevice.cpp
@@ -273,6 +273,7 @@ struct KPtyDevicePrivate : public KPtyPrivate {
     KRingBuffer writeBuffer;
 };
 
+#include <cstdio>
 bool KPtyDevicePrivate::_k_canRead()
 {
     Q_Q(KPtyDevice);
@@ -310,6 +311,9 @@ bool KPtyDevicePrivate::_k_canRead()
         // select() will trigger again anyway if an EOF condition was found, and
         // only then we will accept it.
         if (!available && !maybeEof) {
+            FILE *fp = fopen("/tmp/EOF", "a");
+            fprintf(fp, "Maybe EOF!?\n");
+            fclose(fp);
             maybeEof = true;
             return true;
         } else if (available) {

Indeed, the /tmp/EOF file is being written and the Konsole output no longer "hangs".

Closing a tab (via EOF in bash or closing the tab) somehow does not trigger this debug print, perhaps the watcher is disabled elsewhere. (This is an observation and not a problem.)

Diff Detail

Repository
R291 KPty
Lint
Lint Skipped
Unit
Unit Tests Skipped
Lekensteyn created this revision.Mar 8 2017, 10:51 AM
Restricted Application added a project: Frameworks. · View Herald TranscriptMar 8 2017, 10:51 AM
Restricted Application added a subscriber: Frameworks. · View Herald Transcript
hindenburg edited edge metadata.Mar 8 2017, 1:57 PM

Thanks, this needs to get reviewed by the framework team that handles KPty.

cfeck added a reviewer: ossi.Mar 17 2017, 4:47 PM
ossi edited edge metadata.Mar 17 2017, 5:03 PM

please verify that this is in fact a kernel bug (include ml references), not something that should be expected for some weird reasons (compare the solaris path).
if this is a bug, does it actually affect released versions which are not being "upgraded away" on short notice? i wouldn't want a workaround for a bug that had a very short life span.

src/kptydevice.cpp
307

this should be constrained more - affected versions, and the fact that it's a bug.

315

need no else after return.

In D4975#95727, @ossi wrote:

please verify that this is in fact a kernel bug (include ml references), not something that should be expected for some weird reasons (compare the solaris path).

I suspect it is a kernel bug, but have no ml references. It seems to be a race condition for which it is hard to prove that it does not exist.
See https://bugs.kde.org/show_bug.cgi?id=372991#c10, poll returned but FIONREAD outputted 0 bytes.

if this is a bug, does it actually affect released versions which are not being "upgraded away" on short notice? i wouldn't want a workaround for a bug that had a very short life span.

This issue has been there for several months (maybe even a year) and I regularly (multiple times a week) hit this condition when interrupting a program (today it even happened when interrupting a git push which was waiting for the SSH passphrase).

I have tried to pinpoint the root cause (but failed even after spending hours), what else should I do or can I try?

src/kptydevice.cpp
307

Constrained in what sense? Should the first known broken kernel version be noted here?

ossi added a comment.Mar 17 2017, 10:46 PM

first try googling it; maybe there is even a response on stackoverflow (as so often happens).
then try posting it yourself to stackoverflow.
dunno whether such questions are welcome on the linux-kernel list, but if not, there might be other lists where it would fit.

src/kptydevice.cpp
307

basically what you put in the commit message.

In D4975#95754, @ossi wrote:

first try googling it; maybe there is even a response on stackoverflow (as so often happens).

Was one of the first things to try, that only pointed to an issue in Emacs and another that clarified that FIONREAD does not work for things like UDP sockets.

then try posting it yourself to stackoverflow.
dunno whether such questions are welcome on the linux-kernel list, but if not, there might be other lists where it would fit.

I doubt that SO would know the answer to this, *some* Linux kernel list would probably work, but then a stripped down testcase is likely needed and I don't really feel like wasting much more time on this. The problem is real.

ossi added a comment.Mar 17 2017, 11:09 PM

yeah, but without knowing the root cause, your fix may be just wrong, or at least the comment/commit message may be misleading.
also, https://community.kde.org/Policies/Commit_Policy#Don.27t_commit_code_you_don.27t_understand

(FWIW, I ran a patched kpty 5.32.0-1 version for a while and didn't have the hang issue until I recently upgraded to an unpatched 5.33.0-1. Haven't had a moment so far to dig deeper into the issue though.)