Improve Scheduler robustness against INDI disconnections
ClosedPublic

Authored by TallFurryMan on Aug 21 2018, 7:16 AM.

Details

Summary

In the case Ekos loses connection to INDI during the shutdown procedure, bypass parking procedure and proceed to execute the shutdown script.
When a DBus error occurs while trying to control INDI devices (slewing/tracking, guiding, focusing or capturing), abort the current job, disconnect INDI (in terms of state machine) and stop Ekos.
Make Scheduler timer verify Ekos and INDI state, so that communication failures may be recovered from immediately by restarting Ekos and restarting INDI.

A few situations can lead to INDI disconnections:

  • A transitory network issue that closes the TCP stream, in which case reopening it returns to normal state (either with the running job continuing, or the running job aborted).
  • A serious network issue that prevents access to the INDI server, in which case Ekos will fail to restart and the Scheduler will stop.
  • A crash of one of the drivers, in which case Ekos might be able to reconnect on a new instance of the driver, or will loop trying to use the missing driver until it comes up again.

Obviously, it is difficult to properly handle all situations.
For instance when capturing, it may happen that the CCD driver remains in capture mode, with Ekos not being able to recover.
It may happen that the disconnection does not trigger a DBus error, but is caught while the Scheduler is checking the state of the job.
In that situation, Ekos might keep a particular state of control of a feature, but the crash might reset the properties of this feature, that state becomes invalid and unusable.

Because this robustness improvement only triggers when a communication error occurs, it is not expected to have side-effects on the normal behavior of the Scheduler.

Another issue is currently preventing all combinations of tests from being processed: the Profile field of the scheduler job is not properly handled.
It is currently not possible to have different scheduler jobs using different profiles, and once it uses a particular profile, the Scheduler is unable to switch to another by itself.

An additional issue on parking states was fixed in this differential, because the mitigation process made it very clear and easy to reproduce. The temporary workaround that was to try to unpark again when the mount was found unparked when slewing, was removed.

Test Plan

Create a scheduler job using the Simulator, with Tracking enabled, to give the tester time to kill the Simulator server.

Example of test session:
Start the Scheduler, and when it connects and starts to slew, use a terminal to find the PID of the INDI server ("ps -aef | grep indiserver") and kill it ("kill <pid>").
Ekos will immediately register the disconnection, but unfortunately will not tell the Scheduler about it.
Without the fix, the Scheduler is hung waiting for the slew to finish and must be stopped manually.
With the fix, the Scheduler notices the DBus communication error, aborts the running job and attempts to restart Ekos and reconnect to INDI.
Several test runs are needed to kill the Simulator during different stages of the job execution.

Testing systematically can be done with a parallel command such as the following

$ while true ; do sleep 1 ; if ps -aef | grep indiserver | grep -v grep ; then sleep <delay> ; killall indiserver ; fi ; done

In which <delay> is the delay to wait until the indiserver is killed.
Tested OK with various delays, killing indiserver while slewing, focusing, guiding, etc.
Raised a crash issue when killing a local indiserver while the INDI interface is talking through a pipe as https://bugs.kde.org/show_bug.cgi?id=397774

Diff Detail

Repository
R321 KStars
Branch
bugfix__shutdown_parking_with_no_indi (branched from master)
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 2144
Build 2162: arc lint + arc unit
TallFurryMan created this revision.Aug 21 2018, 7:16 AM
Restricted Application added a project: KDE Edu. · View Herald TranscriptAug 21 2018, 7:16 AM
Restricted Application added a subscriber: kde-edu. · View Herald Transcript
TallFurryMan requested review of this revision.Aug 21 2018, 7:16 AM

When the scheduler attempts to connect to INDI again after a disconnection, is indiConnectionFailureCount reset? Will scheduler stop after it exhausts attempts MAX_FAILURE_ATTEMPTS or keeps looping forever?

kstars/ekos/scheduler/scheduler.cpp
3181–3186 ↗(On Diff #40115)

Code appears to be repeated multiple times. Maybe make it into a function with descriptive name?

The clear duplication of code is intentional, so that we can refactor at a later step. There are certainly other locations that could benefit from that scheme and that I did not spot yet.

From the execution flow, yes, reconnecting INDI does honor the failure counter. However, it is not possible to test this with the Simulator as that server always starts immediately and successfully. Admittedly, I should have tested on a real setup that I could disconnect.

So, tested with regular remote indiserver, and result is not that positive, but not negative neither.

First, cutting the network off has no particular impact on the INDI connection. What it does is it freezes interactions with drivers. If the mount is slewing, the dbus error makes the Scheduler disconnect the server. If the CCD is capturing, absolutely nothing happens, and the exposure counter just stays there at the same value forever. The INDI connection is never tested by Ekos.

Second, stopping the remote server is properly registered by Ekos and the INDI connection is closed. Scheduler is seeing this, and retries to connect through Ekos. You were referring to indiConnectionFailureCount, but this is not involved there. That count is only used when connecting devices. Ekos doesn't attempt to connect more than once.

So basically, differential works, but more safeguards need to be installed. Capture probably needs a timeout for instance, as it doesn't seem to be polling the driver, but merely receiving notifications.

TallFurryMan planned changes to this revision.Aug 22 2018, 9:46 PM

Adding Ekos failure count too. A few more tests showed that the state of the current job remains at "aborted" after the first disconnection, which is incorrect.

  • Add a failure count to Ekos connection.
  • Refactor connection loss mitigation, keep current job running during loss.
  • Test connection before mitigating loss.
  • Apply connection loss mitigation to more DBus errors.
  • Adding logs while testing mitigation.
  • Check Ekos and INDI states before mitigating loss of connection.
  • Fix issue on parking state, which was worked around before but now is clear.
  • Remove workaround managing unexpected park state while slewing.
TallFurryMan edited the summary of this revision. (Show Details)Aug 23 2018, 7:07 AM
TallFurryMan edited the test plan for this revision. (Show Details)
mutlaqja accepted this revision.Aug 23 2018, 9:00 AM
This revision is now accepted and ready to land.Aug 23 2018, 9:00 AM
This revision was automatically updated to reflect the committed changes.

I realize my test plan is lacking one item: I did not test that the job aborting because of twilight was properly stopping actions and guiding.
The code flow is doing that, but the test is missing from my checklist.