Improve Scheduler robustness against INDI disconnections
0fa94d809979

Authored by TallFurryMan on Aug 23 2018, 9:01 AM.

Description

Summary:
If Ekos loses its connection to INDI during the shutdown procedure, bypass the parking procedure and proceed to execute the shutdown script.
When a DBus error occurs while trying to control INDI devices (slewing/tracking, guiding, focusing or capturing), abort the current job, disconnect INDI (in terms of the state machine) and stop Ekos.
Make the Scheduler timer verify the Ekos and INDI states, so that communication failures can be recovered from immediately by restarting Ekos and INDI.

A few situations can lead to INDI disconnections:

A transitory network issue that closes the TCP stream, in which case reopening it returns to normal state (either with the running job continuing, or the running job aborted).
A serious network issue that prevents access to the INDI server, in which case Ekos will fail to restart and the Scheduler will stop.
A crash of one of the drivers, in which case Ekos might be able to reconnect to a new instance of the driver, or will loop trying to use the missing driver until it comes up again.

Obviously, it is difficult to properly handle all situations.
For instance, when capturing, the CCD driver may remain in capture mode, with Ekos unable to recover.
The disconnection may also fail to trigger a DBus error and only be caught while the Scheduler is checking the state of the job.
In that situation, Ekos might keep a particular control state for a feature while the crash resets the properties of that feature, so that the state becomes invalid and unusable.

Because this robustness improvement only triggers when a communication error occurs, it is not expected to have side effects on the normal behavior of the Scheduler.

Another issue is currently preventing all combinations of tests from being processed: the Profile field of the scheduler job is not properly handled.
It is currently not possible to have different scheduler jobs using different profiles: once the Scheduler uses a particular profile, it is unable to switch to another by itself.

An additional issue with parking states was fixed in this differential, because the mitigation process made it very clear and easy to reproduce. The temporary workaround, which tried to unpark again when the mount was found unparked while slewing, was removed.

Test Plan:
Create a scheduler job using the Simulator, with Tracking enabled, to give the tester time to kill the Simulator server.

Example of test session:
Start the Scheduler, and when it connects and starts to slew, use a terminal to find the PID of the INDI server ("ps -aef | grep indiserver") and kill it ("kill <pid>").
Ekos will immediately register the disconnection, but unfortunately will not tell the Scheduler about it.
Without the fix, the Scheduler hangs waiting for the slew to finish and must be stopped manually.
With the fix, the Scheduler notices the DBus communication error, aborts the running job and attempts to restart Ekos and reconnect to INDI.
Several test runs are needed to kill the Simulator during different stages of the job execution.

Systematic testing can be done with a command running in parallel, such as the following:

$ while true ; do sleep 1 ; if ps -aef | grep indiserver | grep -v grep ; then sleep <delay> ; killall indiserver ; fi ; done

Here, <delay> is the time in seconds to wait, once indiserver is seen running, before killing it.
Tested OK with various delays, killing indiserver while slewing, focusing, guiding, etc.
Killing a local indiserver while the INDI interface is talking through a pipe causes a crash, reported as https://bugs.kde.org/show_bug.cgi?id=397774
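
For repeated runs, the kill loop from the test plan could also be kept as a small script. The following is only a sketch and is not part of this commit: the function names and the KILL_DELAY environment variable are made up for illustration, and pgrep/pkill stand in for the "ps | grep" and "killall" calls of the one-liner. The delay is restricted to whole seconds here because POSIX sleep only guarantees integer arguments.

```shell
#!/bin/sh
# Illustrative helper (hypothetical, not shipped with KStars): kill indiserver
# a fixed number of seconds after it is seen running, over and over.

# Succeeds only if $1 is a non-negative whole number of seconds.
is_valid_delay() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Succeeds when a process named exactly "indiserver" is running.
indiserver_running() {
    pgrep -x indiserver >/dev/null 2>&1
}

# Poll once per second; when indiserver is up, wait $1 seconds and kill it.
# Runs until interrupted with Ctrl-C.
kill_loop() {
    delay=$1
    while true; do
        sleep 1
        if indiserver_running; then
            sleep "$delay"
            pkill -x indiserver
        fi
    done
}

# The loop only starts when KILL_DELAY (an assumed variable name) is set.
if [ -n "${KILL_DELAY:-}" ]; then
    if ! is_valid_delay "$KILL_DELAY"; then
        echo "KILL_DELAY must be a whole number of seconds" >&2
        exit 1
    fi
    kill_loop "$KILL_DELAY"
fi
```

Assuming the script is saved as kill-indiserver.sh (a hypothetical name), running "KILL_DELAY=20 sh kill-indiserver.sh" in a second terminal kills indiserver roughly 20 seconds after each time it comes up. Changing the delay between runs moves the kill to a different stage of the job.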

Reviewers: mutlaqja, wreissenberger

Reviewed By: mutlaqja

Subscribers: kde-edu

Tags: KDE Edu

Differential Revision: https://phabricator.kde.org/D14965