Improve Scheduler robustness against INDI disconnections
Closed, Public

Authored by TallFurryMan on Aug 21 2018, 7:16 AM.

Details

Reviewers

mutlaqja
wreissenberger

Commits

R321:0fa94d809979: Improve Scheduler robustness against INDI disconnections

Summary

If Ekos loses its connection to INDI during the shutdown procedure, bypass the parking procedure and proceed directly to executing the shutdown script.
When a DBus error occurs while trying to control INDI devices (slewing/tracking, guiding, focusing or capturing), abort the current job, mark INDI as disconnected in the Scheduler state machine, and stop Ekos.
Make the Scheduler timer verify the Ekos and INDI states, so that communication failures can be recovered from immediately by restarting Ekos and reconnecting INDI.
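
The three behaviours above can be pictured as a small decision table. The following is only a minimal sketch under stated assumptions: the names `Link`, `Action`, `JobState`, `onTimerCheck`, `onShutdown` and `onDBusError` are hypothetical and do not appear in scheduler.cpp, and the real Scheduler drives Ekos through DBus rather than through plain function calls.

```cpp
#include <cassert>

// Hypothetical names for illustration only; not the KStars Scheduler API.
enum class Link { Up, Down };
enum class Action { None, RestartEkos, ReconnectIndi, ParkThenShutdown, RunShutdownScript };

// Periodic timer check: decide what to do given the observed link states.
inline Action onTimerCheck(Link ekos, Link indi)
{
    if (ekos == Link::Down)
        return Action::RestartEkos;    // Ekos has to be running before INDI can be reached
    if (indi == Link::Down)
        return Action::ReconnectIndi;
    return Action::None;
}

// Shutdown: parking needs INDI, so it is bypassed when INDI is down.
inline Action onShutdown(Link indi)
{
    return indi == Link::Up ? Action::ParkThenShutdown : Action::RunShutdownScript;
}

// Assumed minimal state tracked by the sketch.
struct JobState
{
    bool running = false;
    Link ekos = Link::Up;
    Link indi = Link::Up;
};

// DBus error while controlling a device: abort, mark INDI down, stop Ekos.
inline void onDBusError(JobState &s)
{
    s.running = false;     // abort the current job
    s.indi = Link::Down;   // disconnect INDI in terms of the state machine
    s.ekos = Link::Down;   // stop Ekos; the next timer check asks for a restart
}
```

In this sketch, a call to `onDBusError` followed by `onTimerCheck(s.ekos, s.indi)` yields `Action::RestartEkos`, which mirrors the "abort, then let the timer recover" flow described in the summary.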

A few situations can lead to INDI disconnections:

A transitory network issue that closes the TCP stream, in which case reopening it restores the normal state (with the running job either continuing or aborted).
A serious network issue that prevents access to the INDI server, in which case Ekos will fail to restart and the Scheduler will stop.
A crash of one of the drivers, in which case Ekos might be able to reconnect to a new instance of the driver, or will loop trying to use the missing driver until it comes up again.

Obviously, it is difficult to properly handle all situations.
For instance when capturing, it may happen that the CCD driver remains in capture mode, with Ekos not being able to recover.
It may happen that the disconnection does not trigger a DBus error, but is caught while the Scheduler is checking the state of the job.
In that situation, Ekos might keep a particular control state for a feature while the crash resets the properties of that feature, so the state Ekos holds becomes invalid and unusable.

Because this robustness improvement only triggers when a communication error occurs, it is not expected to have side-effects on the normal behavior of the Scheduler.

Another issue is currently preventing all combinations of tests from being processed: the Profile field of the scheduler job is not properly handled.
It is currently not possible to have different scheduler jobs using different profiles, and once it uses a particular profile, the Scheduler is unable to switch to another by itself.

An additional issue with parking states was fixed in this differential, because the mitigation process made it very clear and easy to reproduce. The temporary workaround, which tried to unpark the mount again when an unexpected park state was found while slewing, was removed.

Test Plan

Create a scheduler job using the Simulator, with Tracking enabled, to give the tester time to kill the Simulator server.

Example of test session:
Start the Scheduler, and when it connects and starts to slew, use a terminal to find the PID of the INDI server ("ps -aef | grep indiserver") and kill it ("kill <pid>").
Ekos will immediately register the disconnection, but unfortunately will not tell the Scheduler about it.
Without the fix, the Scheduler is hung waiting for the slew to finish and must be stopped manually.
With the fix, the Scheduler notices the DBus communication error, aborts the running job and attempts to restart Ekos and reconnect to INDI.
Several test runs are needed to kill the Simulator during different stages of the job execution.

Systematic testing can be done with a command running in parallel, such as the following:

$ while true ; do sleep 1 ; if ps -aef | grep indiserver | grep -v grep ; then sleep <delay> ; killall indiserver ; fi ; done

Here <delay> is the time, in seconds, to wait before indiserver is killed.
Tested OK with various delays, killing indiserver while slewing, focusing, guiding, etc.
Reported a crash that occurs when killing a local indiserver while the INDI interface is talking through a pipe: https://bugs.kde.org/show_bug.cgi?id=397774

Diff Detail

Repository

R321 KStars

Branch

bugfix__shutdown_parking_with_no_indi (branched from master)

Lint

No Linters Available

Unit

No Unit Test Coverage

Build Status

Buildable 2144
Build 2162: arc lint + arc unit

TallFurryMan created this revision. Aug 21 2018, 7:16 AM

Restricted Application added a project: KDE Edu. Aug 21 2018, 7:16 AM

Restricted Application added a subscriber: kde-edu.

TallFurryMan requested review of this revision. Aug 21 2018, 7:16 AM

Harbormaster completed remote builds in B2057: Diff 40115. Aug 21 2018, 7:16 AM

When the Scheduler attempts to connect to INDI again after a disconnection, is indiConnectionFailureCount reset? Will the Scheduler stop after it exhausts MAX_FAILURE_ATTEMPTS attempts, or will it keep looping forever?

kstars/ekos/scheduler/scheduler.cpp
Lines 3181–3186 (on Diff #40115): Code appears to be repeated multiple times. Maybe make it into a function with a descriptive name?

The clear duplication of code is intentional, so that we can refactor at a later step. There are certainly other locations that could benefit from that scheme and that I did not spot yet.

From the execution flow, yes, reconnecting INDI does honor the failure counter. However, it is not possible to test this with the Simulator as that server always starts immediately and successfully. Admittedly, I should have tested on a real setup that I could disconnect.

So, I tested with a regular remote indiserver, and the result is not that positive, but not negative either.

First, cutting the network off has no particular impact on the INDI connection. It merely freezes interactions with drivers. If the mount is slewing, the DBus error makes the Scheduler disconnect the server. If the CCD is capturing, absolutely nothing happens, and the exposure counter just stays at the same value forever. The INDI connection is never tested by Ekos.

Second, stopping the remote server is properly registered by Ekos and the INDI connection is closed. The Scheduler sees this and retries connecting through Ekos. You were referring to indiConnectionFailureCount, but it is not involved there. That count is only used when connecting devices. Ekos doesn't attempt to connect more than once.

So basically, the differential works, but more safeguards need to be installed. Capture probably needs a timeout, for instance, as it doesn't seem to be polling the driver, but merely receiving notifications.
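
The capture timeout suggested above is not part of this differential. As a rough illustration only, a notification-based watchdog could look like the sketch below. The class name `CaptureWatchdog`, its methods, and the idea of passing the current time in explicitly are all assumptions made for the example; nothing here reflects the actual Ekos Capture module.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical watchdog: flags a capture as stalled when the driver has been
// silent for longer than an allowed window. Not the Ekos Capture API.
class CaptureWatchdog
{
public:
    explicit CaptureWatchdog(std::chrono::seconds silenceLimit) : limit_(silenceLimit)
    {
        assert(silenceLimit.count() > 0);
    }

    // Record a notification from the driver (for instance an exposure countdown update).
    void notify(Clock::time_point now)
    {
        last_ = now;
        armed_ = true;
    }

    // True once the driver has been silent for longer than the limit.
    // Never true before the first notification has been recorded.
    bool expired(Clock::time_point now) const
    {
        return armed_ && (now - last_) > limit_;
    }

private:
    std::chrono::seconds limit_;
    Clock::time_point last_{};
    bool armed_ = false;
};
```

A periodic check, such as the Scheduler timer, could call `expired(Clock::now())` and treat a `true` result the same way as a DBus error. Taking the time point as a parameter keeps the sketch testable without waiting in real time.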

Adding Ekos failure count too. A few more tests showed that the state of the current job remains at "aborted" after the first disconnection, which is incorrect.

Add a failure count to Ekos connection.
Refactor connection loss mitigation, keep current job running during loss.
Test connection before mitigating loss.
Apply connection loss mitigation to more DBus errors.
Adding logs while testing mitigation.
Check Ekos and INDI states before mitigating loss of connection.
Fix issue on parking state, which was worked around before but now is clear.
Remove workaround managing unexpected park state while slewing.
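
The failure count mentioned in this update can be pictured with the following minimal sketch. It assumes a hypothetical `RetryBudget` class and a caller-chosen maximum; the actual value of MAX_FAILURE_ATTEMPTS and the way scheduler.cpp tracks the Ekos count are not shown in this review, so none of this should be read as the committed implementation.

```cpp
#include <cassert>

// Hypothetical bounded retry counter; names are illustrative, not from scheduler.cpp.
class RetryBudget
{
public:
    explicit RetryBudget(int maxAttempts) : max_(maxAttempts)
    {
        assert(maxAttempts > 0);
    }

    // Record a failed connection attempt.
    // Returns true while another attempt is still allowed.
    bool recordFailure()
    {
        ++failures_;
        return failures_ < max_;
    }

    // A successful connection resets the count, so a later loss starts fresh.
    void recordSuccess() { failures_ = 0; }

    int failures() const { return failures_; }

private:
    int max_;
    int failures_ = 0;
};
```

With a budget of three, the first two failures allow a retry and the third does not. Resetting on success is the design choice that keeps a long session with occasional short outages from eventually exhausting the budget.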

Harbormaster completed remote builds in B2144: Diff 40276. Aug 23 2018, 6:58 AM

TallFurryMan edited the summary of this revision. Aug 23 2018, 7:07 AM

TallFurryMan edited the test plan for this revision.

mutlaqja accepted this revision. Aug 23 2018, 9:00 AM

This revision is now accepted and ready to land. Aug 23 2018, 9:00 AM

Closed by commit R321:0fa94d809979: Improve Scheduler robustness against INDI disconnections (authored by TallFurryMan, committed by mutlaqja). Aug 23 2018, 9:03 AM

This revision was automatically updated to reflect the committed changes.

I realize my test plan is lacking one item: I did not test that the job aborting because of twilight was properly stopping actions and guiding.
The code flow is doing that, but the test is missing from my checklist.

TallFurryMan mentioned this in D15073: Fix parking engine, and make observatory startup job-centric. Aug 26 2018, 6:47 AM

Revision Contents
Changeset List

			Path	Packages
M			kstars/ekos/scheduler/scheduler.h (8 lines)
M			kstars/ekos/scheduler/scheduler.cpp (287 lines)

Diff	ID	Base	Description	Created	Lint	Unit
Base			Base
Diff 1	40115	279e64a		Aug 21 2018, 7:16 AM	★	★
Diff 2	40276	e3bec03	- Add a failure count to Ekos connection.	Aug 23 2018, 6:58 AM	★	★
Diff 3	40282	e3bec03	R321:0fa94d809979a1274dd0f0149a633cfaed4e2862	Aug 23 2018, 9:01 AM	★	★

Commit	Tree	Parents	Author	Summary	Date
35fa3d41a852	8f63c2c141e1	ef79caf2c00d	Eric Dejouhanet	Remove workaround managing unexpected park state while slewing.	Aug 23 2018, 6:57 AM
ef79caf2c00d	7f6e10e2e497	559e7ac6dd8f	Eric Dejouhanet	Fix issue on parking state, which was worked around before but now is clear.	Aug 23 2018, 6:56 AM
559e7ac6dd8f	d579a0f3ef83	2855745869fe	Eric Dejouhanet	Check Ekos and INDI states before mitigating loss of connection.	Aug 23 2018, 6:54 AM
2855745869fe	b4c5213bfbf3	d2268c3c2d5b	Eric Dejouhanet	Adding logs while testing mitigation.	Aug 22 2018, 11:49 PM
d2268c3c2d5b	198014976b82	0e613e3fa08d	Eric Dejouhanet	Apply connection loss mitigation to more DBus errors.	Aug 22 2018, 11:35 PM
0e613e3fa08d	179ae44dafb8	7dedd5eb21ff	Eric Dejouhanet	Test connection before mitigating loss.	Aug 22 2018, 11:34 PM
7dedd5eb21ff	450ae02cf39e	1f065198ce46	Eric Dejouhanet	Refactor connection loss mitigation, keep current job running during loss.	Aug 22 2018, 10:14 PM
1f065198ce46	5c5b12e61287	9608f0288f72	Eric Dejouhanet	Add a failure count to Ekos connection.	Aug 22 2018, 10:13 PM
9608f0288f72	b840ccd05600	991ab62f9071	Eric Dejouhanet	Improve Scheduler robustness against INDI disconnections	Aug 21 2018, 6:52 AM
991ab62f9071	011e9e80f6c7	cb64abb5c28f	Eric Dejouhanet	Abort jobs on DBus errors for the Scheduler to retry.	Aug 20 2018, 6:58 AM
cb64abb5c28f	ae87f5d6dec4	e3bec03e20db	Eric Dejouhanet	Bypass parking when shutting down if INDI is down.	Aug 18 2018, 8:09 PM
