Details

Reviewers

mutlaqja
TallFurryMan

Commits

R321:410f0226f002: Making Scheduler robust against guiding problems

Summary

In the current implementation, guiding problems can bring the Scheduler and the Capture module out of sync. If capturing is suspended, a subsequent guiding error remains undetected. As a consequence, capturing remains suspended and guiding is not restarted.

This situation is not unusual during a night with high cirrus clouds that occasionally disturb guiding. At first they cause guiding deviations, and from time to time a lost guide star. On such nights, a single cirrus cloud can interrupt the entire capture session.

To cover this, the behaviour of Capture and Scheduler is changed:

Aborted guiding leads to aborted capturing even when capturing is already suspended.
Repeated guiding problems lead to jobs being aborted instead of being marked with errors.
To prevent aborted jobs from being restarted immediately, the scheduler considers aborted jobs for restart only once all potentially executable jobs have been aborted.
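The first rule can be illustrated with a small state function. This is only a sketch under assumptions: the enums and the function name are hypothetical simplifications, not the actual Ekos types or the code in capture.cpp.

```cpp
// Hypothetical, simplified states; the real Ekos enums have more values.
enum class CaptureState { Idle, Capturing, Suspended, Aborted };
enum class GuideState { Guiding, Aborted };

// A capture counts as active while it is running or suspended.
constexpr bool isActive(CaptureState s)
{
    return s == CaptureState::Capturing || s == CaptureState::Suspended;
}

// Returns the capture state after a guiding state change.
// Aborted guiding aborts an active capture, including a suspended one,
// so the scheduler can notice the failure and restart guiding.
constexpr CaptureState onGuideStateChanged(CaptureState capture, GuideState guide)
{
    return (guide == GuideState::Aborted && isActive(capture))
               ? CaptureState::Aborted
               : capture;
}
```

With this rule, a suspended capture no longer hides a guiding abort: onGuideStateChanged(CaptureState::Suspended, GuideState::Aborted) yields CaptureState::Aborted, while an idle capture is left untouched.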

Additionally, two other weaknesses have been resolved:

Removed restarting aborted capture when guiding resumes, since it conflicts with the Scheduler.
In the PHD2 adapter, connectEquipment() may now be called arbitrarily often when equipment is connected.
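The second point amounts to making the connect call idempotent. A minimal sketch, assuming a hypothetical GuiderAdapter class that merely stands in for the PHD2 adapter (the member names and the request counter are illustrative, not the real phd2.cpp code):

```cpp
#include <cassert>

// Hypothetical stand-in for a guider adapter; not the real PHD2 class.
class GuiderAdapter
{
public:
    // Safe to call repeatedly: only a disconnected adapter issues a request.
    bool connectEquipment()
    {
        if (m_connected)
            return true;      // already connected, nothing to do
        ++m_connectRequests;  // stands in for sending the connect request
        m_connected = true;
        return true;
    }

    bool isConnected() const { return m_connected; }
    int connectRequests() const { return m_connectRequests; }

private:
    bool m_connected = false;
    int m_connectRequests = 0;
};

// Calling connectEquipment() twice issues a single request.
inline void demoRepeatedConnect()
{
    GuiderAdapter guider;
    guider.connectEquipment();
    guider.connectEquipment();
    assert(guider.isConnected());
    assert(guider.connectRequests() == 1);
}
```

The early return is the whole point: a caller such as the scheduler can request a connection on every restart attempt without checking the connection state first.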

Test Plan

Create a schedule with at least two jobs that may both be executed, and start the scheduler with guiding enabled. The jobs should use an imaging plan with a narrow guiding deviation limit (e.g. 1 arcsec).
Change the guiding parameters such that guiding is bad and creates a guiding deviation above the limit.
Interrupt the guiding.
The scheduler should now restart guiding and then capturing.
Interrupt the guiding another 4 times.
After the last interruption, the scheduler should set the job to ABORT and take the next job.
After 5 guiding interruptions, the second job should be set to ABORT.
When all jobs are ABORTed, they should all be set to SCHEDULED and one of them should be restarted.
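The expected state transitions of this test plan can be sketched as a small model. Everything here is an assumption for illustration: the Job struct, the JobState enum, the function name and the limit constant are hypothetical, and the model follows the test plan as written rather than the actual scheduler.cpp logic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class JobState { Scheduled, Running, Aborted };

struct Job
{
    JobState state = JobState::Scheduled;
    int guidingFailures = 0;
};

// Limit taken from the test plan (5 interruptions); the real value may differ.
constexpr int kMaxGuidingFailures = 5;

// Records one guiding interruption for the running job and
// returns the index of the job that runs afterwards.
std::size_t onGuidingInterrupted(std::vector<Job> &jobs, std::size_t current)
{
    assert(current < jobs.size());
    Job &job = jobs[current];
    if (++job.guidingFailures < kMaxGuidingFailures)
        return current; // restart guiding, then capturing, on the same job

    job.state = JobState::Aborted;

    // Take the next job that has not been aborted yet.
    for (std::size_t i = 0; i < jobs.size(); ++i)
    {
        if (jobs[i].state == JobState::Scheduled)
        {
            jobs[i].state = JobState::Running;
            return i;
        }
    }

    // All jobs are aborted: reset them to Scheduled and restart one of them.
    for (Job &j : jobs)
    {
        j.state = JobState::Scheduled;
        j.guidingFailures = 0;
    }
    jobs[0].state = JobState::Running;
    return 0;
}
```

Starting with two jobs and the first one running, five calls move execution to the second job, and five more calls reset both jobs and restart the first one.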

Diff Detail

Repository

R321 KStars

Branch

EKOS/robust_guiding#2

Lint

No Linters Available

Unit

No Unit Test Coverage

Build Status

Buildable 9257
Build 9275: arc lint + arc unit

wreissenberger created this revision. Mar 4 2019, 8:53 PM

Restricted Application added a project: KDE Edu. Mar 4 2019, 8:53 PM

Restricted Application added a subscriber: kde-edu.

wreissenberger requested review of this revision. Mar 4 2019, 8:53 PM

Harbormaster completed remote builds in B9170: Diff 53165. Mar 4 2019, 8:53 PM

Updated to latest master version.

Harbormaster completed remote builds in B9175: Diff 53171. Mar 4 2019, 10:18 PM

That's great! A few users reported issues regarding this problem.

What about when internal guider loses a star and reacquires it? I don't believe that's registered as an aborted guiding, right?

Oh wow, so you investigated that too? I have the same observation about aborted jobs! Did you have a look at D19393? Coincidence :)

I need some time to examine the part about guiding: we need the suspension feature to be usable both with and without the scheduler, and reading this I don't readily understand if that's OK.

About aborted jobs set for restart, the approach here is slightly different from D19393. D19393 is not trying to restart aborted jobs, only making sure they don't interfere with other jobs, specifically when the scheduler is running. I needed to include the scheduler running/not running part, but I don't recall why now (I'm on a business trip).

And when I extended D19393 to reinclude aborted jobs once all were aborted (same idea as yours), I got state instabilities when the altitude restriction kicked in. The altitude restriction has a cutoff when scheduling, but not when running, to cater empirically for preliminary steps that require time before capture.

So that's cool, but needs careful tests. I'll try to do that beginning of this week with the simulators.

In D19528#424928, @mutlaqja wrote:

That's great! A few users reported issues regarding this problem.

Good to know that I'm not alone :-)

What about when internal guider loses a star and reacquires it? I don't believe that's registered as an aborted guiding, right?

I haven't found a way to test a lost guide star directly. As far as I know, at least PHD2 tries to re-acquire a guide star. If this fails within a certain amount of time, it aborts.

In D19528#424942, @TallFurryMan wrote:

Oh wow, so you investigated that too? I have the same observation about aborted jobs! Did you have a look at D19393? Coincidence :)

Nope. Maybe you should add me as a reviewer? :-)

I need some time to examine the part about guiding: we need the suspension feature to be usable both with and without the scheduler, and reading this I don't readily understand if that's OK.

Suspending works in both modes. If used with the scheduler, the scheduler simply thinks that capturing is running although capturing is suspended. Restarting a suspended guiding is handled by the capture module.

About aborted jobs set for restart, the approach here is slightly different from D19393. D19393 is not trying to restart aborted jobs, only making sure they don't interfere with other jobs, specifically when the scheduler is running. I needed to include the scheduler running/not running part, but I don't recall why now (I'm on a business trip).

Ah, interesting, I have to take a closer look at it.

So that's cool, but needs careful tests. I'll try to do that beginning of this week with the simulators.

Fully agreed! One thing about aborted jobs is not so nice currently: they are put at the end of the list, i.e. the sorting of targets changes when aborted jobs get restarted.

Oh wow, so you investigated that too? I have the same observation about aborted jobs! Did you have a look at D19393? Coincidence :)

Nope. Maybe you should add me as a reviewer? :-)

That patch was unfinished, but my obs setup accepts a list of differentials to apply when updating itself so I pushed it without reviewers. I like the fact that we both found the same issue at that location :)

I need some time to examine the part about guiding: we need the suspension feature to be usable both with and without the scheduler, and reading this I don't readily understand if that's OK.

Suspending works in both modes. If used with the scheduler, the scheduler simply thinks that capturing is running although capturing is suspended. Restarting a suspended guiding is handled by the capture module.

Agreed, that's really cool.

About aborted jobs set for restart, the approach here is slightly different from D19393. D19393 is not trying to restart aborted jobs, only making sure they don't interfere with other jobs, specifically when the scheduler is running. I needed to include the scheduler running/not running part, but I don't recall why now (I'm on a business trip).

One thing about aborted jobs is not so nice currently: they are put at the end of the list, i.e. the sorting of targets changes when aborted jobs get restarted.

I propose you keep your differential focused on restoring functionality after a guiding failure, even if the aborted job isn't managed that well afterwards (that means not bypassing aborted jobs at the beginning of evaluation).
I will rebase D19393 on yours, and add a fix to the block removing jobs that are not to be evaluated, so that aborted jobs are kept in place and not touched until they are the only ones remaining.
In this context, I will tackle both the re-evaluation without order change and possible state instabilities like the altitude restriction causing the job to repeatedly abort and reschedule.

What do you think?

I propose you keep your differential focused on restoring functionality after a guiding failure, even if the aborted job isn't managed that well afterwards (that means not bypassing aborted jobs at the beginning of evaluation).
I will rebase D19393 on yours, and add a fix to the block removing jobs that are not to be evaluated, so that aborted jobs are kept in place and not touched until they are the only ones remaining.
In this context, I will tackle both the re-evaluation without order change and possible state instabilities like the altitude restriction causing the job to repeatedly abort and reschedule.

What do you think?

Agreed. Jasem should keep in mind that your fix should land shortly after mine is merged. Without fixing the handling of aborted jobs, capture first tries to restart guiding five times, then aborts the job and, as a next step, restarts it again. That's not nice, but we can live with it for a short timeframe.

Agreed.
It seems the situation is still better than before with your patch. I should be able to push a diff tonight, based on yours. I can rework the part where you bypass the aborted job, and eventually rebase if you change your diff. No problem.

I think I could quite easily separate restarting of aborted jobs from the rest. Just give me 1-2 h to check it...

Restricted to handling aborted guiding.

Harbormaster completed remote builds in B9257: Diff 53314. Mar 6 2019, 8:52 PM

Thanks, I'll start from here.

Is this good to go, or is it better to wait until D19393 is complete?

TallFurryMan accepted this revision. Mar 8 2019, 6:28 PM

This revision is now accepted and ready to land. Mar 8 2019, 6:28 PM

Hello, could we get this D19528 and then D19393 merged? We'll continue from that baseline, which is factually better than the trunk state. Thanks!

Hello, could we get this D19528 and then D19393 merged? We'll continue from that baseline, which is factually better than the trunk state. Thanks!

Whom do you mean by "we"? Generally speaking, yes, makes sense.

That was a "we" for both authors, but the incentive is of course for Jasem to merge if he accepts :)

mutlaqja accepted this revision. Mar 17 2019, 3:54 PM

Fixed conflict

Harbormaster completed remote builds in B9728: Diff 54104. Mar 17 2019, 4:50 PM

Closed by commit R321:410f0226f002: Making Scheduler robust against guiding problems (authored by mutlaqja). Mar 17 2019, 4:50 PM

This revision was automatically updated to reflect the committed changes.

wreissenberger mentioned this in D19840: Bugfix: proper usage of abort() for finishing a capture sequence queue. Mar 17 2019, 6:41 PM

This diff contains a severe bug. Please apply D19840 to resolve it.

mutlaqja mentioned this in R321:bd1d4d28b1cc: Bugfix: proper usage of abort() for finishing a capture sequence queue. Mar 18 2019, 5:09 AM

			Path	Packages
M			kstars/ekos/capture/capture.h (1 line)
M			kstars/ekos/capture/capture.cpp (19 lines)
M			kstars/ekos/guide/externalguide/phd2.cpp (8 lines)
M			kstars/ekos/scheduler/scheduler.cpp (2 lines)

Diff	ID	Base	Description	Created	Lint	Unit
Base			Base
Diff 1	53165	1c9157e		Mar 4 2019, 8:53 PM	★	★
Diff 2	53171	9947d6f	Updated to latest master version.	Mar 4 2019, 10:18 PM	★	★
Diff 3	53314	7d0e571	Restricted to handling aborted guiding.	Mar 6 2019, 8:52 PM	★	★
Diff 4	54104	f52ef34	Fixed conflict	Mar 17 2019, 4:50 PM	★	★
Diff 5	54105	f52ef34	R321:410f0226f002d25004a0db5ef2d33ff79460dd81	Mar 17 2019, 4:50 PM	★	★

Commit	Tree	Parents	Author	Summary	Date
b3d1374b6517	e8c69321a3cf	87bd3553453c	Wolfgang Reissenberger	Aborted guiding during suspended capturing aborts capturing	Mar 6 2019, 8:48 PM
87bd3553453c	881f9fad2e6f	bcaf0c6ac1d6	Wolfgang Reissenberger	Removed restarting aborted capture when guiding resumes, since it conflicts…	Mar 4 2019, 12:51 PM
bcaf0c6ac1d6	6b88ce4d48dd	7d0e57198346	Wolfgang Reissenberger	connectEquipment() may be called arbirtrarily often when equipment is connected	Mar 4 2019, 10:08 PM
