Making Scheduler robust against guiding problems
Closed · Public

Authored by wreissenberger on Mar 4 2019, 8:53 PM.

Details

Summary

In the current implementation, guiding problems can bring the Scheduler and the Capture module out of sync. If capturing is suspended, a subsequent guiding error remains undetected. As a consequence, capturing remains suspended and guiding is not restarted.

This situation is not unusual during a night with high cirrus clouds that occasionally disturb guiding. They first lead to guiding deviations and, from time to time, to lost guide stars. On such nights, a single cirrus cloud can interrupt capturing entirely.

In order to cover this, the behaviour of Capture and Scheduler is changed:

  • Aborted guiding leads to aborted capturing, even when capturing is already suspended.
  • Repeated guiding problems lead to jobs being aborted rather than marked with errors.
  • To avoid aborted jobs being restarted immediately, the scheduler considers aborted jobs for restart only if all potentially executable jobs have been aborted.
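The first two rules above can be pictured as small state transitions. This is only a sketch under assumptions: the enums and the functions `onGuidingAborted` and `onGuidingFailure` are hypothetical and much simpler than the actual Ekos Capture and Scheduler code.

```cpp
#include <cassert>

// Hypothetical, simplified states; the real Ekos enums have more values.
enum class CaptureState { Idle, Capturing, Suspended, Aborted };
enum class JobState { Running, Aborted, Error };

// Rule 1: aborted guiding aborts capture, even if capture is only suspended.
CaptureState onGuidingAborted(CaptureState current)
{
    switch (current)
    {
        case CaptureState::Capturing:
        case CaptureState::Suspended:
            return CaptureState::Aborted;
        default:
            return current; // nothing in progress, nothing to abort
    }
}

// Rule 2: once the retry budget is used up, the job is aborted
// instead of being put into an error state.
JobState onGuidingFailure(int failures, int maxFailures)
{
    assert(maxFailures > 0);
    return failures >= maxFailures ? JobState::Aborted : JobState::Running;
}
```

The point of returning `Aborted` rather than `Error` is that an aborted job can still be picked up again later, whereas an error is treated as final.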

Additionally, two other weaknesses have been resolved:

  • Removed restarting aborted capture when guiding resumes, since it conflicts with the Scheduler.
  • In the PHD2 adapter, connectEquipment() may now be called arbitrarily often while equipment is connected.
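The last point amounts to making the connect call idempotent. A minimal sketch of that idea follows, assuming a hypothetical `GuiderAdapter` class with a connection flag; it is not the real PHD2 adapter, and it assumes the connect request succeeds immediately.

```cpp
#include <cassert>

// Hypothetical stand-in for the PHD2 adapter; not the real Ekos class.
class GuiderAdapter
{
public:
    // Safe to call repeatedly: only the first call sends a request.
    bool connectEquipment()
    {
        if (!m_connected)
        {
            ++m_requestsSent;   // the real adapter would send a request to PHD2 here
            m_connected = true; // assumption: the request succeeds immediately
        }
        assert(m_connected);    // postcondition holds however often we are called
        return m_connected;
    }

    int requestsSent() const { return m_requestsSent; }

private:
    bool m_connected = false;
    int m_requestsSent = 0;
};
```

With this guard, a caller such as the Scheduler does not need to track whether it has already asked for a connection.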
Test Plan

  • Create a schedule with at least two jobs that can both be executed, and start the scheduler with guiding enabled. The jobs should use an imaging plan with a narrow guiding deviation limit (e.g. 1 arcsec).
  • Change the guiding parameters such that guiding is bad and creates a guiding deviation above the limit.
  • Interrupt the guiding.
  • The scheduler should now restart guiding and then capturing.
  • Interrupt the guiding another 4 times.
  • After the last interruption, the scheduler should set the job to ABORT and take the next job.
  • After another 5 guiding interruptions, the second job should be set to ABORT.
  • When all jobs are ABORTed, they should all be set to SCHEDULED and one of them should be restarted.
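The expected outcome of the test plan can be walked through with a toy model. The `MiniScheduler` struct below is purely hypothetical; the limit of 5 interruptions comes from the test plan, while restarting the first job after a reset is an arbitrary choice made here, since the plan only says "one of them".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class JobState { Scheduled, Running, Aborted };

// Hypothetical miniature of the scheduler, only for walking through the test plan.
struct MiniScheduler
{
    static constexpr int MaxGuidingFailures = 5; // taken from the test plan
    std::vector<JobState> jobs;
    std::size_t current = 0;
    int failures = 0;

    explicit MiniScheduler(std::size_t count) : jobs(count, JobState::Scheduled)
    {
        run(0);
    }

    // One guiding interruption while the current job is running.
    void interruptGuiding()
    {
        if (++failures < MaxGuidingFailures)
            return; // guiding and capturing are restarted, the job keeps running
        jobs[current] = JobState::Aborted;
        failures = 0;
        startNext();
    }

    void startNext()
    {
        // Prefer a job that has not been aborted yet.
        for (std::size_t i = 0; i < jobs.size(); ++i)
            if (jobs[i] == JobState::Scheduled) { run(i); return; }
        // Every job is aborted: reset them all and restart one of them.
        for (JobState &state : jobs)
            state = JobState::Scheduled;
        run(0);
    }

    void run(std::size_t index)
    {
        assert(index < jobs.size());
        current = index;
        jobs[index] = JobState::Running;
    }
};
```

Calling `interruptGuiding()` five times on a two-job scheduler aborts the first job and starts the second; five more calls abort the second job, after which both are reset and the first one runs again.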

Diff Detail

Repository
R321 KStars
Branch
EKOS/robust_guiding#2
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 9257
Build 9275: arc lint + arc unit
wreissenberger created this revision. Mar 4 2019, 8:53 PM
wreissenberger requested review of this revision. Mar 4 2019, 8:53 PM

Updated to latest master version.

That's great! A few users reported issues regarding this problem.

What about when the internal guider loses a star and reacquires it? I don't believe that's registered as an aborted guiding, right?

Oh wow, so you investigated that too? I have the same observation about aborted jobs! Did you have a look at D19393? Coincidence :)

I need some time to examine the part about guiding: we need the suspension feature to be usable both with and without the scheduler, and reading this I don't readily understand if that's OK.

About aborted jobs set for restart, the approach here is slightly different from D19393. D19393 is not trying to restart aborted jobs, only making sure they don't interfere with other jobs, specifically when the scheduler is running. I needed to include the scheduler running/not running part, but I don't recall why now (I'm on a business trip).

And when I extended D19393 to re-include aborted jobs once all of them were aborted (same idea as yours), I got state instabilities when the altitude restriction kicked in. The altitude restriction has a cutoff when scheduling, but not when running, to cater for the preliminary steps that, empirically, require time before capture.

So that's cool, but needs careful tests. I'll try to do that at the beginning of this week with the simulators.

That's great! A few users reported issues regarding this problem.

Good to know that I'm not alone :-)

What about when the internal guider loses a star and reacquires it? I don't believe that's registered as an aborted guiding, right?

I haven't found a way to test a lost guide star directly. As far as I know, at least PHD2 tries to reacquire a guide star. If this fails within a certain amount of time, it aborts.

Oh wow, so you investigated that too? I have the same observation about aborted jobs! Did you have a look at D19393? Coincidence :)

Nope. Maybe you should add me as a reviewer? :-)

I need some time to examine the part about guiding: we need the suspension feature to be usable both with and without the scheduler, and reading this I don't readily understand if that's OK.

Suspending works in both modes. If used with the scheduler, the scheduler simply thinks that capturing is running although capturing is suspended. Restarting a suspended guiding is handled by the capture module.

About aborted jobs set for restart, the approach here is slightly different from D19393. D19393 is not trying to restart aborted jobs, only making sure they don't interfere with other jobs, specifically when the scheduler is running. I needed to include the scheduler running/not running part, but I don't recall why now (I'm on a business trip).

Ah, interesting, I have to take a closer look at it.

So that's cool, but needs careful tests. I'll try to do that at the beginning of this week with the simulators.

Fully agreed! One thing about aborted jobs is not so nice currently: they are put at the end of the list, i.e. the sort order of targets changes when aborted jobs get restarted.

Oh wow, so you investigated that too? I have the same observation about aborted jobs! Did you have a look at D19393? Coincidence :)

Nope. Maybe you should add me as a reviewer? :-)

That patch was unfinished, but my obs setup accepts a list of differentials to apply when updating itself so I pushed it without reviewers. I like the fact that we both found the same issue at that location :)

I need some time to examine the part about guiding: we need the suspension feature to be usable both with and without the scheduler, and reading this I don't readily understand if that's OK.

Suspending works in both modes. If used with the scheduler, the scheduler simply thinks that capturing is running although capturing is suspended. Restarting a suspended guiding is handled by the capture module.

Agreed, that's really cool.

About aborted jobs set for restart, the approach here is slightly different from D19393. D19393 is not trying to restart aborted jobs, only making sure they don't interfere with other jobs, specifically when the scheduler is running. I needed to include the scheduler running/not running part, but I don't recall why now (I'm on a business trip).

One thing about aborted jobs is not so nice currently: they are put at the end of the list, i.e. the sort order of targets changes when aborted jobs get restarted.

I propose you keep your differential focused on restoring functionality after a guiding failure, even if the aborted job isn't managed that well afterwards (that means no bypassing of aborted jobs at the beginning of evaluation).
I will rebase D19393 on yours, and add a fix to the block removing jobs that are not to be evaluated, so that aborted jobs are kept in place and not touched until they are the only ones remaining.
In this context, I will tackle both the re-evaluation without order change and possible state instabilities like the altitude restriction causing the job to repeatedly abort and reschedule.

What do you think?
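One way to picture the "kept in place" idea from this proposal is to reset aborted jobs where they sit instead of moving them to the end of the list. The sketch below is only an illustration: `Job`, `JobState` and `reviveAbortedInPlace` are hypothetical names, not code from D19393 or from this revision.

```cpp
#include <algorithm>
#include <string>
#include <vector>

enum class JobState { Scheduled, Aborted, Complete };

struct Job
{
    std::string name;
    JobState state;
};

// Resets aborted jobs to Scheduled without reordering the list, and only
// when no other job is left to run. Returns true if any job was reset.
bool reviveAbortedInPlace(std::vector<Job> &jobs)
{
    const bool anyScheduled = std::any_of(jobs.begin(), jobs.end(),
        [](const Job &job) { return job.state == JobState::Scheduled; });
    if (anyScheduled)
        return false; // aborted jobs stay untouched while other work remains

    bool revived = false;
    for (Job &job : jobs)
    {
        if (job.state == JobState::Aborted)
        {
            job.state = JobState::Scheduled;
            revived = true;
        }
    }
    return revived;
}
```

Because the vector is never rearranged, the target order chosen by the user survives a restart. The altitude-restriction instability mentioned earlier is not addressed by this sketch.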

I propose you keep your differential focused on restoring functionality after a guiding failure, even if the aborted job isn't managed that well afterwards (that means no bypassing of aborted jobs at the beginning of evaluation).
I will rebase D19393 on yours, and add a fix to the block removing jobs that are not to be evaluated, so that aborted jobs are kept in place and not touched until they are the only ones remaining.
In this context, I will tackle both the re-evaluation without order change and possible state instabilities like the altitude restriction causing the job to repeatedly abort and reschedule.

What do you think?

Agreed. Jasem should keep in mind that your fix should land shortly after mine is merged. Without fixing the handling of aborted jobs, capture first tries to restart guiding five times, then aborts the job, and as a next step restarts it again. That's not nice, but we can live with it for a short timeframe.

Agreed.
It seems the situation is still better than before with your patch. I should be able to push a diff tonight, based on yours. I can rework the part where you bypass the aborted job, and rebase if you change your diff. No problem.

I think I could quite easily separate the restarting of aborted jobs from the rest. Just give me 1-2 h to check it...

Restricted to handling aborted guiding.

Thanks, I'll start from here.

TallFurryMan accepted this revision. Mar 8 2019, 6:28 PM
This revision is now accepted and ready to land. Mar 8 2019, 6:28 PM

Hello, could we get this D19528 and then D19393 merged? We'll continue from that baseline, which is factually better than the trunk state. Thanks!

Hello, could we get this D19528 and then D19393 merged? We'll continue from that baseline, which is factually better than the trunk state. Thanks!

Who do you mean by "we"? Generally speaking, yes, that makes sense.

That was a "we" for both authors, but the incentive is of course for Jasem to merge if he accepts :)

mutlaqja accepted this revision. Mar 17 2019, 3:54 PM
mutlaqja updated this revision to Diff 54104. Mar 17 2019, 4:50 PM

Fixed conflict

This revision was automatically updated to reflect the committed changes.

This diff contains a severe bug. Please apply D19840 to resolve it.