Adding error handling strategy control to Scheduler
Closed, Public

Authored by wreissenberger on Jul 13 2019, 7:17 PM.

Details

Summary

Currently, the Scheduler is not very robust against temporary problems during an imaging session. Passing clouds might create situations where either the Scheduler terminates or the different modules get out of sync with their state. As a result, a few passing clouds can destroy an entire session.

This patch adds an option to control how the Scheduler should handle aborted jobs (see the sketch after this list):

  • Try to restart them directly after a configurable delay.
  • Continue with the other scheduled jobs. As soon as no jobs are scheduled, try to restart the aborted jobs (current behaviour, but with the additional option of a configurable delay).
  • Do not restart aborted jobs at all.
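As a side note for readers of this summary, here is a minimal C++ sketch of one possible way to model these strategies; the identifiers (ErrorHandlingStrategy, mayRestartNow, ...) are hypothetical and not taken from the patch itself.

#include <chrono>

// Hypothetical modelling of the three strategies listed above.
enum class ErrorHandlingStrategy
{
    RestartImmediately,   // retry an aborted job after a configurable delay
    RestartAfterQueue,    // retry only once no other job is scheduled (previous behaviour plus delay)
    DontRestart           // leave aborted jobs alone
};

struct SchedulerJob
{
    bool aborted = false;
};

// Decide whether an aborted job may be re-queued right now.
bool mayRestartNow(const SchedulerJob &job,
                   ErrorHandlingStrategy strategy,
                   bool otherJobsPending,
                   std::chrono::seconds sinceAbort,
                   std::chrono::seconds configuredDelay)
{
    if (!job.aborted || strategy == ErrorHandlingStrategy::DontRestart)
        return false;
    if (sinceAbort < configuredDelay)
        return false;   // the configurable delay applies to both restart modes
    if (strategy == ErrorHandlingStrategy::RestartAfterQueue && otherJobsPending)
        return false;   // wait until nothing else is scheduled
    return true;        // RestartImmediately, or the queue has drained
}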

Additionally, there is an option to handle errors like aborted jobs and try to restart them. Whether to select this option depends on the individual technical setup. In my case, I experience errors from time to time when slewing to a target. Restarting does not create any problem at all. But with other setups it might be dangerous to ignore an error and expose the equipment to the same error situation again and again. That's why I added this as an option.

The third thing I changed is the re-sorting of error jobs to the end of the schedule. This only makes sense if we do not want to restart them. Otherwise, re-sorting disturbs the intentionally set order of jobs. Therefore, I disabled this feature.

Last thing: there were some guiding and focusing problems that led to an error state. I changed their result to the aborted state.

Test Plan

Create a schedule with several jobs and try to provoke aborts and errors:

  • Aborts in guiding can be provoked by sending a small move signal or by setting the guider's imaging time to a very small value.
  • Errors are trickier. One option is to use the debugger, set a breakpoint for example in setMountStatus(), and jump to a line where an error is handled.

Diff Detail

Repository
R321 KStars
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
wreissenberger requested review of this revision. Jul 13 2019, 7:17 PM
TallFurryMan requested changes to this revision. Jul 14 2019, 8:51 AM

Please see my comments. I'm on board with the idea :)
Have you seen my post on the forum about the future of the Scheduler? We need to talk about this at some point.

kstars/ekos/capture/capture.cpp
647

OK, but why did you move the test?

kstars/ekos/scheduler/scheduler.cpp
1431–1432

Here I tend to disagree. We have JOB_INVALID, JOB_ERROR and JOB_ABORTED.
JOB_INVALID indicates a configuration error, and we cannot proceed with the job at all.
JOB_ERROR indicates a fatal error during processing, and we cannot proceed with the job anymore.
JOB_ABORTED indicates a transitory error during processing, and we should retry at some point.
Based on these definitions, we should not re-evaluate JOB_ERROR.
Now, I know that the enums are not properly ordered, but to me JOB_ERROR jobs should still be removed.
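To make that reading concrete, here is a small illustrative sketch; the state names match the Scheduler enums, but the helper itself is hypothetical and simplified to these five states:

enum JobState { JOB_IDLE, JOB_SCHEDULED, JOB_ABORTED, JOB_ERROR, JOB_INVALID };

// Per the definitions above, only JOB_ABORTED jobs would stay in and be retried;
// JOB_ERROR and JOB_INVALID jobs are removed from the schedule.
bool staysInSchedule(JobState state)
{
    switch (state)
    {
        case JOB_ABORTED: return true;   // transitory error, retry at some point
        case JOB_ERROR:                  // fatal error during processing
        case JOB_INVALID: return false;  // configuration error, cannot proceed at all
        default:          return true;   // idle/scheduled jobs follow the normal flow
    }
}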

1537

OK for this block, interesting feature indeed.

1579

I'm not sure about this block; it really gets in the way of the regular scheduling method.
It will conflict with the completion time of the aborted jobs.
It will also conflict with whatever restriction made the jobs abort in the first place.
It will conflict with the preemptive shutdown option (unless you forbid a larger delay?).
Perhaps forbid a delay larger than the lead time option?
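As a hedged illustration of that last suggestion (identifiers hypothetical, not from the patch), the configured delay could simply be clamped to the lead time so the restart path cannot outlast a preemptive shutdown window:

#include <algorithm>

// Clamp the user-configured restart delay to the scheduler lead time.
int effectiveRestartDelaySeconds(int configuredDelaySeconds, int leadTimeSeconds)
{
    return std::min(configuredDelaySeconds, leadTimeSeconds);
}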

1602–1603

OK, only if scheduling algorithm takes the presence of JOB_ABORTED into account.

2065–2066

OK, only if scheduling algorithm takes the presence of JOB_ABORTED into account.

3173

This is a behavior change, but well, makes sense now.
Note that currently, the completion date embeds both date and time.
The completion date should be optional, so that multi-day schedules could be more easily programmed (I have a need for this).
The interface is painful when rescheduling START_AT and COMPLETE_AT jobs, but not moving jobs should improve that.

3199

I disagree: this job must be rescheduled to next observation time, thus JOB_ABORTED.

3225

I disagree: this job must be rescheduled to next observation time (albeit quite a far one!), thus JOB_ABORTED.

3244

I disagree: this job must be rescheduled to the next observation time, thus JOB_ABORTED.

3900

OK! I thought this was already in.

3918

OK.

3920

Enum-to-int conversions are dangerous, we should review this method at some point...

3930

The default value "true" is a bit lost in the code here; can we do better?
Line 3942 for instance, the startup procedure block, encodes the boolean flag by its presence or absence in the "StartupProcedure" element.
The default is clear at line 3929. Wouldn't that be better for maintenance?
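For illustration only, a sketch of that serialization style using Qt's QXmlStreamWriter; the element and function names are placeholders, not the ones used in the patch:

#include <QString>
#include <QXmlStreamWriter>

// Encode a boolean option by the presence or absence of a child element,
// so the default (absent == false) is obvious at the reading site.
void writeErrorHandling(QXmlStreamWriter &xml, bool rescheduleErrors, int delaySeconds)
{
    xml.writeStartElement("ErrorHandlingStrategy");
    xml.writeTextElement("Delay", QString::number(delaySeconds));
    if (rescheduleErrors)
        xml.writeEmptyElement("RescheduleErrors");
    xml.writeEndElement();
}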

4465

As mentioned earlier, JOB_ERROR jobs should be left cancelled per definition.

4487

Same issue as described earlier.

4498

Could you comment on the reason for the setCurrentJob move?

5955

OK

6848–6849

OK! But change the log.

6985–6986

OK! But change the log.

kstars/ekos/scheduler/scheduler.ui
1039–1040

Sorry I didn't check the UI by importing the diff. Will do.

This revision now requires changes to proceed. Jul 14 2019, 8:51 AM
wreissenberger planned changes to this revision. Jul 14 2019, 9:51 AM

Many thanks for your comments. I will think about them and come back.

kstars/ekos/scheduler/scheduler.cpp
1431–1432

I would agree if JOB_ERROR were only used when an error is non-recoverable. But, for example, what error state should be set if a slew command fails? It is definitely recoverable, since sending the command again solves the problem. The same is true if, e.g., we lose the network connection.

The problem is that we do not have a JOB_FATAL state. For those, I absolutely agree, we should not try to restart. That's why I added the option to handle errors like aborts. I am quite sure that there are setups where restarting after errors is dangerous. But with my setup, it's absolutely fine.

1579

This section simply extends the old predicate-based removal to a loop so that it can take into account whether the "re-schedule errors" checkbox is selected. I needed to switch to a loop since the checkbox is not static.
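A rough sketch of what that predicate-to-loop change looks like (types and names are placeholders, not the actual Scheduler code):

#include <QList>

struct Job
{
    bool inErrorState = false;   // corresponds to JOB_ERROR in the real code
};

// Keep error jobs in the list only when the "re-schedule errors" checkbox is
// ticked; everything else is left untouched.
void pruneErrorJobs(QList<Job *> &jobs, bool rescheduleErrors)
{
    QList<Job *> kept;
    for (Job *job : jobs)
        if (!job->inErrorState || rescheduleErrors)
            kept.append(job);
    jobs = kept;
}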

1602–1603

Sorry, don't get the point.

3173

Oops, multi-day schedules? That's tricky.

3199

Maybe neither COMPLETE nor ABORTED. What if we set it to EVALUATION?

3225

see above on line 3191.

We need to decide whether we want the Scheduler to be able to handle multi-day schedules (as currently, without this change) or to restart aborted jobs immediately. The latter is important for robust scheduling during nights with some clouds, where a job gets aborted but may continue as soon as the cloud has passed by.

kstars/ekos/scheduler/scheduler.cpp
3173

I'm afraid that with multi-day schedules we have a conceptual problem with manually sorted lists. As soon as a job reaches its completion time, we need to change the order. Otherwise, re-scheduling it to the next day would mean that the following job is also postponed to the next day.
From that perspective, it makes sense to shift aborted jobs behind the scheduled ones. But this behavior conflicts with the option to restart an aborted job immediately.
Personally, I would opt to resolve multi-day schedules differently - e.g. by an external cron job. Otherwise we would lose the ability to restart aborted jobs immediately - which I see as essential for a robust schedule in nights with clouds.

3199

I have to correct myself: EVALUATION does not help - see the comment above. If the option for immediate restart is selected, aborting a job whose completion time has been exceeded leads to it being scheduled for the following day, and as a consequence the next job will be shifted to the next day as well - which is not the desired behavior.

Serialization of the error strategy corrected, log messages for aborted jobs unified

wreissenberger marked 9 inline comments as done. Jul 14 2019, 8:09 PM

Serialization changed as suggested, log messages corrected. Now we need to agree on how to proceed with the problem that restarting aborted jobs immediately conflicts with the idea of having multi-day schedules.

@mutlaqja: what do you think?

kstars/ekos/scheduler/scheduler.cpp
3920

Agreed.

3930

Good point, changed.

4498

There's no reason for it - reverting.

wreissenberger marked 3 inline comments as done. Jul 17 2019, 12:02 PM

Update to the discussion regarding aborted jobs and limits.

kstars/ekos/scheduler/scheduler.cpp
3173

Please take a look into this discussion: https://indilib.org/forum/general/5423-potential-bug-in-the-scheduler-mount-doesn-t-get-parked-when-guiding-aborted.html

It seems like the current behavior is not very intuitive regarding restarting jobs that hit limiting constraints.

Sorry, I need more time to come back to this. The end of the week is hectic.

Eric, any feedback?

Owww, I totally forgot about this one. Not that I would have been able to test much, that said. Let me try to review this tomorrow.

I've been using it for several weeks now and really appreciate the "restart immediately" function. I typically have 2-3 different targets per night, and it is very handy to restart the aborted job after e.g. a cloud has passed by. In the past few weeks I had several partially cloudy nights and all worked well.

See my comments. This is definitely a better behavior, but I'm still not happy with JOB_ERROR :)

kstars/ekos/scheduler/scheduler.cpp
1431–1432

I believe the current JOB_ERROR and your example JOB_FATAL are equivalent. If the failure of a slew command currently leads to JOB_ERROR and you think it is recoverable, then we should change the state resulting from such a failure to JOB_ABORTED.

1579

OK. I'm still worried about the edge cases I listed.

3173

Agreed, there is still work to do. From my tests, the best way to execute multi-day schedules is to automatically order targets per altitude and set targets to capture indefinitely with a high altitude restriction. This needs to be tested with your auto-restart mechanism. If, of course, someone is willing to spend a few clear nights letting the setup work its schedule on its own :)

3199

I think I understand why you changed that: it conflicts with your auto-restart mechanism. What if you set it to complete when auto-restarting, with a warning, and to aborted when not auto-restarting?

3225

Same as previous comment.

3244

Same as previous comment.

3307

Just a note of warning about translations. If you change this, translators need to change stuff, perhaps needlessly.

4465

Still stands.

kstars/ekos/scheduler/scheduler.ui
1039–1040

Didn't test in the end :/

TallFurryMan accepted this revision. Aug 29 2019, 7:24 AM
This revision is now accepted and ready to land. Aug 29 2019, 7:24 AM

I'm marking this OK, but we'll have to return to it. I'm nearly done with my other activity, temporarily, and hope to return soon to Scheduler.

mutlaqja accepted this revision. Aug 31 2019, 8:23 AM
This revision was automatically updated to reflect the committed changes.

Last night I had a 25-job mosaic run. All jobs were constrained with altitude 15° and twilight.
In the morning, job #11 failed to find a guide star after 2 runs and was aborted (OK).
When job #12 crossed the altitude restriction at 15° while focusing, it was incorrectly set to completed instead of aborted (KO).
When job #13 crossed the altitude restriction at 14.3° before starting, it was incorrectly set to completed instead of aborted (KO).
Further jobs were kept as scheduled, and observatory fell asleep as expected (OK).

When job #12 crossed the altitude restriction at 15° while focusing, it was incorrectly set to completed instead of aborted (KO).
When job #13 crossed the altitude restriction at 14.3° before starting, it was incorrectly set to completed instead of aborted (KO).

That's how it was intended from my side (see our discussion here).
I think I now get your point about multi-day schedules. As soon as a job is marked JOB_COMPLETE, it will not be started the next night when the scheduler keeps running for more than one night.

Would it make more sense to set jobs to JOB_IDLE when they hit a constraint but are not complete? By the way, this is what happens when a job has passed its startup time. If we take that solution, the jobs will be re-scheduled the following night when the scheduler keeps running.

Yeah, the issue I haven't mentioned is that those jobs are marked completed without any frame captured, so the behavior is not correct. I believe that setting those as ABORTED would have been OK: in the case of my mosaic, there's no problem retrying the job at the end of the queue, or "immediately", here rescheduled at a valid time.

OK, understood. But why did it happen that no frame was captured? Was it because, as soon as it would have been their turn to start, they had already hit their altitude or twilight limit?

The point is, if we set them to aborted and the user selects "restart immediately", it will endlessly loop. That's why I suggest setting them to idle as long as they a) have not been completed AND b) one of their constraints fails.

If we follow this logic, we define completed as having captured the defined number of frames. All other constraints only talk about whether a job may be executed at a certain point of time.

Does that make sense?
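A hedged sketch of the rule proposed here, with hypothetical names, just to pin the semantics down:

enum JobState { JOB_IDLE, JOB_ABORTED, JOB_COMPLETE };

// A job is COMPLETE solely when its frame count is reached; a job stopped by a
// constraint goes back to IDLE so a later evaluation can reschedule it; only a
// genuine interruption becomes ABORTED and is subject to the restart strategy.
JobState stateAfterStop(int capturedFrames, int requiredFrames, bool constraintViolated)
{
    if (capturedFrames >= requiredFrames)
        return JOB_COMPLETE;
    if (constraintViolated)
        return JOB_IDLE;
    return JOB_ABORTED;
}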

TallFurryMan added a comment (edited). Sep 18 2019, 7:49 PM

OK, understood. But why did it happen that no frame was captured? Was it because, as soon as it would have been their turn to start, they had already hit their altitude or twilight limit?

Yes, the first job started, proceeded to focus, but in the meantime altitude got under 15°, at 14.6°, and the job aborted (completed).
The second job started at that moment, but immediately saw that its altitude was lower than 15° (I think), and aborted too (completed).

The point is, if we set them to aborted and the user selects "restart immediately", it will endlessly loop. That's why I suggest setting them to idle as long as they a) have not been completed AND b) one of their constraints fails.

In that situation, the job will be re-scheduled because of the restriction. The risk is that the job is re-scheduled far away, potentially blocking subsequent jobs, but that's legal. Re-enqueueing would be a better choice here (that's your default, in fact), instead of restarting immediately.

If we follow this logic, we define completed as having captured the defined number of frames. All other constraints only talk about whether a job may be executed at a certain point of time.

That could be an idea yes.

Second regression observed minutes ago: when restarting after sleeping, all jobs are re-scheduled, even completed ones. I need to remove the jobs that were completed last night manually, or use Remember job progress.

Second regression observed minutes ago: when restarting after sleeping, all jobs are re-scheduled, even completed ones. I need to remove the jobs that were completed last night manually, or use Remember job progress.

Could you please provide a test case?

Yes, given my conditions: make the scheduler complete a job, then abort a second one because of a guide failure. Have the second job re-scheduled far enough away that the scheduler falls asleep. Let the scheduler wake up to process jobs again. When the scheduler wakes up, it re-evaluates jobs, and the completed job is incorrectly reset for execution. Using preemptive shutdown and dawn and dusk offsets could ease that test by reducing the amount of time to wait.

Illustration for the first regression.

I'll provide one for the second regression tonight when the scheduler wakes up around 9:30pm. I tried to extract a log. Unrelated, but I observe that guiding is not stopped when the observatory shuts down.

(... job 11 goes under the altitude restriction ...)

[2019-09-18T06:11:21.233 CEST DEBG ][ org.kde.kstars.ekos.scheduler] - Checking job stage for "NGC6974-Part12" startup 2 "18/09/19 05:04" state 3
[2019-09-18T06:11:21.234 CEST INFO ][ org.kde.kstars.ekos.scheduler] - "Job 'NGC6974-Part12' current altitude (15,00 degrees) crossed minimum constraint altitude (15,00 degrees), marking aborted."
[2019-09-18T06:11:21.261 CEST DEBG ][ org.kde.kstars.ekos.scheduler] - Job ' "NGC6974-Part12" ' is stopping current action... 3
[2019-09-18T06:11:21.284 CEST DEBG ][ org.kde.kstars.ekos.scheduler] - Find next job...
[2019-09-18T06:11:21.284 CEST INFO ][ org.kde.kstars.ekos.scheduler] - "Job 'NGC6974-Part12' is complete."
[2019-09-18T06:11:21.312 CEST DEBG ][     org.kde.kstars.ekos.focus] - Stopppig Focus
[2019-09-18T06:11:21.312 CEST DEBG ][     org.kde.kstars.ekos.focus] - State: "Aborted"

(... Oh whoops, focus is restarting? Yes, and it will finish with the mount parked ...)

[2019-09-18T06:11:22.175 CEST INFO ][     org.kde.kstars.ekos.focus] - "Restarting autofocus process..."
[2019-09-18T06:11:22.184 CEST DEBG ][     org.kde.kstars.ekos.focus] - Starting focus with box size:  64  Subframe:  no  Autostar:  no  Full frame:  yes  [ 0 %, 85 %]  Step Size:  500  Threshold:  150  Tolerance:  10  Frames:  1  Maximum Travel:  6000
[2019-09-18T06:11:22.185 CEST INFO ][     org.kde.kstars.ekos.focus] - "Please wait until image capture is complete..."
[2019-09-18T06:11:22.189 CEST DEBG ][     org.kde.kstars.ekos.focus] - State: "In Progress"
[2019-09-18T06:11:22.191 CEST INFO ][     org.kde.kstars.ekos.focus] - "Capturing image..."
[2019-09-18T06:11:22.208 CEST INFO ][           org.kde.kstars.indi] - MoonLite :  "[INFO] Focuser reached requested position. "
[2019-09-18T06:11:22.296 CEST DEBG ][ org.kde.kstars.ekos.scheduler] - Checking Park Wait State...

(... Now scheduler re-evaluates the other subsequent jobs ...)

[2019-09-18T06:11:22.669 CEST DEBG ][ org.kde.kstars.ekos.scheduler] - "Schedule attempt #1 for 2325-second job 'NGC6974-Part13' on row #13 starting at 18/09/19 05:45, completing at 18/09/19 06:23."
[2019-09-18T06:11:22.670 CEST DEBG ][ org.kde.kstars.ekos.scheduler] - "Job 'NGC6974-Part13' dark sky score is +0 at 18/09/19 06:11"
[2019-09-18T06:11:22.671 CEST DEBG ][ org.kde.kstars.ekos.scheduler] - "Job 'NGC6974-Part13' altitude score is -1000 at 18/09/19 06:11"
[2019-09-18T06:11:22.671 CEST INFO ][ org.kde.kstars.ekos.scheduler] - "Job 'NGC6974-Part13' has a total score of -1000 at 18/09/19 06:11."
[2019-09-18T06:11:22.715 CEST DEBG ][ org.kde.kstars.ekos.scheduler] - "Job 'NGC6974-Part13' on row #13 passed all checks after 1 attempts, will proceed at 18/09/19 05:45 for approximately 2325 seconds, marking scheduled"
[2019-09-18T06:11:22.715 CEST DEBG ][ org.kde.kstars.ekos.scheduler] - "Schedule attempt #1 for 2325-second job 'NGC6974-Part14' on row #14 starting at 18/09/19 21:36, completing at 18/09/19 22:14."

(... Goes on until last job, twice but that's expected, then Scheduler decides to sleep until job 14 ...)

[2019-09-18T06:11:27.005 CEST DEBG ][ org.kde.kstars.ekos.scheduler] - "Job 'NGC6974-Part14' dark sky score is +0 at 18/09/19 06:11"
[2019-09-18T06:11:27.006 CEST DEBG ][ org.kde.kstars.ekos.scheduler] - "Job 'NGC6974-Part14' altitude score is -1000 at 18/09/19 06:11"
[2019-09-18T06:11:27.006 CEST INFO ][ org.kde.kstars.ekos.scheduler] - "Job 'NGC6974-Part14' has a total score of -1000 at 18/09/19 06:11."
[2019-09-18T06:11:27.006 CEST DEBG ][ org.kde.kstars.ekos.scheduler] - "Job 'NGC6974-Part14' is selected for next observation with priority #10 and score -1000."
[2019-09-18T06:11:27.181 CEST INFO ][ org.kde.kstars.ekos.scheduler] - "Job 'NGC6974-Part14' scheduled for execution at 18/09/19 21:36. Observatory scheduled for shutdown until next job is ready."
[2019-09-18T06:11:27.207 CEST DEBG ][ org.kde.kstars.ekos.scheduler] - Checking shutdown state...
[2019-09-18T06:11:27.207 CEST INFO ][ org.kde.kstars.ekos.scheduler] - Starting shutdown process...
[2019-09-18T06:11:28.170 CEST DEBG ][ org.kde.kstars.ekos.scheduler] - Checking shutdown state...
[2019-09-18T06:11:28.171 CEST DEBG ][           org.kde.kstars.indi] - ISD:Telescope: Parking...

(... Scheduler falls asleep, with PHD2 still trying to guide until I close the observatory ...)

[2019-09-18T21:36:02.671 CEST INFO ][ org.kde.kstars.ekos.scheduler] - "Scheduler is awake."
[2019-09-18T21:36:02.674 CEST INFO ][ org.kde.kstars.ekos.scheduler] - Scheduler is starting...
[2019-09-18T21:36:03.701 CEST DEBG ][ org.kde.kstars.ekos.scheduler] - "Searching in path '/home/tallfurryman/Documents/NGC6960/NGC6974-Part1/NGC6974-Part1/Light/H_Alpha', files 'NGC6974-Part1_Light_H_Alpha_480_secs*' for prefix 'NGC6974-Part1_Light_H_Alpha_480_secs'..."
[2019-09-18T21:36:03.724 CEST DEBG ][ org.kde.kstars.ekos.scheduler] - "> Found 'NGC6974-Part1_Light_H_Alpha_480_secs_2019-09-17T21-58-56_001'"
[2019-09-18T21:36:03.724 CEST DEBG ][ org.kde.kstars.ekos.scheduler] - "> Found 'NGC6974-Part1_Light_H_Alpha_480_secs_2019-09-17T22-24-31_003'"

(... goes on enumerating ...)

[2019-09-18T21:36:03.833 CEST DEBG ][ org.kde.kstars.ekos.scheduler] - Frame map summary:
[2019-09-18T21:36:03.833 CEST DEBG ][ org.kde.kstars.ekos.scheduler] -   "/home/tallfurryman/Documents/NGC6960/NGC6974-Part1/NGC6974-Part1/Light/H_Alpha/NGC6974-Part1_Light_H_Alpha_480_secs" : 4
[2019-09-18T21:36:03.833 CEST DEBG ][ org.kde.kstars.ekos.scheduler] -   "/home/tallfurryman/Documents/NGC6960/NGC6974-Part10/NGC6974-Part10/Light/H_Alpha/NGC6974-Part10_Light_H_Alpha_480_secs" : 4
[2019-09-18T21:36:03.834 CEST DEBG ][ org.kde.kstars.ekos.scheduler] -   "/home/tallfurryman/Documents/NGC6960/NGC6974-Part11/NGC6974-Part11/Light/H_Alpha/NGC6974-Part11_Light_H_Alpha_480_secs" : 2
[2019-09-18T21:36:03.834 CEST DEBG ][ org.kde.kstars.ekos.scheduler] -   "/home/tallfurryman/Documents/NGC6960/NGC6974-Part12/NGC6974-Part12/Light/H_Alpha/NGC6974-Part12_Light_H_Alpha_480_secs" : 0
[2019-09-18T21:36:03.834 CEST DEBG ][ org.kde.kstars.ekos.scheduler] -   "/home/tallfurryman/Documents/NGC6960/NGC6974-Part13/NGC6974-Part13/Light/H_Alpha/NGC6974-Part13_Light_H_Alpha_480_secs" : 0

(... frames from job 1 to 10 are 4, 11 is 2 ...)

[2019-09-18T21:36:05.737 CEST INFO ][ org.kde.kstars.ekos.scheduler] - "Job 'NGC6974-Part24' estimated to take 00h 38m 45s to complete."
[2019-09-18T21:36:05.812 CEST INFO ][ org.kde.kstars.ekos.scheduler] - "Job 'NGC6974-Part25' estimated to take 00h 38m 45s to complete."
[2019-09-18T21:36:05.831 CEST INFO ][ org.kde.kstars.ekos.scheduler] - Option to sort jobs based on priority and altitude is false
[2019-09-18T21:36:05.870 CEST DEBG ][ org.kde.kstars.ekos.scheduler] - "Schedule attempt #1 for 2325-second job 'NGC6974-Part1' on row #1 starting at 18/09/19 21:36, completing at 18/09/19 22:14."
[2019-09-18T21:36:05.871 CEST DEBG ][ org.kde.kstars.ekos.scheduler] - "Job 'NGC6974-Part1' dark sky score is +0 at 18/09/19 21:36"
[2019-09-18T21:36:05.872 CEST DEBG ][ org.kde.kstars.ekos.scheduler] - "Job 'NGC6974-Part1' altitude score is +73 at 18/09/19 21:36"
[2019-09-18T21:36:05.873 CEST DEBG ][ org.kde.kstars.ekos.scheduler] - "Job 'NGC6974-Part1' Moon separation score is +20 at 18/09/19 21:36"
[2019-09-18T21:36:05.874 CEST INFO ][ org.kde.kstars.ekos.scheduler] - "Job 'NGC6974-Part1' has a total score of +93 at 18/09/19 21:36."
[2019-09-18T21:36:05.912 CEST DEBG ][ org.kde.kstars.ekos.scheduler] - "Job 'NGC6974-Part1' on row #1 passed all checks after 1 attempts, will proceed at 18/09/19 21:36 for approximately 2325 seconds, marking scheduled"
[2019-09-18T21:36:05.932 CEST DEBG ][ org.kde.kstars.ekos.scheduler] - "Schedule attempt #1 for 2325-second job 'NGC6974-Part2' on row #2 starting at 18/09/19 21:36, completing at 18/09/19 22:14."
[2019-09-18T21:36:05.951 CEST DEBG ][ org.kde.kstars.ekos.scheduler] - "Job 'NGC6974-Part2' is scheduled to start at 18/09/19 22:16, 120 seconds after 18/09/19 22:14, in compliance with previous job completion requirement."
[2019-09-18T21:36:05.952 CEST DEBG ][ org.kde.kstars.ekos.scheduler] - "Schedule attempt #2 for 2325-second job 'NGC6974-Part2' on row #2 starting at 18/09/19 22:16, completing at 18/09/19 22:55."

(... goes on reevaluating all jobs to proceed instead of keeping completed jobs "completed" ...)

I'm not sure where the regression is actually. Could be another diff in the end?

Hm, quite complicated. Has this ever worked? Nevertheless, I think when running multi-day-schedules, I would use "Remember job progress". In that case I think this situation would not happen.

What is quite visible from your example is that setting jobs to completed when they hit a constraint is not a good idea. Completed means that the defined number of frames has been taken - nothing else.

Therefore I would like to proceed in that direction.

Meanwhile, it would be great if we could find a way to simulate the regression above in a reasonable timeframe. I tried but did not succeed. It is not sufficient to send the scheduler to sleep. Maybe the scheduler needs to pass through a period where no job is executable due to constraints, until the constraints become valid again.

Hm, quite complicated. Has this ever worked?

That specific issue, that is, completed jobs resetting to scheduled, yes it has. That was part of my tests in 2017.

Nevertheless, I think when running multi-day-schedules, I would use "Remember job progress". In that case I think this situation would not happen.

Agreed, and I confirm it worked yesterday. We have other problems with that option, but in that "simple" case of a mosaic, it is a good workaround.

What is quite visible from your example is that setting jobs to completed when they hit a constraint is not a good idea. Completed means that the defined number of frames has been taken - nothing else.

Therefore I would like to proceed in that direction.

I agree. It is a regression from this diff, isn't it?

Meanwhile, it would be great if we could find a way to simulate the regression above in a reasonable timeframe.

I very quickly checked the code against my test case, and I'm still puzzled why the situation is different when scheduler is running and looks for a new job (before sleeping in my log) and when scheduler wakes up and looks for a new job. I'll have to check again, slowly this time. I think the observatory shutdown has a role in there.

I agree. It is a regression from this diff, isn't it?

Well, kind of. Before my diff, these were set to "aborted", which was also wrong from my perspective.

So we try to set jobs to COMPLETED based on frame count, we make sure we do not touch the frame count except when receiving a frame or recounting frames in storage, and we do that state change only when starting evaluation? In other words, we never complete jobs during execution? The problem with states is that we must not jitter between values needlessly. Do you have a state diagram in mind?

So we try to set jobs to COMPLETED based on frame count, we make sure we do not touch the frame count except when receiving a frame or recounting frames in storage, and we do that state change only when starting evaluation? In other words, we never complete jobs during execution? The problem with states is that we must not jitter between values needlessly. Do you have a state diagram in mind?

Let me try to define COMPLETED for a sequence job seq_j:

completed(seq_j) <-->
    if a repeat count exists:
        forall filters in seq_j: sum(relevant_frames(filter)) >= repeats * (sum of frames for filter in seq_j)
    else if a repeat-until date is set:
        current time >= repeat-until date

relevant_frames(filter) =
    if "remember job progress": all matching frames in the frames directory
    else: all matching frames from the current scheduler run

I know, there is (at least) one deviation from the current implementation: if a sequence contains more than one entry for a filter, they are counted separately:
3xL + R + G + B + 2xL is not equivalent to 5xL + R + G + B
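To pin the semantics down, here is a hedged C++ transcription of the definition above; all types and helpers are hypothetical, and the per-entry counting deliberately keeps duplicate filter entries separate, as noted:

#include <QDateTime>
#include <QList>
#include <QString>

struct SequenceEntry
{
    QString filter;
    int requestedFrames = 0;   // frames requested by this sequence entry
    int capturedOnDisk = 0;    // matching frames found in the frames directory
    int capturedThisRun = 0;   // matching frames captured in the current scheduler run
};

struct SequenceJob
{
    int repeats = 0;           // 0 means no repeat count defined
    QDateTime repeatUntil;     // invalid if no repeat-until date defined
    QList<SequenceEntry> entries;
};

// relevant_frames: everything on disk when "remember job progress" is set,
// otherwise only what the current scheduler run has captured.
int relevantFrames(const SequenceEntry &entry, bool rememberJobProgress)
{
    return rememberJobProgress ? entry.capturedOnDisk : entry.capturedThisRun;
}

bool completed(const SequenceJob &job, bool rememberJobProgress, const QDateTime &now)
{
    if (job.repeats > 0)
    {
        // Each entry is checked on its own, so 3xL + R + G + B + 2xL is not
        // folded into 5xL + R + G + B.
        for (const SequenceEntry &entry : job.entries)
            if (relevantFrames(entry, rememberJobProgress) < job.repeats * entry.requestedFrames)
                return false;
        return true;
    }
    if (job.repeatUntil.isValid())
        return now >= job.repeatUntil;
    return false;              // e.g. looping indefinitely: never "completed"
}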

Additional issue found by @jpaana. I'll check how to fix this one, it's a bit more tricky:

[2019-09-20T04:53:39.695 EEST INFO ][   org.kde.kstars.ekos.capture] - "Capturing 300,000-second H_Alpha image..."
[2019-09-20T04:53:39.697 EEST DEBG ][ org.kde.kstars.ekos.scheduler] - Capture State "Capturing"
[2019-09-20T04:53:39.734 EEST INFO ][           org.kde.kstars.indi] - Atik 383L :  "[INFO] Taking a 300 seconds frame... "
[2019-09-20T04:54:00.721 EEST INFO ][ org.kde.kstars.ekos.scheduler] - "Job 'NGC1491' is now approaching astronomical twilight rise limit at pe syysk. 20 04:54:00 2019 (0 minutes safety margin), marking aborted."
[2019-09-20T04:54:00.723 EEST DEBG ][ org.kde.kstars.ekos.scheduler] - Job ' "NGC1491" ' is stopping current action... 13
[2019-09-20T04:54:00.727 EEST DEBG ][ org.kde.kstars.ekos.scheduler] - Find next job...
[2019-09-20T04:54:00.727 EEST INFO ][ org.kde.kstars.ekos.scheduler] - Executing Job  "NGC1491"
[2019-09-20T04:54:00.732 EEST INFO ][ org.kde.kstars.ekos.scheduler] - "Job 'NGC1491' capture is in progress (batch #11)..."
[2019-09-20T04:54:00.733 EEST INFO ][ org.kde.kstars.ekos.scheduler] - "Job 'NGC1491' is repeating, looping indefinitely."
[2019-09-20T04:54:00.735 EEST INFO ][   org.kde.kstars.ekos.capture] - "Warning: option \"Always Reset Sequence When Starting\" is enabled and resets the sequence counts."
[2019-09-20T04:54:00.738 EEST INFO ][   org.kde.kstars.ekos.capture] - "Job requires 300,000-second H_Alpha images, has 0/5 frames captured and will be processed."
[2019-09-20T04:54:00.741 EEST INFO ][   org.kde.kstars.ekos.capture] - "Capturing 300,000-second H_Alpha image..."
[2019-09-20T04:54:00.756 EEST DEBG ][ org.kde.kstars.ekos.scheduler] - Capture State "In Progress"
[2019-09-20T04:54:00.757 EEST DEBG ][ org.kde.kstars.ekos.scheduler] - Capture State "Capturing"
[2019-09-20T04:54:12.002 EEST INFO ][           org.kde.kstars.indi] - Atik 383L :  "[INFO] Taking a 300 seconds frame... "

The job has infinite repeats, crosses the twilight restriction, aborts, then restarts immediately.

[2019-09-20T04:54:00.721 EEST INFO ][ org.kde.kstars.ekos.scheduler] - "Job 'NGC1491' is now approaching astronomical twilight rise limit at pe syysk. 20 04:54:00 2019 (0 minutes safety margin), marking aborted."

Looks like at least the log warning is wrong. Assuming that @jpaana has the latest version, that's strange.

These constraints are ugly to test...

I didn't notice the log was strange; I'll double-check. It may be related to the change I made on dawn and dusk offsets, which was merged this week.
About restriction, I remember I used a hack to use the simulation time instead of the system time back in 2017. That might be an interesting feature to put back in.

About restriction, I remember I used a hack to use the simulation time instead of the system time back in 2017. That might be an interesting feature to put back in.

No hack needed indeed, that's already in: just change the simulated time in KStars. The only problem is that sleeping does not take the simulation speed into account; this is only one small fix away (hence my todo list item to wake up more often before the actual milestone).

D24151 should help us debug those situations: it allows changing the simulation step in KStars and has the Scheduler update the delay it waits while sleeping when that happens.
Changing the simulation scale also helps bring restriction events in faster.

OK, I found the problem with hitting a constraint. In this case the Scheduler calls stop(), which sets the job to ABORTED. That explains the behaviour.

As a workaround, error handling should be disabled for such cases.
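As an illustration of that workaround (identifiers hypothetical, not necessarily what the follow-up diff does): when the abort came from a constraint rather than a genuine failure, the immediate-restart strategy is skipped so the Scheduler does not loop on a job that cannot run anyway.

// Skip the immediate-restart path for constraint-driven aborts.
bool shouldRestartImmediately(bool restartImmediatelySelected, bool abortedByConstraint)
{
    if (abortedByConstraint)
        return false;   // let the normal (re-)evaluation pick the job up later
    return restartImmediatelySelected;
}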

Please check D24232; this should hopefully fix the regressions we discussed here.