Differential D12485

Scheduler adjustments and fixes
ClosedPublic
Actions

Authored by TallFurryMan on Apr 24 2018, 6:38 AM.

Details

Reviewers

mutlaqja

Commits

R321:ea68c70fced0: Scheduler adjustments and fixes

Summary

Added D-bus error management for several requests. Updated job status when needed in case of error.

Aborted some jobs instead of marking them invalid, in order to give them a chance to reschedule. Other no-go situations mark the jobs invalid directly.

Issued a warning when one of the jobs is using FINISH_LOOP, as this will prevent other jobs from scheduling. Will rework to update priority when there are other uncompleted jobs in the list.

Prevented dawn verification when twilight check is disabled for a job. Besides, this allows daily testing in my basement :) Also fixed twilight enforcement on culmination-based jobs.

Sometimes when pausing/restarting the scheduler, the state machine misses a step regarding mount parking state. Added a safeguard to unpark the mount when trying to slew, instead of waiting indefinitely for the end of an impossible slew.

Updated log after checking altitude calculation to clarify situation.

Stopped guiding when aborting a job and completing a looping job. Aborting a job should reset guiding, as we're not able to continue on the current target. Completing a looping job that is not guiding requires realignment.

Also updated failure count management. Alignment now resets the mount model at the last attempt before failing, and only if the settings option is enabled. When alignment fails, job is aborted, not terminated, it could be passing clouds for instance. Successful alignment properly resets the failure count.

Similarily, guiding now properly resets the failure count if successful.

Similarily, capturing now restarts the capture stage when failing, or guiding stage if it is guiding that failed. When capture fails, job is aborted, not terminated, but it's unknown whether it is possible to recover from that failure. Successful capture properly resets the failure count.

SchedulerJob now manages column resizing. This required the cell to be inserted in the table prior to be configured, but UI is not refreshing yet at this step. Now timestamps should format the table properly every time they are updated.

SchedulerJob refactor, documentation. Moved getters to interface to clarify implementation. Added some cell updates that were missing for some fields.

Add current scheduler job name and stage. This change displays the name and stage of the current job in the scheduler interface, so that the end-user knows what the scheduler is doing.

Rework job states and refactor job reset. Keeping scores now that jobs are scheduled properly. START_ASAP jobs that are scheduled auto-switch to START_AT when given a start time, and reset to the original user setting.

Refactored resetting jobs to original settings inside SchedulerJob, so that all resets behave the same.

When the scheduler is stopped, jobs will abort and not restart. Jobs have to be reset using the evaluate button before they can restart. This is a safeguard to clean up weird states when pausing/resuming the scheduler.

FIXME on leaving the mount to track while the scheduler is sleeping could result in problems if the end-user stops the scheduler or the system crashes. At the very least the mount should not be requested to track.
FIXME on negative scores that are weird for the end-user. Probably as soon as a negative score is obtained, no further score evaluation should be attempted, just in case a subsequent one adds enough to return positive.
FIXME on altitude cutoff that has an issue with rising targets, which will get aborted. Mitigated by a previous change which reschedules aborted jobs, but incorrect nonetheless.
FIXME on manual execution of startup/shutdown scripts which probably requires INDI to be connected to succeed. Warning or notification should ensue properly.

Test Plan

Same as previous differential. There are still some edge cases, plus one issue with the remote solver when Ekos shutdowns INDI while the solver is still working. Apparently the solver is stopped after INDI, leading to a crash.

Diff Detail

Repository

R321 KStars

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

TallFurryMan created this revision.Apr 24 2018, 6:38 AM

Restricted Application added a project: KDE Edu. · View Herald TranscriptApr 24 2018, 6:38 AM

Restricted Application added a subscriber: KDE Edu. · View Herald Transcript

TallFurryMan requested review of this revision.Apr 24 2018, 6:38 AM

with arc patch, that's what I get now:

commit ea0a4b65aef73ef26e3269c19896d19a1a191c65
Author: TallFurryMan <eric.dejouhanet@gmail.com>
Date: Tue Apr 24 10:36:12 2018 +0300

Can you correct this so that it uses your name?

Ah so I didn't understand your original query. What is it that you want me to change? "TallFurryMan", which I set to use the same username as github, to "Eric Dejouhanet"?

I have an additional question that I forgot to ask on the forum.
At scheduler.cpp:985 in this patch, I removed what I previously added to reevaluate aborted jobs. So right now, aborted jobs are not retried, and we're in the same situation as before my changes. Specifically, when stopping the scheduler, you need to click on "evaluate jobs" or double-click on a job to set it for re-evaluation. But I think this would be a great addition to the scheduler that this re-evaluation would be automatic.
I couldn't make up my mind and wanted to ask, but now I think I should go for it and enable it for users to test.
So I'll take that author update as an opportunity to enable that change, for the Greater Good :)

Yes I was wondering about that when you initially added it. Though it looks like your changes might have caused a regression. http://indilib.org/forum/general/3333-repeat-scheduler.html#25445

Can you please verify this?

Well that forum post is not a regression per se, it's just that it works as expected now :) The Scheduler is checking how many captures it has to do, and the Sequence is pointing at where those captures are stored, so it's expected duplicated SchedulerJobs switch to complete at the same time.
I'll reply to that post.

My question actually related to restarting aborted jobs. Currently aborted jobs stay aborted until they are reset. I feel we need to change that, so that the Scheduler will retry those aborted jobs until they do succeed eventually. It's just that those restarts must be done after other jobs were processed. So that, for instance, passing clouds breaking alignment stage may be properly considered as a transitory situation, and the target retried later on.

Ok I see. But we have to be careful here. So you're suggesting that aborted are retried _after_ all other jobs are complete or just after the whatever current job is running now complete? What if the job aborts again, it moves from transitionary to permanent?

I think the most plausible solution is to restart aborted jobs when findNextJob doesn't find a new job to process (all valid jobs complete or aborted), or when the next job is scheduled with a delay that allows the estimated time to fit.
Perhaps increasing the priority of aborting jobs by an arbitrary value would be quite elegant too.

Did reset aborted jobs for reevaluation when starting the scheduler.
Fixed the issue with jobs close from each other not being rescheduled properly because they have no defined start time.
Refactored currentJob assignment so that the current job state on the lower right of the queue table is coherent.
Restored bad score set on jobs scheduled in the future, not sure I really understand the intent.
The job selection algorithm seems to always select the job with the lowest score first, and that feels weird...

Small check added to avoid spamming the user with failed evaluation when the evaluated job already has a negative score.
Probably related to altitude cutoff, the algorithm should eventually immediately recalculate a suitable time.

OK so now one thing that remains is that scheduler.cpp:1243 is resetting the schedule for jobs that were rescheduled because they were too close. The knots are untangling progressively.
Plus, there seems to be a bug in the altitude calculation, which can schedule targets to start very soon with a very low altitude. Altitude cutoff again, probably.

Well, I think I need an advice: scheduled jobs need a start time, but if multiple jobs start at the same time, the algorithm adding an offset to the start time will trigger.
Before D12485, jobs scheduled asap would not get a start time, so the algorithm adding the offset would manipulate invalid dates, and in the end be ineffective.
So incidentally, by scheduling asap jobs to start "now", I properly have the algorithm adding the offset to valid dates, but this causes a regression because this offset now delays the startup of the jobs, which were contiguous before.
However, this situation is probably due to the incorrect estimation of the duration of the job, which is used as offset. Well, actually, that offset is clamped to the lead time anyway, so jobs are separated by 5min minimum when they go through that part.
So all in all, for this specific point, the previous behavior was somehow better than what the D12485 brings, so I'll probably revert if I don't find a more elegant solution, which might be to drop the algorithm adding offsets.

Well, with all these changes I've been thinking, why specify a starting time for other jobs as well? Since now we are always evaluating, we just need to figure out when to start the next job and that's it. After the "next" job is completed/aborted..etc, then we can do re-evaluation again and decide time for next job. So the previous algorithm was based on the fact the evaluations were done once, and therefore all times for all jobs must be known and accounted for in advance, but with the new strategy, this is no longer the case. I'm not sure if there are other uses for startup time that I'm not aware of?

From what I could compile from the original algorithm, the idea was that START_ASAP jobs were either evaluated to START_AT with a schedule time but a bad score, or left to START_ASAP with an invalid schedule time and a good score. Then, START_AT jobs with an original state at START_ASAP would get reset to START_ASAP when evaluating, get an invalid schedule time but retain their score. This led to an incoherence between starting the scheduler as is, vs. evaluating first then starting the scheduler, or vs. double-clicking a job then starting the scheduler.

Now that evaluation is done systematically, the remaining problem is to determine which START_AT job to start first (one which was originally a START_ASAP or one which was originally a START_AT), and whether that START_AT job can be scheduled "asap", in other words if its preliminary schedule information can be discarded without side effect. There's no reason preliminary schedule information should become invalid, except in the case anterior jobs didn't go as planned and were either done with sooner or later than expected. Which indeed is a very probable outcome actually from experience.

For the next commit, I will rework the lead time check to cater for this, and sort jobs per their schedule when selecting the best candidate. Now that I got many things clarified, the commit should be far more concise...

Sorting jobs per score still seems weird to me. Also, I'll provide an .esl on the forum for users to test and feedback.

I checked your proposal and how it could be implemented, but the main problem I see is that the end-user do appreciate to see when a particular job is scheduled in advance.
Which means an evaluation that leads to a preliminary schedule of all jobs is important, as well as subsequent evaluations that result in the most similar schedule possible.

I suppose we can have hard vs. soft startup times? The *hard* ones is when the user explicitly specifies them, while the soft ones are subject to change and are given as mere estimations?

We do have that partially yes: fileStartupCondition vs. startupCondition. The idea would be to compare those to understand whether startupTime is fixed or not, and either start the next job immediately or wait for the explicit time. I'll attempt to do that in the next differential.

This revision was not accepted when it landed; it landed in state Needs Review.Apr 26 2018, 6:51 AM

Closed by commit R321:ea68c70fced0: Scheduler adjustments and fixes (authored by TallFurryMan, committed by mutlaqja). · Explain Why

This revision was automatically updated to reflect the committed changes.

Revision Contents
Changeset List

			Path	Packages
M			kstars/ekos/scheduler/scheduler.h (5 lines)
M			kstars/ekos/scheduler/scheduler.cpp (493 lines)
M			kstars/ekos/scheduler/scheduler.ui (10 lines)
M			kstars/ekos/scheduler/schedulerjob.h (142 lines)
M			kstars/ekos/scheduler/schedulerjob.cpp (252 lines)

Diff	ID	Base	Description	Created	Lint	Unit
Base			Base
Diff 1	32946	0111020		Apr 24 2018, 6:38 AM	★	★
Diff 2	33021	0111020	Did reset aborted jobs for reevaluation when starting the scheduler.	Apr 24 2018, 11:02 PM	★	★
Diff 3	33022	0111020	Small check added to avoid spamming the user with failed evaluation when the…	Apr 24 2018, 11:13 PM	★	★
Diff 4	33123	093e671	R321:ea68c70fced0ef58454dd7fe5cd12fefdfcb473b	Apr 26 2018, 6:50 AM	★	★

Commit	Tree	Parents	Author	Summary	Date
9618f55c333e	9a3912447995	35d132dbe028	Eric Dejouhanet	Scheduler adjustments and fixes (Show More…)	Apr 8 2018, 1:44 PM
35d132dbe028	8d9a2b68e792	dd362c3ab61e	Eric Dejouhanet	Rework job states and refactor job reset (Show More…)	Apr 24 2018, 6:19 AM
dd362c3ab61e	fd311ef495a4	fb3f106ec2d3	Eric Dejouhanet	Add current scheduler job name and stage (Show More…)	Apr 22 2018, 10:18 PM
fb3f106ec2d3	ea855f3e7916	37c346103134	Eric Dejouhanet	SchedulerJob refactor, documentation. (Show More…)	Apr 22 2018, 7:25 PM
37c346103134	6ecf4810a17c	41d8bf06241a	TallFurryMan	Resizing job columns from content (Show More…)	Apr 8 2018, 11:17 AM
41d8bf06241a	9290e54228fe	0111020ce4e6	TallFurryMan	Scheduler adjustments and fixes (Show More…)	Apr 8 2018, 1:22 PM