Update job completion counts, and fix regression on job rescheduling.
ClosedPublic

Authored by TallFurryMan on Aug 20 2018, 7:18 AM.

Details

Summary

This change connects the Scheduler to the new-image signal sent by the Capture module.
If job progress is remembered, it triggers a refresh of the capture storage and updates the time estimation of all jobs.
It also makes sure a completed job is not re-evaluated for scheduling in the basic "run-to-finish" and "repeat-N-times" cases.
It also adjusts the capture counts in the Scheduler table so that required counts are not displayed for jobs that are looping.

When remembering job progress, the new-image event cannot be handled as a simple +1 on the running job's count, because of duplicates.
Note that ANY new-image event coming from the Capture module will trigger a refresh, even if the Scheduler is not running.
This is not expected to be an issue, as currently all jobs, whatever their state (even while being edited), can be re-evaluated without issue.
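
For illustration, a minimal sketch of the wiring and of the two execution paths described in the test plan below. All identifiers here (signal, slot, and accessor names) are placeholders standing in for the actual KStars symbols discussed later in this review, not the committed code:

    // Sketch only: the Ekos manager relays Capture's new-image signal to the Scheduler.
    connect(captureProcess, SIGNAL(newImage(QString)),
            schedulerProcess, SLOT(updateCaptureState(QString)));

    // In the Scheduler, the handler branches on the "Remember Job Progress" option.
    void Scheduler::updateCaptureState(const QString &signature)
    {
        if (Options::rememberJobProgress())
            updateCompletedJobsCount();   // re-enumerate the capture storage, update all estimates
        else if (currentJob != nullptr)
            currentJob->setCompletedCount(currentJob->getCompletedCount() + 1);
    }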

Test Plan

Create a scheduler job list, with or without duplicate jobs, using sequence files with or without duplicate sequence jobs.
Test vector "repeated_jobs_no_twilight.esl" is a simple version of this test setup.
You may need to disable the option "Sort jobs by Altitude and Priority" to get an ordered schedule.

Run the Scheduler and check that capture counts are properly updated (beware, this can be quite difficult to follow in complex cases).
Without the change, no capture count updates, and the running job is rescheduled incorrectly.
With the change, the relevant capture counts update, as do the imaging durations, resulting in a proper evaluation of the imaging time left (yay!).
When the job finishes capturing, it is considered complete.

The execution path depends on the value of option "Remember Job Progress".
When that option is set, the capture storage will be refreshed on each new capture received to update the completed count.
When that option is not set, the running job will get its completed count increased by one, independently of other jobs.

Also observe that, with the change, the looping last job shows only a count of completed captures, instead of a completed/required count.

Notes:
There's an issue that arises from this test, though it is not exactly a real issue, more a consequence of the implementation.

Because the Capture module has options to add a Filter and a Duration suffix to file names (and those options are effectively optional),
and because the Scheduler recognizes capture files by matching the beginning of the file *name*, captures without a suffix *and* captures
with a suffix will both be counted for a scheduler job using a sequence *without* a suffix. Conversely, a scheduler job using a sequence
*with* a suffix will not count capture files *without* that suffix.
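
For illustration, a minimal sketch of the prefix-matching behavior described above; this is not the actual KStars code, and matchesSignature is a hypothetical helper:

    // A signature without suffix is a prefix of both suffixed and unsuffixed
    // file names, so it matches both; the reverse is not true.
    bool matchesSignature(const QString &fileName, const QString &signature)
    {
        // "M31_Light_Lum" matches "M31_Light_Lum_001.fits" as well as
        // "M31_Light_Lum_300_secs_001.fits".
        return fileName.startsWith(signature);
    }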

If we really want to fix that issue all the way through, we need to look into the FITS header to match the capture file against the sequence
properties.
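
A hedged sketch of what such a FITS-header-based match could look like, using cfitsio. The keyword names FILTER and EXPTIME are assumptions about what Capture stamps into the header, and matchesSequence is a hypothetical helper:

    #include <fitsio.h>
    #include <cmath>
    #include <cstring>

    bool matchesSequence(const char *path, const char *expectedFilter, double expectedExposure)
    {
        fitsfile *fptr = nullptr;
        int status = 0;
        char filter[FLEN_VALUE] = { 0 };
        double exposure = 0;

        if (fits_open_file(&fptr, path, READONLY, &status))
            return false;
        // Read the properties stamped into the capture, instead of parsing the file name.
        fits_read_key(fptr, TSTRING, "FILTER", filter, nullptr, &status);
        fits_read_key(fptr, TDOUBLE, "EXPTIME", &exposure, nullptr, &status);
        fits_close_file(fptr, &status);

        return status == 0 && !strcmp(filter, expectedFilter) &&
               std::fabs(exposure - expectedExposure) < 0.001;
    }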

Another important point: once jobs get their completed count increased, the only way to reset that value is to delete and re-create the
scheduler job, or to load a new scheduler file, which essentially does the same thing.

Diff Detail

Repository
R321 KStars
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
TallFurryMan created this revision. Aug 20 2018, 7:18 AM
Restricted Application added a project: KDE Edu. Aug 20 2018, 7:18 AM
Restricted Application added a subscriber: kde-edu.
TallFurryMan requested review of this revision. Aug 20 2018, 7:18 AM
mutlaqja added inline comments. Aug 20 2018, 8:15 AM
kstars/ekos/scheduler/scheduler.h
212 ↗(On Diff #40029)

Should be setCaptureState. We use newXXX() or XXXupdated() for signals.
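
A small hypothetical header to illustrate the convention (not actual KStars code):

    class Capture : public QObject
    {
        Q_OBJECT
      signals:
        // Signals follow the newXXX()/XXXupdated() pattern.
        void newImage(const QString &filename);
        void captureStateUpdated(int state);
      public slots:
        // Slots use imperative names such as setXXX()/updateXXX().
        void setCaptureState(int state);
    };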

TallFurryMan planned changes to this revision. Aug 20 2018, 8:23 AM

Makes sense! Sorry for not noticing.

Changed setCaptureState to updateCaptureState to homogenize the naming.

TallFurryMan added inline comments. Aug 20 2018, 7:34 PM
kstars/ekos/ekosmanager.cpp
1805 ↗(On Diff #40078)

I selected the same naming as the lines above the change. Tell me if those were not a proper example.

I think the Scheduler should only communicate using DBus, so we need a DBus signal. Part of the plan is to rework the DBus interface of the Ekos modules into Properties/Methods/Signals; currently, it's just methods. This requires some major refactoring, but it's on my TODO list for 3.0.0, and I should start on it next Sunday (after my vacation).

TallFurryMan marked an inline comment as done. Aug 23 2018, 4:14 PM

Incidentally, I started to work on this differential with a DBus signal. But then I stopped and asked myself, "Isn't that stupid? Why would I want to do that inside a single app?" :)
I can understand the need to notify scripts through DBus, but to me the app signal should be relayed by a separate module that converts the Qt signal into a DBus signal.
Otherwise we will end up with a DBus API that is scattered throughout all modules, inputs and outputs alike, with a lot of duplicated control code... It's already difficult to picture the state machine of each module, and the worst is yet to come if we bloat it with exception management...
Plus, you probably can't afford to touch the DBus API just to fix a UI issue, which is mostly what Qt signals were designed for.

Yes, within a single app it doesn't make sense, but the Scheduler is designed so that it could become its own separate app later on; this is why we went with DBus. Everything else in Ekos uses signals/slots. I will let you know once I've made progress on this in a few days.

After checking how this needs to change, I suggest we merge this diff to clarify the baseline.

wreissenberger requested changes to this revision. Aug 29 2018, 9:15 AM

I did some tests, and it looks good. There is one test that already failed in D14684 and is still failing. It might be exotic, but it leads to continuous looping.

The setup is the following: create a schedule with two jobs for the same target with the same sequence job (say 1xLum, for example). The first job starts and captures the first frame.

And here is where the problem occurs: calling Scheduler::updateCompletedJobsCount() now detects the one frame, but when the second job (which did not run yet) is evaluated, the value in the captured frames map is overwritten. This leads to an endless loop of the first job.

Here is a test case that demonstrates the problem:

Copy the files into /tmp. The frames are stored in /var/tmp/kstars_tests

This revision now requires changes to proceed. Aug 29 2018, 9:15 AM

You reported that issue earlier, and I was not able to reproduce it. I will double-check it in the context of the regression I spotted at https://bugs.kde.org/show_bug.cgi?id=397970. Besides, because of the flexibility of the Capture module, I think the frame enumeration needs to parse the FITS header to properly triage files... That will be fun.

However, because you reported it earlier, I don't think this issue should block this differential?

wreissenberger accepted this revision. Aug 29 2018, 10:49 AM

Agreed, no, this issue should not block it. If it's OK with you, I can try to fix it; I know the place where it goes wrong...

This revision is now accepted and ready to land. Aug 29 2018, 10:49 AM

Another small thing: I observed that the count update happens only after a whole sequence is finished, not after each capture. I guess that's by intention...

No, that's not the intention! The signal is emitted at each new... Wait, wait, wait, did I make a mistake here? Let me check.

Give this test case a try:

At the end, it counts correctly, but it jumps from 0 to 6.

Another small thing: I observed that the count update happens only after a whole sequence is finished, not after each capture. I guess that's by intention...

I confirm the intent is to update the frame count each time an image is received by Capture (CAPTURE_IMAGE_RECEIVED).
If the frame count doesn't change, the reason is elsewhere. I'll check that in the context of https://bugs.kde.org/show_bug.cgi?id=397970 too.

I did some tests, and it looks good. There is one test that already failed in D14684 and is still failing. It might be exotic, but it leads to continuous looping.

The setup is the following: create a schedule with two jobs for the same target with the same sequence job (say 1xLum, for example). The first job starts and captures the first frame.

And here is where the problem occurs: calling Scheduler::updateCompletedJobsCount() now detects the one frame, but when the second job (which did not run yet) is evaluated, the value in the captured frames map is overwritten. This leads to an endless loop of the first job.

The fix is quite simple. Here is the patch:

The fix is quite simple. Here is the patch:

When I look at your patch, I don't readily understand how, at the point you target, newFramesCount could already hold a value different from the one we are about to assign.
So let's analyze.

The map newFramesCount is empty at the beginning of the function.
That map receives a value keyed by a signature when...

  • ...the cache map capturedFramesCount does not contain that signature as a key,
  • ...or for the JOB_IDLE and JOB_EVALUATION states, for which we don't trust capturedFramesCount and want to recount.

In those two situations, the algorithm calls getCompletedFiles to obtain the storage contents.
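
To make the analysis easier to follow, here is a paraphrased sketch of the loop under discussion. Names follow the thread, but accessors such as getSequenceSignatures() and getCompletedFilesCount() are hypothetical stand-ins for the real code in Scheduler::updateCompletedJobsCount():

    QMap<QString, uint16_t> newFramesCount;
    for (SchedulerJob *job : jobs)
    {
        bool recount = job->getState() == SchedulerJob::JOB_IDLE ||
                       job->getState() == SchedulerJob::JOB_EVALUATION;
        for (const QString &signature : job->getSequenceSignatures())
        {
            if (!recount && capturedFramesCount.contains(signature))
                newFramesCount[signature] = capturedFramesCount[signature];    // trust the cache
            else
                newFramesCount[signature] = getCompletedFilesCount(signature); // recount the storage
        }
    }
    capturedFramesCount = newFramesCount;   // the cache is replaced only after the loop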

Your patch essentially means it is possible for capturedFramesCount to hold two different values for the same signature during a single run.
That I agree on, but in order to fix the real issue, the question must be: "How is it possible for capturedFramesCount to end up with two different values during one run?"

The reason is, actually, that this function relies on a cache: capturedFramesCount does not represent the current state of the storage, but is used to avoid the relatively high cost of getCompletedFiles.
When jobs are evaluated and one job triggers a recount because it is complete, capturedFramesCount will be refreshed with getCompletedFiles, and that will introduce a discrepancy for the other jobs that were evaluated before.
Depending on the evaluation order, sometimes we see that discrepancy, sometimes not.

I believe this is the root cause to fix, and I think your patch does not address that. What do you think?

Hm, I think the reason is the strategy for how the cache is updated, not a pure invalid-cache problem.

Let's consider Job A and Job B, both having identical signatures with one Luminance frame. Job A starts before Job B, and both run in two cycles. As a result of Job A finishing (if Job A does not get into trouble), we would expect Lum-001.fits and Lum-002.fits.

Now let's see what happens when the first of the two cycles of A finishes. Job A has produced Lum-001.fits and returns to evaluateJobs(), which calls updateCompletedJobsCount(). The latter cycles over all sequence jobs. When it passes Job A (remember, it's the first one in the list), which is in the state JOB_EVALUATION, it calculates

newFramesCount["...Lum"] = getCompletedFiles("...Lum", ...)   // result: 1

After this, Job B is evaluated for the same signature "...Lum", but it takes another branch in the loop and uses the cached value:

auto sigCount = capturedFramesCount.find("...Lum")   // iterator into the cache
newFramesCount["...Lum"] = sigCount.value()          // result: 0, the stale cached value

Since capturedFramesCount only gets updated after all sequence jobs have been evaluated, it still holds the old value from the run before.

So we need a solution for how the cache is updated. I think my solution is not bad, since it takes the maximum of the (old) cached value and a freshly calculated value; the alternative is to avoid cache access during the cache update completely. A sketch of the merge follows.
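
For reference, the merge I have in mind amounts to something like this, assuming both maps are QMap<QString, uint16_t> (the actual types in KStars may differ):

    // Merge fresh counts into the cache, never letting a stale cached value
    // overwrite a higher freshly counted value.
    for (auto it = newFramesCount.constBegin(); it != newFramesCount.constEnd(); ++it)
        capturedFramesCount[it.key()] = qMax(capturedFramesCount.value(it.key()), it.value());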

I agree your patch will fix the use case you are describing. But given that root cause, I would for instance prefer a patch that checks the jobs and, if one of them is in a state that will make the cache incoherent, drops the cache before looping through the jobs, as sketched below.
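
Something along these lines, as a rough sketch (enum and accessor names approximate the real SchedulerJob API):

    #include <algorithm>

    // If any job is about to be (re-)evaluated, its counts cannot be trusted,
    // so drop the whole cache and recount everything from the storage.
    bool cacheIncoherent = std::any_of(jobs.begin(), jobs.end(), [](SchedulerJob *job)
    {
        return job->getState() == SchedulerJob::JOB_IDLE ||
               job->getState() == SchedulerJob::JOB_EVALUATION;
    });
    if (cacheIncoherent)
        capturedFramesCount.clear();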

Extending on this, I think I'd rather drop the whole concept of a cache and rewrite those functions using parallel asynchronous jobs. We have had a lot of regressions due to that frame map, probably because it was a premature optimization.

wreissenberger added a comment (edited). Aug 29 2018, 7:18 PM

OK, got the point, you are right. But parallel asynchronous jobs, hm... that also adds complexity.

Another idea: do we really need to iterate over all jobs? After a job finishes, we know for which signatures the values might have changed: exactly those of the job that just finished. Why not update just those signatures of the job that has been executed, straightforwardly, through updateCompletedJobsCount()?

Because that requires checking all jobs for sequences that include the signature matching the file name under which the capture was stored. That indeed adds complexity, and duplicates what the code we are discussing already does. And it requires a different signal than what we have today.

Besides, the capture storage is entirely outside of our control. We cannot be sure that when an image is signaled, the capture file is effectively stored, because a different module did the storing. We also have no control over what the end-user does in parallel; perhaps that person is checking and dropping frames (although it is debatable whether the Scheduler should be affected by this). In the future, that capture storage may be a remote database with which we interact through a REST API, who knows? (astrobin, I'm looking at you)

Do we really need to consider what happens on the storage? I would simply count a frame once it has been stored successfully. If it disappears at a later stage, why should we care?

I personally tend to move captured images during the night from my Raspberry Pi to my NAS while the camera is shooting. I would prefer it if the Scheduler simply counted captured frames, no matter whether they still sit in the place where they were originally stored.

Maybe the easiest way is to keep all job counts in memory as long as Ekos is running. After a reset, all counts would be set to zero.

Too radical?

This is not too radical: this is what disabling "Remember Job Progress" results in :) See how unclear the feature name is?

You may for instance consider that the option "Remember Job Progress" is there to control the total imaging duration. Imagine you want 8 hours of usable frames of a particular object: you set up the Scheduler to take, say, 32 15-minute frames of that target. Then, whatever you do with these frames in between, the job of the Scheduler is to provide you with those 32 frames over a few nights.

Well, indeed, the feature name is not so self-explanatory. But currently, the flag is not used inside updateCompletedJobsCount(), right? As far as I understand, it is only used to estimate the duration. It looks like both methods could share a lot of code... OK, it looks like there is room for improvement in Capture.

As a conclusion: I would recommend that we implement the following cache update algorithm:

  • determine updates for the JOB_IDLE and JOB_EVALUATION states and collect these updates in newFramesCount
  • merge this map into the old cache map in order to avoid overwriting fresh values with old cached values (not simply replacing the old cache)
mutlaqja added inline comments. Aug 31 2018, 10:00 AM
kstars/ekos/ekosmanager.cpp
1805 ↗(On Diff #40078)

I think this needs to wait until the new DBus interface is complete. It's in progress in the dbus_work branch.

Well, indeed, the feature name is not so self-explanatory. But currently, the flag is not used inside updateCompletedJobsCount(), right? As far as I understand, it is only used to estimate the duration. It looks like both methods could share a lot of code... OK, it looks like there is room for improvement in Capture.

As a conclusion: I would recommend that we implement the following cache update algorithm:

  • determine updates for the JOB_IDLE and JOB_EVALUATION states and collect these updates in newFramesCount
  • merge this map into the old cache map in order to avoid overwriting fresh values with old cached values (not simply replacing the old cache)

As a transitory and admittedly unsatisfactory measure, I proposed D15230. Let's continue the discussion there.

OK - already done :-)

TallFurryMan planned changes to this revision. Sep 11 2018, 11:28 AM

Waiting for the dbus_work branch to be merged before reworking this differential.

Makes sense, DBus should have priority.

You could use the new newSequenceImage DBus signal from the Capture module for this purpose.
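
A sketch of how the Scheduler side could subscribe to it; the service, path, and interface names follow the usual org.kde.kstars pattern but should be double-checked, and setCaptureComplete is a hypothetical slot:

    #include <QDBusConnection>

    // Subscribe to Capture's DBus signal instead of the in-process Qt signal.
    QDBusConnection::sessionBus().connect(
        QStringLiteral("org.kde.kstars"),
        QStringLiteral("/KStars/Ekos/Capture"),
        QStringLiteral("org.kde.kstars.Ekos.Capture"),
        QStringLiteral("newSequenceImage"),
        this, SLOT(setCaptureComplete(QString)));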

TallFurryMan retitled this revision from "Connect new-image event to Scheduler imaging time estimator" to "Update job completion counts, and fix regression on job rescheduling". Oct 2 2018, 12:14 AM
TallFurryMan edited the summary of this revision. (Show Details)
TallFurryMan edited the test plan for this revision. (Show Details)
TallFurryMan edited the summary of this revision. (Show Details)

Reworked the differential to fix the scheduling of jobs.

This revision is now accepted and ready to land. Oct 2 2018, 12:15 AM

It seems I accepted my own change, but I have no idea how I did that. It was a bit late, sorry.
This morning I finished verifying all test vectors, so this differential is ready from my point of view and I will base further changes on it.
This differential might still be a bit rough in two cases: approaching twilight, and approaching the altitude limit. Those issues are next on my list.

Hi Eric,
I get an error message from arc patch:

arc patch D14942
Created and checked out branch arcpatch-D14942.

    This diff is against commit 018f6986f3887648a6f31f0597de063a53a92675, but
    the commit is nowhere in the working copy. Try to apply it against the
    current working copy state? (40d75559189dabe735a9f9e55c0641ba86a58ef7)

Should I continue, or else where could I get the missing commit? It does not seem to exist in Git master.

Great, I'll test it now.

mutlaqja accepted this revision. Oct 3 2018, 12:40 PM
This revision was automatically updated to reflect the committed changes.

Hi Eric,
I get an error message from arc patch:

arc patch D14942
Created and checked out branch arcpatch-D14942.

    This diff is against commit 018f6986f3887648a6f31f0597de063a53a92675, but
    the commit is nowhere in the working copy. Try to apply it against the
    current working copy state? (40d75559189dabe735a9f9e55c0641ba86a58ef7)

Should I continue, or else where could I get the missing commit? It does not seem to exist in Git master.

I'm sorry, I have no idea... Could it be that you need to fetch from the remote server before patching?


Aaah, found the solution. I need to add your repository as a remote, then everything works fine.
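
For reference, the workaround boils down to something like this (the remote name and repository URL are placeholders):

    git remote add tallfurryman <repository-url>
    git fetch tallfurryman
    arc patch D14942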

Ah! Interesting! Thanks!

Looks good, my tests succeeded. Excellent, this feature is very helpful!