Capture counting corrected
ClosedPublic

Authored by wreissenberger on Mar 24 2019, 9:59 PM.

Details

Summary

In some situations the scheduler does not count correctly:

  1. Remember Job Progress not set
    • Create a schedule with at least two jobs and let them run. After the last one terminates, a null pointer exception occurs.
    • Create a schedule job with a fixed amount of repeatings where the imaging job has more than one sequence job (i.e. LRGB). After the first run of the sequence, the amount of captured frames is a multiple of the frames captured.
  1. Remember Job Progress is set
    • Similar setup as 1. Let the job run until completion. New delete one single frames and restart. The log shows that "Job ... is proceeding directly to capture stage because only calibration frames are pending", the scope does not slew but capturing is started immediately.
Test Plan

Run the two situations with and without this correction.

Diff Detail

Repository
R321 KStars
Branch
capture_counting
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 10210
Build 10228: arc lint + arc unit
wreissenberger created this revision.Mar 24 2019, 9:59 PM
Restricted Application added a project: KDE Edu. · View Herald TranscriptMar 24 2019, 9:59 PM
Restricted Application added a subscriber: kde-edu. · View Herald Transcript
wreissenberger requested review of this revision.Mar 24 2019, 9:59 PM
wreissenberger edited the summary of this revision. (Show Details)Mar 24 2019, 10:03 PM
TallFurryMan requested changes to this revision.Mar 25 2019, 7:06 AM
TallFurryMan added inline comments.
kstars/ekos/scheduler/scheduler.cpp
1455 ↗(On Diff #54733)

A null pointer exception here? I see no direct reason for it (in setCurrentJob), nor for the removal of the current job if no job is left to run, could you elaborate on the consequences?

5061 ↗(On Diff #54733)

This fixes the count correctly, OK.

Side note: I think SchedulerJob cannot just be a model (with the latest changes it's less and less a model actually). We need to move the repeat count management to that class at some point. That's another story I'd like to have a look at.

5212 ↗(On Diff #54733)

I don't understand why you removed that line. Variable "captures_completed" can't always equal zero when not remembering progress: runtime events may abort the job, but still number of captures must be retained. This would cause areJobCapturesComplete to flip to true immediately...

5268 ↗(On Diff #54733)

If the algorithm is OK, I would expect totalCompletedCount == captures_completed if rememberJobProgress is false.
I see the "optimization" andI agree on the principle, but I'd prefer a debug assertion here in a else. This change is introducing a dependency between the end and beginning of the algorithm, while we strive to clarify it.

This revision now requires changes to proceed.Mar 25 2019, 7:06 AM
wreissenberger requested review of this revision.Mar 25 2019, 3:38 PM

Please check my comments, from my point of view everything should be fine.

kstars/ekos/scheduler/scheduler.cpp
1455 ↗(On Diff #54733)

The problem occurs in line 4675 (sic!):
if (currentJob->getRepeatsRemaining() == 0)
when trying to find the next job. evaluateJobs() is called before and sets currentJob to nullptr if no jobs are left over to be executed.

5212 ↗(On Diff #54733)

This is irrelevant, since the number of completed frames is only changed in the case that we remember the job progress (see lines 5266ff). In theory we could leave it, but it is not used in the case that we do not remember the job progress.

This together with lines 5266ff resolve the second bullet point of 1. In the old implementation, the number of completed frames has been added for each of the sequence job entries and counted the same value multiply.

5268 ↗(On Diff #54733)

An assertion does not make sense for the else branch, since in this case the number of completed frames is already set in the SequenceJob.

By the way: I found out that the null pointer exception occurs also when there is only one scheduler entry. So I think it's quite urgent that we at least fix this issue...

TallFurryMan added inline comments.Mar 26 2019, 8:52 PM
kstars/ekos/scheduler/scheduler.cpp
1455 ↗(On Diff #54733)

Then a fix is needed inside findNextJob, or if findNextJob assumes currentjob cannot be nullptr, there should be a debug assertion, and the call of findNextJob should be restricted.
HOWEVER, I agree evaluateJobs must not change the current job, but that should not be part of this differential.

1952

The whole block after this should not be part of the evaluation I think as it does more than that. It should be part on the execution of the schedule. But let's keep this for another time.

2003

There is another one here that could cause your dereference. Still findNextJob should get a fix, eventually use the whole enclosing block in the future.

5212 ↗(On Diff #54733)

I still don't understand : capture_completed is used in areJobCapturesComplete to determine whether captures are needed, plus it may happen that some frames but not all are already captured when executing that block, even when not remembering progress. Removing this considers everything is still needed, which isn't true.

Unless what you say is that we are counting the SCHEDULER job frame count as many times as there are SEQUENCE jobs, and that I agree is completely wrong! Good spot!

So it happens that we cannot estimate the duration of a Scheduler job when not remembering progress? Couldn't we use the frame map?

5268 ↗(On Diff #54733)

OK to not change the job count if not remembering progress.

By the way: I found out that the null pointer exception occurs also when there is only one scheduler entry. So I think it's quite urgent that we at least fix this issue...

Agreed! I've been hit by exactly this yesterday evening, using a build that was one month old.
My dumb brain needs to agree with you on capture_completed, and we're good to go.
Unless maybe you split your diff into two to manage findNextJob in one of them and the duration estimation in the other?
There's a need on splitting evaluateJobs into a real job evaluation and the runtime selection that should be in findNextJob, if you see what I mean.

Your comment about capture_completed is right, I'm trying to fix it. The problem is, that captures_completed = schedJob->getCompletedCount() is called for each sequence job entry and added to totalCompletedCount, i.e. the total amount of the entire capture run is added multiply.

So take a simple LLRGB capture sequence where the scheduler defines 3 runs. After one run of the capture sequence, totalCompletedCount = 5*3 = 15, i.e. the scheduler terminates after the first iteration (which is wrong).

On the other side, captures_required needs to be calculated inside of the capture sequence loop to calculate (2+1+1+1)*3=15 captures required. If we had a method where the capture job could answer the 2+1+1+1 question directly, we would't need the loop either.

Situation is clear, thanks, and sorry for missing the point at first.

Block 5154-5157 retrieves the frame count from the map.
The map is only valid when checking the storage. During that check we list the output folder of the sequence job. When "Remember Job Progress" isn't set, we don't fill the map <-- probably we should!
I suggest we update the capture map around line 6970: we either just update if not remembering, or recount from storage if remembering. I was probably too lazy at that time to do that, or maybe my differential was already too big.
Then we remove the condition on rememberJobProgress at line 5124, and we still go into enumerating sequence jobs with the map that was consolidated while captures were received.
We keep your removal at line 5214 and your if at line 5268.

What do you think?

Well, filling the map in any cases will not help, because the intention of not ticking "Remember progress" is exactly, that we start counting from zero - no matter how many are in the file system.

The core problem is, that we do not count captures on single capture sequence level. As a consequence, we are not able to calculate the remaining capture time. It simply is quite tricky... I will dig into it the next days...

Well, filling the map in any cases will not help, because the intention of not ticking "Remember progress" is exactly, that we start counting from zero - no matter how many are in the file system.

No! Not remembering job progress means that we do not update the frame count from storage in the scheduler. It does not mean that we start from zero each time in the scheduler. We do, however, start from zero when executing a sequence job when not remembering progress. We still count frames captured at the scheduler job level.

The core problem is, that we do not count captures on single capture sequence level. As a consequence, we are not able to calculate the remaining capture time. It simply is quite tricky... I will dig into it the next days...

Agreed, that's what I propose to do in my previous post. Counting frames when they are emitted by the capture module.

I think I found a way to correct the counting of frames for the case that remembering is ticked off.

And yes, Eric, I fully agree what you said about counting - that's what I tried to express, but obviously it didn't come around.

Hope we are through now.

wreissenberger marked 4 inline comments as done.Mar 27 2019, 11:21 PM

Can you please check this?

https://indilib.org/forum/general/4908-meridian-flip-issue-with-the-scheduler.html

The user mentioned checking "Remember Job Progress", though not sure how this is related to the issue he's describing. Any thoughts?

Hm, maybe, but I think it's a different issue. But thanks for the hint, I will try to reproduce it.

Update: There was an another bug in calculating whether light frames are required by a schedule job. In the case that "Repeat for x runs" is selected and one sequence has more captures than required and another one has less, the scheduler job assumes than no light frames are required - which is wrong.

Test case example

  • Take a LRGB sequence and set the scheduler to 3 repeats
  • Let it run, so that you obtain 3xL, 3xR, ...
  • Delete the L frames
  • Change the scheduler job to 2 repeats and leave the rest as is
  • Restart the scheduler job

Expected result: the L sequence should run twice while no more RGB frames are created.

Without this fix, the log message appears that only calibration frames are necessary and does not slew to the target but immediately starts to capture frames.

Thanks!!

Eric, were you able to test this?

That's a good idea, but weeell I have two disagreements : first this is integer calculation and you probably need to reorder your operators, and second, if I understand correctly, you are considering the amount of captures to get equiprobable scattering over all sequence jobs.
I'll nonetheless test this asap.

kstars/ekos/scheduler/scheduler.cpp
4601

I believe that if you arrive here with currentJob equal to nullptr, there are precautions to take earlier than line 4676?

That's a good idea, but weeell I have two disagreements : first this is integer calculation and you probably need to reorder your operators, and second, if I understand correctly, you are considering the amount of captures to get equiprobable scattering over all sequence jobs.
I'll nonetheless test this asap.

Could you please be more specific? To be honest, I do not understand what you mean.

kstars/ekos/scheduler/scheduler.cpp
4601

currentJob == nullptr might be result of evaluateJobs() - so I do not see how to handle this before line 4676.

It's unclear, but maybe, yes. Could I offer my own implementation on the two fixes that are in this differential? I'd like to first fix the FindNextJob issue, then in another diff the frame counting via messages from the capture module.

My previous message is about the calculation : the order of operators produces 1 when less than a full sequence is executed. It also considers sequences are distributed equally between jobs, which I disagree with as the code, in remember mode, is trying to gather frames to complete sequences in order, then schedule remaining ones.

It's unclear, but maybe, yes.

Agreed, the issue we are working here is a good hint. I asked for more details on the forum.

Could I offer my own implementation on the two fixes that are in this differential? I'd like to first fix the FindNextJob issue, then in another diff the frame counting via messages from the capture module.

Absolutely fine, I do not have the ambition that I fix it. I simply want it to be fixed asap. :-)

My previous message is about the calculation : the order of operators produces 1 when less than a full sequence is executed. It also considers sequences are distributed equally between jobs, which I disagree with as the code, in remember mode, is trying to gather frames to complete sequences in order, then schedule remaining ones.

You are right, the calculation of captured frames of a certain sequence job is only correct as long as the entire capture job has completed. If it does not run completely, the frames taken in the last cycle are not counted correctly.
Let's take an example with a LLLRGB sequence that completes twice and terminates after two L frames. In the calculation, we have schedJob->getCompletedCount() = 14, capturesPerRepeat=6 and seqJob->getCount() = 3. As a result we get 14/6*3 = 6 - which is wrong, it should be 8.

Nevertheless, it is by far better than the current situation, where the result would be 14.

Personally, I would recommend to apply this diff and fix the counting in a subsequent one. If we want to do it perfectly right, we need to use the capturedFramesMap, so that a captured frame is documented properly. But this requires some effort.

Related? https://indilib.org/forum/ekos/4995-ekos-scheduler-eats-frames-from-the-sequence.html#37550

With the latest information posted - YES. It's exactly the situation with a capture job of 18*L+3*R+3*G+3*B.

Related? https://indilib.org/forum/ekos/4995-ekos-scheduler-eats-frames-from-the-sequence.html#37550

With the latest information posted - YES. It's exactly the situation with a capture job of 18*L+3*R+3*G+3*B.

Turned out meanwhile that it is not related. It is simply missing awareness of the option "Remember job progress". Maybe we should consider moving this option directly to the scheduler tab - or show at least there, that the option is active.

Yeah I agree, let's move it to scheduler page.

Back to the original question, how to proceed. I tried to construct a situation where the calculation of captured frames is call after the partial completion of a scheduler job where exactly that problem occurs we discussed above - I failed, I could not bring kstars into such a state. So maybe we have a theoretical discussion here.

I'm sorry my availability is really low these days. Hopefully "better" in the coming w16 as I'll be in holidays.
When a scheduler job aborts, it does not change the completed frame count. So probably an in-sequence focus failing might do the job?

When a scheduler job aborts, it does not change the completed frame count. So probably an in-sequence focus failing might do the job?

Aborting, right, but restarting resets the counters to zero.

Any consensus reached it? Should this go into 3.2 before I release it officially or delay it to 3.3?

I would opt for 3.2. The behavior is better than without the fix.

mutlaqja accepted this revision.Apr 14 2019, 3:49 PM

Ok I plan to release 3.2 today as there are a lot of pending fixes that needs to go out. So I'll include this fix, and let's figure out if it causes any regressions.

This revision was not accepted when it landed; it landed in state Needs Review.Apr 14 2019, 3:49 PM
Closed by commit R321:18fdbc7f39a5: Capture counting corrected (authored by wreissenberger, committed by mutlaqja). · Explain Why
This revision was automatically updated to reflect the committed changes.