Fix parking engine, and make observatory startup job-centric
ClosedPublic

Authored by TallFurryMan on Aug 25 2018, 9:42 AM.

Details

Summary
  • Consider a mount that cannot park is always unparked. This fixes the state of the mount as seen by the Scheduler when the mount cannot park.
  • Ensure getting parking state returns a valid state, even if last action was failed. This fixes an issue arising when parking operation is aborted, that caused the Mount interface to always return PARKING_ERROR.
  • Make the Scheduler use isMountParked when starting in case the state machine is not up-to-date. This fixes the issue where the Scheduler is started with a mount parking state that was manipulated from the INDI panel in-between.
  • Fix remaining issues in transitions of the parking engine states. This fixes incorrect transitions in the state machine, unrelated to the actual state they were attempted in.

The next changes prepare the rework of the D-Bus connection of the Scheduler towards an event-driven implementation.
These improve Scheduler robustness by managing device events that are not reflected in the state machine.
They consolidate on the parking state changes, and are pushed now because they fix remaining issues with job startup.

  • Refactor INDI connection status test, to avoid depending on the machine state.
  • Move unpark check down to the slew procedure.
  • Refactor state checks into a call to checkStatus, managing most external events.
  • Move job startup from evaluateJobs to executeJob.

The big change in this differential is that the first job is executed AFTER the observatory starts up.
This makes Ekos take control of the observatory at the moment the Scheduler is started, with device connections and setup unparks.
The advantage of this method is that any device issue will fire when the end-user presses the start button, not a few hours later when the Scheduler wakes up.

Test Plan

As this differental relates to parking states, use a basic scheduler job with only tracking set.
Uncheck "Stop Ekos After Shutdown" in the Ekos options to avoid losing driver states when the job finishes.

  • Start Scheduler, check job/Ekos/INDI are started, executed and terminated.
  • Test with "UnPark Mount" and "Park Mount" checked and unchecked.
  • Test with a Scheduler stop in the middle of slewing or capturing, and a restart.

Observe that if "UnPark Mount" is unchecked, the mount is still unparked ; this is by design.
Option "UnPark Mount" is here to make sure the order of unparking is honored.
This will be replaced later by one single checkbox with that specific meaning.

BUG - Observe that if "Park Mount" is unchecked, the mount is NOT unparked at the end of the procedure.
This is by design, and is clearly not symmetric with the "UnPark Mount" option, thus needs to be reworked.

  • Remove the capture storage.
  • Start Scheduler, check job/Ekos/INDI are started, and kill/disconnect indiserver.
  • Check that Scheduler cancels the running action, and attempts to restore connection with devices.
  • If indiserver is still accessible, Scheduler will reconnect and restart the job.
  • If indiserver is not accessible anymore, Scheduler will attempt to reconnect multiple times then stop.
  • If indiserver is accessible but not responsive, Scheduler will attempt to reconnect devices, fail then stop.

Note that using a local Simulator makes it difficult to test because the system is really quick to recover.
During that test, a remote indiserver is easier to test with, and allows for various test cases with the network.

BUG - Observe the issue that Scheduler is unable to handle the case of a single device being disconnected.

  • Remove the capture storage.
  • Start Scheduler, check job/Ekos/INDI are started, wait for mount slew to start.
  • Abort the slew while it is running.
  • Check that Scheduler does notice the problem and restarts the slew.
  • Park the mount while it is slewing.
  • Check that Scheduler does NOT notice the problem until the mount is parked, but then unparks and restarts the slew.
  • Park the mount then abort the park while it is running by unparking the mount, this causes IPS_ALERT on the parking property.
  • Check that Scheduler does notice the problem and restarts the slew.

Originally a parking property in IPS_ALERT would cause the Mount interface to return PARKING_ERROR, causing confusion and hanging Scheduler.

Next test requires two runs, one with guiding NOT set, the other with guiding set. This is obviously an edge case.

  • Remove the capture storage.
  • Start Scheduler, check jobs/Ekos/INDI are started, wait for the mount to capture.
  • Park the mount.
  • Check that Scheduler does NOT notice the problem if guiding is NOT set, and continues to capture until the end of the job.
  • Check that Guider notices the problem though, but nonetheless resumes guiding before the mount is fully parked.
  • Check that situation cannot be recovered.

BUG - Observe the issue that while Guider notices mount is parking, guiding is resumed by stubborn Scheduler, and successfully guides when close to the pole because of the lower slew speed.
This is obviously an edge case, and probably cannot be considered a valid test case as a mount that is parking may not answer to guide commands. Simulator does.

Next test requires the Dome Simulator, and two distant jobs - one that executes now, one that executes later on (repeated duplicate is ok).

  • Remove the capture storage.
  • Check Unpark Dome, Unpark Mount.
  • Start Scheduler, check jobs/Ekos/INDI are started, wait for the end of the first job.
  • Check that Scheduler shuts observatory down if job is farther than preemptive shutdown delay and that option is set.
  • Check that Scheduler parks the setup then sleeps if job is farther than lead time.
  • Check that Scheduler waits if job is sooner than lead time.
  • Abort Scheduler, remove capture storage.
  • Uncheck Unpark Dome, Unpark Mount, and pay attention to current state of those.
  • Start Scheduler, check jobs/Ekos/INDI are started, wait for the end of the first job.
  • Check that Scheduler shuts observatory down if job is farther than preemptive shutdown delay and that option is set.
  • Check that Scheduler does not park the setup then sleeps if job is farther than lead time.
  • Check that Scheduler waits if job is sooner than lead time.

BUG - Observe the issue that if Unpark Dome is not checked, Scheduler is unable to unpark the dome device.
The issue was fixed for the mount device, but is probably the same for the dome and needs to be revisited.

BUG - Observe the issue that if Unpark options are checked, and the first job is farther than the lead time, Scheduler will open the dome, then either shut the dome
immediately if preemptive shutdown is enabled, or leave the dome open while sleeping.

Diff Detail

Repository
R321 KStars
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
TallFurryMan created this revision.Aug 25 2018, 9:42 AM
Restricted Application added a project: KDE Edu. · View Herald TranscriptAug 25 2018, 9:42 AM
Restricted Application added a subscriber: kde-edu. · View Herald Transcript
TallFurryMan requested review of this revision.Aug 25 2018, 9:42 AM
TallFurryMan planned changes to this revision.Aug 26 2018, 6:08 AM

There are two issues I need to test further:

  • a job that starts later than immediately properly connects to the observatory, but:
    • this is a change compared to the previous situation, not mentioned in the commit message
    • I need to verify the observatory is safe during that period, that is, parked properly
  • a job that starts later than immediately, but follows a job that unparked the observatory, might not make the observatory safe. This was not proper before neither, but I wanted to fix and test that here.

However, part of this fixes the two regressions I introduced earlier, so perhaps I should move forward...

Does this fixes the manual shutdown issue as well? It was skipping parking and going to execute script in my observatory.

Yes it "does", that one weirdly enough went through my tests last time. I don't want to merge D15073 yet, so one possible hotfix is to remove the "if(indiState == INDI_READY)" tests that I introduced in D14965, another is to add "|| indiState == INDI_IDLE" for each of them. In any case, the state machine using indiState is NOT robust at all, thus needs to be fixed by D15073.

Adjust a few remaining issues with the sleeping jobs.
Add a warning on preemptive shutdown not being used while a job is far away.
Complete remaining tests with unpark dome/mount/cap options.

TallFurryMan edited the summary of this revision. (Show Details)Aug 26 2018, 2:21 PM
TallFurryMan edited the test plan for this revision. (Show Details)

This is a problem: I'm progressively becoming proud of what I implement ;)
There are a few issues to tackle yet, but this one is ready to go.
Please test!!

wreissenberger requested changes to this revision.Aug 26 2018, 3:37 PM

I did a first check using a remote INDI server:

  • Running a single scheduled job: works fine.
  • Interrupting temporarily the INDI server during slewing and capturing: works fine, EKOS recovers.
  • Interrupting temporarily the INDI server during parking: segmentation fault in Scheduler::manageConnectionLoss() in stopGuiding(). currentJob is NULL which leads to a runtime exception.
This revision now requires changes to proceed.Aug 26 2018, 3:37 PM

Thanks! Sounds logical that a connection loss would also happen with no job running. Will adjust!

TallFurryMan edited the test plan for this revision. (Show Details)

Verified a job is current when stopping guiding.
If connection to indiserver is lost while shutting down, Scheduler considers its job is finished and does not attempt to recover connection.

wreissenberger requested changes to this revision.Aug 27 2018, 1:50 PM

OK, now the scheduler does not crash any more, if the INDI server is not available during parking. But the scheduler module does not recognize that the job has finished. The "Stop"-button is still active and the spinning wheel on the right side is still running - although the text says "No job running".

I tested it locally with telescope_simulator.

This revision now requires changes to proceed.Aug 27 2018, 1:50 PM

Agreed, in this differential I targeted recovering the situation during job execution. When the observatory is being shutdown, there is no job execution properly speaking, and I didn't really touch that part. Same thing for the dome parking, potentially it needs the same type of fix I did for the mount parking. I know I already exceeded the amount of changes in one single differential, but somehow mount parking operation is more deeply mixed in the Scheduler than the rest.

Could we consider the issue you found is not a regression, and as such could be left for a later change? I'll try to open several bugs for what you and I spotted.

I guess we can merge this given you identify the regressions in new bug reports so they can be tracked down.

wreissenberger accepted this revision.Aug 28 2018, 8:09 AM

Agreed. The issues I found are not that critical. The only thing that may happen is that a connection loss during shutdown leads to unfinished shutdown of the observatory. This can be fixed in a separate diff.

This revision is now accepted and ready to land.Aug 28 2018, 8:09 AM
mutlaqja accepted this revision.Aug 28 2018, 8:13 AM
This revision was automatically updated to reflect the committed changes.

Hey :) strictly speaking those are not regressions, the issues also exist in the prior implementation. It's just that it is not possible to get to them because that prior implementation just hangs before that point!

Oh okay, makes sense sorry about that! I've been working on the DBus stuff, but will try to test this and see if I run across other issues as well. Thank you for the great work!