job: Drop state of stopped jobs and fix runtime job stop order#72
Conversation
60fe93b to
bbcc817
Compare
gandro
left a comment
There was a problem hiding this comment.
I'm not familiar with the internals, so apologies if I misunderstood something. But I've found being able to see the status of completed jobs in the sysdump or the dbg shell extremely useful. In particular, since agent logs often end up being truncated.
So I think unconditionally removing completed jobs (especially if they error-ed out) is a step backwards for troubleshooting. I would prefer it if jobs were cleaned up only after a certain time (and for some static jobs, maybe never?)
gandro
left a comment
There was a problem hiding this comment.
I'm not intimately familiar with the lifecycle infrastructure, so I can't really meaningfully review the last commit, but the rest looks good to me
|
@dylandreimerink could you review the last commit? |
bbcc817 to
22c7cf8
Compare
To support huge amounts of jobs that start and stop during the runtime close the job health scopes instead of marking them stopped to avoid accumulating an ever growing number of health entries. Signed-off-by: Jussi Maki <jussi@isovalent.com>
| defer close(qj.done) | ||
| defer func() { | ||
| qj.registry.runtimeLifecycle.remove(qj) | ||
| qj.registry.logger.Info("Job stopped", |
There was a problem hiding this comment.
Is Info level too much here or should we stick with Debug?
For jobs added before starting this does lead to double-logging:
logger.go:256: time=2026-05-26T11:26:48.077+02:00 level=INFO
source=/home/jussi/go/src/github.com/cilium/hive/job/job.go:214 msg="Job stopped"
module=jobs job=g1-third duration=21.11289ms
logger.go:256: time=2026-05-26T11:26:48.077+02:00 level=INFO
source=/home/jussi/go/src/github.com/cilium/hive/cell/lifecycle.go:198 msg="Stop hook executed"
function="*job.queuedJob.Stop (jobs)" detail=g1-third duration=10.209µs
There was a problem hiding this comment.
On the one hand yes, users probably don't care. But if you make it debug you have to enable debug logging and this info will not be in most sysdumps we get. So how frequently do we expect issue with it?
There was a problem hiding this comment.
Hmm maybe let's keep the info logging but only do it for jobs added during runtime and rely on the lifecycle logging otherwise? EDIT: done. looks like this now:
logger.go:256: time=2026-05-26T11:43:17.584+02:00 level=INFO
source=/home/jussi/go/src/github.com/cilium/hive/cell/lifecycle.go:158 msg="Start hook executed"
function="*job.queuedJob.Start (jobs)" detail=g1-third duration=1.667µs
logger.go:256: time=2026-05-26T11:43:17.584+02:00 level=INFO
source=/home/jussi/go/src/github.com/cilium/hive/hive.go:407 msg="Started hive" duration=151.387µs
logger.go:256: time=2026-05-26T11:43:17.595+02:00 level=INFO
source=/home/jussi/go/src/github.com/cilium/hive/job/job.go:226 msg="Job started"
module=jobs job=g1-dynamic
...
logger.go:256: time=2026-05-26T11:43:17.605+02:00 level=INFO
source=/home/jussi/go/src/github.com/cilium/hive/hive.go:421 msg="Stopping hive"
logger.go:256: time=2026-05-26T11:43:17.605+02:00 level=INFO
source=/home/jussi/go/src/github.com/cilium/hive/job/job.go:217 msg="Job stopped"
module=jobs job=g1-dynamic duration=10.543427ms
logger.go:256: time=2026-05-26T11:43:17.605+02:00 level=INFO
source=/home/jussi/go/src/github.com/cilium/hive/cell/lifecycle.go:198 msg="Stop hook executed"
function="*job.queuedJob.Stop (jobs)" detail=g1-third duration=18.996µs
It is a little bit confusing of course, but probably not a big deal. The duration for "Job stopped" is the whole duration of the job and in the stop hook that's of course the duration of the stop hook execution. Not sure what log fields would be more obvious to use here...
The runtime lifecycle of the job registry is being stopped too late which leads to jobs started at runtime to being stopped after all the other stop hooks. This may lead to a dependency of a job being stopped before the job. To fix this add support for "pre-stop" hooks that are called before other stop hooks. Signed-off-by: Jussi Maki <jussi@isovalent.com>
22c7cf8 to
4dedbfa
Compare
To support large numbers of jobs started and stopped at runtime we need to avoid accumulating state related to them. Earlier commit already fixed the accumulating of health entries by closing the health scope instead of marking it stopped. We were additionally appending each "queuedJob" to the "dynamicLC" and never removing from it. This introduces a new job-specific lifecycle (jobLifecycle) that maintains a linked-list of jobs started at runtime with support for removing from this list when a job stops before the application stops. Signed-off-by: Jussi Maki <jussi@isovalent.com>
4dedbfa to
ff7aa60
Compare
This fixes two issues:
The
job.registrywas stopped effectively as the last thing in an application due to it being started first. As part of this the jobs started at runtime were stopped. This in practice is way too late since that means potential dependencies of jobs might've been stopped before the job itself. Fix this by introducing a "PreStopHook" interface to mark a hook as something that needs to be stopped before all other hooks. Suggestions welcome for a cleaner approach.Stopped jobs accumulate state in the form of health entries and lifecycle hooks. Fix first issue by closing the health scopes of jobs instead of marking them stopped (earlier attempt at job: Close the health reporter instead of stopping #20). This does mean we don't see "Stopped" entries for jobs in the health table anymore, but we do log when a job is stopped so this information is still available. The second issue is fixed by implementing a job-specific lifecycle that allows cheap removal of jobs from the list of jobs to stop. This has an observable effect in logs as we no longer see the
DefaultLifecyclelogging that a start and stop hook was called for the job. To avoid reducing useful logging thequeuedJobnow logs when a job is started and when a job is stopped.