Skip to content

job: Drop state of stopped jobs and fix runtime job stop order#72

Merged
joamaki merged 3 commits into
mainfrom
pr/joamaki/hive-job-lifecycle-fixes
May 26, 2026
Merged

job: Drop state of stopped jobs and fix runtime job stop order#72
joamaki merged 3 commits into
mainfrom
pr/joamaki/hive-job-lifecycle-fixes

Conversation

@joamaki

@joamaki joamaki commented May 21, 2026

Copy link
Copy Markdown
Contributor

This fixes two issues:

  1. The job.registry was stopped effectively as the last thing in an application due to it being started first. As part of this the jobs started at runtime were stopped. This in practice is way too late since that means potential dependencies of jobs might've been stopped before the job itself. Fix this by introducing a "PreStopHook" interface to mark a hook as something that needs to be stopped before all other hooks. Suggestions welcome for a cleaner approach.

  2. Stopped jobs accumulate state in the form of health entries and lifecycle hooks. Fix first issue by closing the health scopes of jobs instead of marking them stopped (earlier attempt at job: Close the health reporter instead of stopping #20). This does mean we don't see "Stopped" entries for jobs in the health table anymore, but we do log when a job is stopped so this information is still available. The second issue is fixed by implementing a job-specific lifecycle that allows cheap removal of jobs from the list of jobs to stop. This has an observable effect in logs as we no longer see the DefaultLifecycle logging that a start and stop hook was called for the job. To avoid reducing useful logging the queuedJob now logs when a job is started and when a job is stopped.

@joamaki joamaki requested a review from a team as a code owner May 21, 2026 09:59
@joamaki joamaki requested review from derailed and removed request for a team May 21, 2026 09:59
@joamaki joamaki force-pushed the pr/joamaki/hive-job-lifecycle-fixes branch 2 times, most recently from 60fe93b to bbcc817 Compare May 21, 2026 10:36

@gandro gandro left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with the internals, so apologies if I misunderstood something. But I've found being able to see the status of completed jobs in the sysdump or the dbg shell extremely useful. In particular, since agent logs often end up being truncated.

So I think unconditionally removing completed jobs (especially if they error-ed out) is a step backwards for troubleshooting. I would prefer it if jobs were cleaned up only after a certain time (and for some static jobs, maybe never?)

Comment thread job/oneshot.go

@gandro gandro left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not intimately familiar with the lifecycle infrastructure, so I can't really meaningfully review the last commit, but the rest looks good to me

@joamaki

joamaki commented May 26, 2026

Copy link
Copy Markdown
Contributor Author

@dylandreimerink could you review the last commit?

@joamaki joamaki requested a review from dylandreimerink May 26, 2026 07:17
@joamaki joamaki force-pushed the pr/joamaki/hive-job-lifecycle-fixes branch from bbcc817 to 22c7cf8 Compare May 26, 2026 09:14
To support huge amounts of jobs that start and stop during
the runtime close the job health scopes instead of marking
them stopped to avoid accumulating an ever growing number of
health entries.

Signed-off-by: Jussi Maki <jussi@isovalent.com>
Comment thread job/observer.go
Comment thread job/job.go Outdated
defer close(qj.done)
defer func() {
qj.registry.runtimeLifecycle.remove(qj)
qj.registry.logger.Info("Job stopped",

@joamaki joamaki May 26, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is Info level too much here or should we stick with Debug?

For jobs added before starting this does lead to double-logging:

    logger.go:256: time=2026-05-26T11:26:48.077+02:00 level=INFO 
source=/home/jussi/go/src/github.com/cilium/hive/job/job.go:214 msg="Job stopped" 
module=jobs job=g1-third duration=21.11289ms
    logger.go:256: time=2026-05-26T11:26:48.077+02:00 level=INFO 
source=/home/jussi/go/src/github.com/cilium/hive/cell/lifecycle.go:198 msg="Stop hook executed" 
function="*job.queuedJob.Stop (jobs)" detail=g1-third duration=10.209µs

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the one hand yes, users probably don't care. But if you make it debug you have to enable debug logging and this info will not be in most sysdumps we get. So how frequently do we expect issue with it?

@joamaki joamaki May 26, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm maybe let's keep the info logging but only do it for jobs added during runtime and rely on the lifecycle logging otherwise? EDIT: done. looks like this now:

    logger.go:256: time=2026-05-26T11:43:17.584+02:00 level=INFO 
source=/home/jussi/go/src/github.com/cilium/hive/cell/lifecycle.go:158 msg="Start hook executed" 
function="*job.queuedJob.Start (jobs)" detail=g1-third duration=1.667µs
    logger.go:256: time=2026-05-26T11:43:17.584+02:00 level=INFO 
source=/home/jussi/go/src/github.com/cilium/hive/hive.go:407 msg="Started hive" duration=151.387µs
    logger.go:256: time=2026-05-26T11:43:17.595+02:00 level=INFO 
source=/home/jussi/go/src/github.com/cilium/hive/job/job.go:226 msg="Job started" 
module=jobs job=g1-dynamic
...
    logger.go:256: time=2026-05-26T11:43:17.605+02:00 level=INFO 
source=/home/jussi/go/src/github.com/cilium/hive/hive.go:421 msg="Stopping hive"
    logger.go:256: time=2026-05-26T11:43:17.605+02:00 level=INFO 
source=/home/jussi/go/src/github.com/cilium/hive/job/job.go:217 msg="Job stopped" 
module=jobs job=g1-dynamic duration=10.543427ms
    logger.go:256: time=2026-05-26T11:43:17.605+02:00 level=INFO 
source=/home/jussi/go/src/github.com/cilium/hive/cell/lifecycle.go:198 msg="Stop hook executed" 
function="*job.queuedJob.Stop (jobs)" detail=g1-third duration=18.996µs

It is a little bit confusing of course, but probably not a big deal. The duration for "Job stopped" is the whole duration of the job and in the stop hook that's of course the duration of the stop hook execution. Not sure what log fields would be more obvious to use here...

The runtime lifecycle of the job registry is being stopped
too late which leads to jobs started at runtime to being stopped
after all the other stop hooks. This may lead to a dependency of
a job being stopped before the job.

To fix this add support for "pre-stop" hooks that are called
before other stop hooks.

Signed-off-by: Jussi Maki <jussi@isovalent.com>
@joamaki joamaki force-pushed the pr/joamaki/hive-job-lifecycle-fixes branch from 22c7cf8 to 4dedbfa Compare May 26, 2026 09:42
To support large numbers of jobs started and stopped at runtime
we need to avoid accumulating state related to them. Earlier
commit already fixed the accumulating of health entries by closing
the health scope instead of marking it stopped. We were additionally
appending each "queuedJob" to the "dynamicLC" and never removing from
it.

This introduces a new job-specific lifecycle (jobLifecycle) that maintains
a linked-list of jobs started at runtime with support for removing
from this list when a job stops before the application stops.

Signed-off-by: Jussi Maki <jussi@isovalent.com>
@joamaki joamaki force-pushed the pr/joamaki/hive-job-lifecycle-fixes branch from 4dedbfa to ff7aa60 Compare May 26, 2026 10:02
@joamaki joamaki merged commit ba9edde into main May 26, 2026
1 check passed
@joamaki joamaki deleted the pr/joamaki/hive-job-lifecycle-fixes branch May 26, 2026 10:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants