job: Drop state of stopped jobs and fix runtime job stop order by joamaki · Pull Request #72 · cilium/hive

joamaki · 2026-05-21T09:59:05Z

This fixes two issues:

The job.registry was stopped effectively as the last thing in an application due to it being started first. As part of this the jobs started at runtime were stopped. This in practice is way too late since that means potential dependencies of jobs might've been stopped before the job itself. Fix this by introducing a "PreStopHook" interface to mark a hook as something that needs to be stopped before all other hooks. Suggestions welcome for a cleaner approach.
Stopped jobs accumulate state in the form of health entries and lifecycle hooks. Fix first issue by closing the health scopes of jobs instead of marking them stopped (earlier attempt at job: Close the health reporter instead of stopping #20). This does mean we don't see "Stopped" entries for jobs in the health table anymore, but we do log when a job is stopped so this information is still available. The second issue is fixed by implementing a job-specific lifecycle that allows cheap removal of jobs from the list of jobs to stop. This has an observable effect in logs as we no longer see the DefaultLifecycle logging that a start and stop hook was called for the job. To avoid reducing useful logging the queuedJob now logs when a job is started and when a job is stopped.

gandro

I'm not familiar with the internals, so apologies if I misunderstood something. But I've found being able to see the status of completed jobs in the sysdump or the dbg shell extremely useful. In particular, since agent logs often end up being truncated.

So I think unconditionally removing completed jobs (especially if they error-ed out) is a step backwards for troubleshooting. I would prefer it if jobs were cleaned up only after a certain time (and for some static jobs, maybe never?)

gandro

I'm not intimately familiar with the lifecycle infrastructure, so I can't really meaningfully review the last commit, but the rest looks good to me

joamaki · 2026-05-26T07:17:49Z

@dylandreimerink could you review the last commit?

To support huge amounts of jobs that start and stop during the runtime close the job health scopes instead of marking them stopped to avoid accumulating an ever growing number of health entries. Signed-off-by: Jussi Maki <jussi@isovalent.com>

joamaki · 2026-05-26T09:19:55Z

-			defer close(qj.done)
+			defer func() {
+				qj.registry.runtimeLifecycle.remove(qj)
+				qj.registry.logger.Info("Job stopped",


Is Info level too much here or should we stick with Debug?

For jobs added before starting this does lead to double-logging:

logger.go:256: time=2026-05-26T11:26:48.077+02:00 level=INFO source=/home/jussi/go/src/github.com/cilium/hive/job/job.go:214 msg="Job stopped" module=jobs job=g1-third duration=21.11289ms logger.go:256: time=2026-05-26T11:26:48.077+02:00 level=INFO source=/home/jussi/go/src/github.com/cilium/hive/cell/lifecycle.go:198 msg="Stop hook executed" function="*job.queuedJob.Stop (jobs)" detail=g1-third duration=10.209µs

On the one hand yes, users probably don't care. But if you make it debug you have to enable debug logging and this info will not be in most sysdumps we get. So how frequently do we expect issue with it?

Hmm maybe let's keep the info logging but only do it for jobs added during runtime and rely on the lifecycle logging otherwise? EDIT: done. looks like this now:

logger.go:256: time=2026-05-26T11:43:17.584+02:00 level=INFO source=/home/jussi/go/src/github.com/cilium/hive/cell/lifecycle.go:158 msg="Start hook executed" function="*job.queuedJob.Start (jobs)" detail=g1-third duration=1.667µs logger.go:256: time=2026-05-26T11:43:17.584+02:00 level=INFO source=/home/jussi/go/src/github.com/cilium/hive/hive.go:407 msg="Started hive" duration=151.387µs logger.go:256: time=2026-05-26T11:43:17.595+02:00 level=INFO source=/home/jussi/go/src/github.com/cilium/hive/job/job.go:226 msg="Job started" module=jobs job=g1-dynamic ... logger.go:256: time=2026-05-26T11:43:17.605+02:00 level=INFO source=/home/jussi/go/src/github.com/cilium/hive/hive.go:421 msg="Stopping hive" logger.go:256: time=2026-05-26T11:43:17.605+02:00 level=INFO source=/home/jussi/go/src/github.com/cilium/hive/job/job.go:217 msg="Job stopped" module=jobs job=g1-dynamic duration=10.543427ms logger.go:256: time=2026-05-26T11:43:17.605+02:00 level=INFO source=/home/jussi/go/src/github.com/cilium/hive/cell/lifecycle.go:198 msg="Stop hook executed" function="*job.queuedJob.Stop (jobs)" detail=g1-third duration=18.996µs

It is a little bit confusing of course, but probably not a big deal. The duration for "Job stopped" is the whole duration of the job and in the stop hook that's of course the duration of the stop hook execution. Not sure what log fields would be more obvious to use here...

The runtime lifecycle of the job registry is being stopped too late which leads to jobs started at runtime to being stopped after all the other stop hooks. This may lead to a dependency of a job being stopped before the job. To fix this add support for "pre-stop" hooks that are called before other stop hooks. Signed-off-by: Jussi Maki <jussi@isovalent.com>

To support large numbers of jobs started and stopped at runtime we need to avoid accumulating state related to them. Earlier commit already fixed the accumulating of health entries by closing the health scope instead of marking it stopped. We were additionally appending each "queuedJob" to the "dynamicLC" and never removing from it. This introduces a new job-specific lifecycle (jobLifecycle) that maintains a linked-list of jobs started at runtime with support for removing from this list when a job stops before the application stops. Signed-off-by: Jussi Maki <jussi@isovalent.com>

joamaki requested a review from a team as a code owner May 21, 2026 09:59

joamaki requested review from derailed and removed request for a team May 21, 2026 09:59

joamaki force-pushed the pr/joamaki/hive-job-lifecycle-fixes branch 2 times, most recently from 60fe93b to bbcc817 Compare May 21, 2026 10:36

gandro reviewed May 21, 2026

View reviewed changes

Comment thread job/oneshot.go

joamaki mentioned this pull request May 21, 2026

health: Add health/history command cilium/cilium#46102

Merged

joamaki requested a review from gandro May 25, 2026 12:59

gandro approved these changes May 26, 2026

View reviewed changes

joamaki requested a review from dylandreimerink May 26, 2026 07:17

joamaki force-pushed the pr/joamaki/hive-job-lifecycle-fixes branch from bbcc817 to 22c7cf8 Compare May 26, 2026 09:14

job: Close the job health reporters

712b6cb

To support huge amounts of jobs that start and stop during the runtime close the job health scopes instead of marking them stopped to avoid accumulating an ever growing number of health entries. Signed-off-by: Jussi Maki <jussi@isovalent.com>

joamaki commented May 26, 2026

View reviewed changes

Comment thread job/observer.go

joamaki commented May 26, 2026

View reviewed changes

dylandreimerink approved these changes May 26, 2026

View reviewed changes

joamaki force-pushed the pr/joamaki/hive-job-lifecycle-fixes branch from 22c7cf8 to 4dedbfa Compare May 26, 2026 09:42

joamaki force-pushed the pr/joamaki/hive-job-lifecycle-fixes branch from 4dedbfa to ff7aa60 Compare May 26, 2026 10:02

joamaki merged commit ba9edde into main May 26, 2026
1 check passed

joamaki deleted the pr/joamaki/hive-job-lifecycle-fixes branch May 26, 2026 10:04

joamaki mentioned this pull request Jun 9, 2026

Bump to Hive v1.0.4-rc2 and fix job stop ordering issues cilium/cilium#46206

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

job: Drop state of stopped jobs and fix runtime job stop order#72

job: Drop state of stopped jobs and fix runtime job stop order#72
joamaki merged 3 commits into
mainfrom
pr/joamaki/hive-job-lifecycle-fixes

joamaki commented May 21, 2026 •

edited

Loading

Uh oh!

gandro left a comment

Uh oh!

Uh oh!

gandro left a comment

Uh oh!

joamaki commented May 26, 2026

Uh oh!

Uh oh!

joamaki May 26, 2026 •

edited

Loading

Uh oh!

dylandreimerink May 26, 2026

Uh oh!

joamaki May 26, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

joamaki commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gandro left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

gandro left a comment

Choose a reason for hiding this comment

Uh oh!

joamaki commented May 26, 2026

Uh oh!

Uh oh!

joamaki May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dylandreimerink May 26, 2026

Choose a reason for hiding this comment

Uh oh!

joamaki May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

joamaki commented May 21, 2026 •

edited

Loading

joamaki May 26, 2026 •

edited

Loading

joamaki May 26, 2026 •

edited

Loading