Skip to content

Flaky integration test - race condition in background-job-lifecycle #510

@zikajk

Description

@zikajk

CI failure is pre-existing and unrelated to this PR

The background-job-lifecycle integration test failure is caused by a race condition in src/eca/features/background_tasks.clj — nothing touched by this PR.

Root cause: kill-job! calls kill-job-impl (sends SIGTERM) and then atomically sets :status :killed. The monitor-process-exit! daemon thread can race ahead: it dereferences the now-exited process, gets exit code 143, and calls update-job-on-exit before the kill status has been committed — overwriting it with :failed instead.

Fix: swap the order in kill-job! — set :status :killed in the registry before calling kill-job-impl, so the monitor thread always sees :killed and the guard in update-job-on-exit correctly skips the overwrite:

(defn kill-job! [job-id]
  (if-let [job (get-job job-id)]
    (if (= :running (:status job))
      (do
        ;; Commit :killed status before destroying the process so that
        ;; monitor-process-exit! never overwrites it with :failed (exit 143).
        (swap! registry* (fn [reg]
                           (-> reg
                               (assoc-in [:jobs job-id :status] :killed)
                               (assoc-in [:jobs job-id :ended-at] (Instant/now)))))
        (kill-job-impl job)
        (logger/info logger-tag "Killed background job" {:id job-id})
        true)
      ...)))

The existing unit test "does not overwrite :killed status" in background_tasks_test.clj already covers this scenario.

Originally posted by @itkonen in #508 (comment)

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions