Nomad client panic in Allocation.Canonicalize after short-lived periodic batch allocation completes #28172

Description

@Vanav

Nomad version

Observed on: Nomad v2.0.2, Nomad v2.0.3. Upgrade from 2.0.2 to 2.0.3 did not fix the issue.

Operating system and Environment details

Ubuntu 24.04 LTS.

Cluster:

  • 3 Nomad servers, all voters
  • 3 Nomad clients
  • One affected node is both server and client
  • Docker task driver
  • Consul integration enabled

The affected client runs many Docker allocations and has high allocation churn.

Issue

The Nomad client process panics with a nil pointer dereference in the allocation watch path.

The panic happens shortly after a short-lived periodic batch allocation completes successfully and is locally marked for client GC. It is not a task failure: the task exits with code 0.

The stack points to:

nomad/structs.(*Allocation).Canonicalize
client.(*Client).watchAllocations

From reading the v2.0.3 source, the relevant code path appears to be:

client/client.go: Client.watchAllocations
nomad/alloc_endpoint.go: Alloc.GetAllocs
nomad/structs/alloc.go: Allocation.Canonicalize

My suspicion is that Alloc.GetAllocs can return a response slice containing a nil allocation entry in a race where one requested allocation is gone or has no embedded Job, while another requested allocation satisfies the blocking query index. The client then calls Canonicalize() without a nil guard and panics.
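
To illustrate the suspected failure mode, here is a minimal sketch. The Job and Allocation types and the canonicalizeAll helper are simplified, hypothetical stand-ins, not Nomad's actual structs or client code. It only shows that calling a method that dereferences its receiver on a nil slice entry panics the same way:

```go
package main

import "fmt"

// Job and Allocation are simplified, hypothetical stand-ins for illustration.
// They are NOT the real nomad/structs types.
type Job struct{ Name string }

type Allocation struct {
	ID  string
	Job *Job
}

// Canonicalize reads a field of its receiver, so a nil *Allocation
// triggers a nil pointer dereference here.
func (a *Allocation) Canonicalize() {
	if a.Job == nil {
		a.Job = &Job{}
	}
}

// canonicalizeAll mimics a loop without a nil guard and reports whether
// it panicked, using recover so the demo itself keeps running.
func canonicalizeAll(allocs []*Allocation) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	for _, a := range allocs {
		a.Canonicalize()
	}
	return false
}

func main() {
	// Assumed response shape: one valid entry and one nil entry.
	resp := []*Allocation{{ID: "alloc-1", Job: &Job{Name: "example"}}, nil}
	fmt.Println("panicked:", canonicalizeAll(resp))
}
```

In the real client there is no recover around this path, so the panic would take down the whole agent process, which matches what I see.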

Important observations:

  • The same job can complete many times without panic.
  • Manual/on-demand execution away from the timer boundary did not reproduce the crash.
  • The crash is seen only around periodic timer executions on a busy client.
  • This suggests a race, not a deterministic failure of the job itself.

So the sequence probably looks like this:

  1. Some stale/missing/terminal allocation ID exists in the client/server allocation-watch path.
  2. notification-service-delete-data creates a short-lived periodic allocation.
  3. It finishes and is marked for GC.
  4. Another allocation update wakes Client.watchAllocations.
  5. Client asks Alloc.GetAllocs for a batch of allocation IDs.
  6. Response contains a nil/missing allocation entry.
  7. Client calls Allocation.Canonicalize and panics.

Reproduction steps

I do not yet have a minimal standalone reproducer, but this is reproducible in my cluster.

The job is a periodic batch job. It creates a short-lived Docker task which exits successfully.

Simplified shape:

job "notification-service-daily" {
  type = "batch"

  periodic {
    cron             = "0 0 * * * * *"
    prohibit_overlap = true
  }

  group "notification-service-delete-data" {
    task "notification-service-delete-data" {
      driver = "docker"

      config {
        image      = "..."
        entrypoint = ["php", "artisan", "data-delete"]
      }

      restart {
        attempts = 0
        mode     = "fail"
      }

      template {
        destination = "local/.env"
        change_mode = "noop"
      }

      template {
        destination = "local/svctl-healthcheck.sh"
        change_mode = "restart"
      }
    }
  }
}

On the affected client, the task usually goes through these events:

  • Received
  • Task Setup
  • Started
  • Terminated: Exit Code 0
  • not restarting task: Policy allows no restarts
  • client.gc: marking allocation for GC

Then, in some timer windows, Nomad panics 30-60 seconds later.

Expected Result

A successful short-lived periodic batch allocation should complete and be garbage-collected without crashing the Nomad client.

Even if an allocation disappears between Node.GetClientAllocs and Alloc.GetAllocs, the client should retry, ignore the missing allocation, or return an error, not panic.
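
A rough sketch of the kind of defensive handling I have in mind is below. It is not a proposed patch: the Allocation type and the safeCanonicalize helper are hypothetical, and I am assuming the client knows which allocation IDs it requested. Nil entries are skipped, and the IDs that did not come back are returned so the caller can retry or log them:

```go
package main

import "fmt"

// Allocation is a simplified, hypothetical stand-in, not the Nomad struct.
type Allocation struct {
	ID string
}

// Canonicalize touches the receiver, so it must never be called on nil.
func (a *Allocation) Canonicalize() { _ = a.ID }

// safeCanonicalize is a hypothetical helper: it skips nil entries instead
// of dereferencing them and reports which requested IDs were not returned.
func safeCanonicalize(requested []string, resp []*Allocation) (ok []*Allocation, missing []string) {
	seen := make(map[string]bool, len(resp))
	for _, a := range resp {
		if a == nil {
			continue // the nil guard that the panic suggests is absent
		}
		a.Canonicalize()
		seen[a.ID] = true
		ok = append(ok, a)
	}
	for _, id := range requested {
		if !seen[id] {
			missing = append(missing, id)
		}
	}
	return ok, missing
}

func main() {
	requested := []string{"alloc-1", "alloc-2"}
	resp := []*Allocation{{ID: "alloc-1"}, nil} // assumed racy response
	ok, missing := safeCanonicalize(requested, resp)
	fmt.Println("usable allocations:", len(ok))
	fmt.Println("missing, retry or ignore:", missing)
}
```

Whether the right behavior is to retry, ignore, or surface an error is a design decision for the maintainers. The point is only that any of the three is preferable to a process-wide panic.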

Actual Result

The Nomad client panics:

panic: runtime error: invalid memory address or nil pointer dereference

github.com/hashicorp/nomad/nomad/structs.(*Allocation).Canonicalize
github.com/hashicorp/nomad/client.(*Client).watchAllocations
created by github.com/hashicorp/nomad/client.(*Client).run

The systemd service exits with: status=2/INVALIDARGUMENT

Because this node is also a Nomad server and runs many services, the crash causes service deregistration / DNS disruption until the agent restarts and re-syncs services.

Job file (if appropriate)

The affected job is a periodic batch job. The task is a Docker task that runs a command and exits successfully.

Relevant properties:

  • Job type: batch
  • Periodic: enabled
  • ProhibitOverlap: true
  • Task exits quickly with exit code 0
  • Restart policy: attempts=0, mode=fail
  • Docker driver
  • Templates rendered into local/.env and local/svctl-healthcheck.sh

I can provide the full sanitized job spec if needed.

Nomad Server logs (if appropriate)

The affected process is both server and client. Around each crash, the service exits with a panic:

Jun 24 10:01:09 App1 nomad[...] panic: runtime error: invalid memory address or nil pointer dereference
Jun 24 10:01:10 App1 systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

Jun 24 11:01:09 App1 nomad[...] panic: runtime error: invalid memory address or nil pointer dereference
Jun 24 11:01:10 App1 systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

Jun 24 13:00:48 App1 nomad[...] panic: runtime error: invalid memory address or nil pointer dereference
Jun 24 13:00:48 App1 systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

Nomad Client logs (if appropriate)

Crash case 1:

2026-06-24T10:00:00.248+0300 [INFO] client.alloc_runner.task_runner:
  Task event: alloc_id=6cb542ab-e4af-1e5c-9191-95cf15076825
  task=notification-service-delete-data
  type=Received
  msg="Task received by client"

2026-06-24T10:00:04.831+0300 [INFO] client.alloc_runner.task_runner:
  Task event: alloc_id=6cb542ab-e4af-1e5c-9191-95cf15076825
  task=notification-service-delete-data
  type=Started
  msg="Task started by client"

2026-06-24T10:00:11.187+0300 [INFO] client.alloc_runner.task_runner:
  Task event: alloc_id=6cb542ab-e4af-1e5c-9191-95cf15076825
  task=notification-service-delete-data
  type=Terminated
  msg="Exit Code: 0"

2026-06-24T10:00:11.242+0300 [INFO] client.alloc_runner.task_runner:
  not restarting task:
  alloc_id=6cb542ab-e4af-1e5c-9191-95cf15076825
  task=notification-service-delete-data
  reason="Policy allows no restarts"

2026-06-24T10:00:11.268+0300 [INFO] client.gc:
  marking allocation for GC:
  alloc_id=6cb542ab-e4af-1e5c-9191-95cf15076825

2026-06-24T10:01:09+0300
  panic in Allocation.Canonicalize from Client.watchAllocations

Crash case 2:

2026-06-24T11:00:00.180+0300 [INFO] client.alloc_runner.task_runner:
  Task event: alloc_id=734033aa-2b08-c669-99c3-f5ca0f3f7968
  task=notification-service-delete-data
  type=Received
  msg="Task received by client"

2026-06-24T11:00:05.329+0300 [INFO] client.alloc_runner.task_runner:
  Task event: alloc_id=734033aa-2b08-c669-99c3-f5ca0f3f7968
  task=notification-service-delete-data
  type=Started
  msg="Task started by client"

2026-06-24T11:00:12.886+0300 [INFO] client.alloc_runner.task_runner:
  Task event: alloc_id=734033aa-2b08-c669-99c3-f5ca0f3f7968
  task=notification-service-delete-data
  type=Terminated
  msg="Exit Code: 0"

2026-06-24T11:00:12.958+0300 [INFO] client.gc:
  marking allocation for GC:
  alloc_id=734033aa-2b08-c669-99c3-f5ca0f3f7968

2026-06-24T11:01:09+0300
  panic in Allocation.Canonicalize from Client.watchAllocations

Crash case 3, after upgrading to Nomad 2.0.3:

2026-06-24T13:00:00.237+0300 [INFO] client.alloc_runner.task_runner:
  Task event: alloc_id=3fed2b0c-f8d5-f8fb-56fd-88821f85dbaa
  task=notification-service-delete-data
  type=Received
  msg="Task received by client"

2026-06-24T13:00:04.889+0300 [INFO] client.alloc_runner.task_runner:
  Task event: alloc_id=3fed2b0c-f8d5-f8fb-56fd-88821f85dbaa
  task=notification-service-delete-data
  type=Started
  msg="Task started by client"

2026-06-24T13:00:09.643+0300 [INFO] client.alloc_runner.task_runner:
  Task event: alloc_id=3fed2b0c-f8d5-f8fb-56fd-88821f85dbaa
  task=notification-service-delete-data
  type=Terminated
  msg="Exit Code: 0"

2026-06-24T13:00:09.706+0300 [INFO] client.gc:
  marking allocation for GC:
  alloc_id=3fed2b0c-f8d5-f8fb-56fd-88821f85dbaa

2026-06-24T13:00:48+0300
  panic in Allocation.Canonicalize from Client.watchAllocations
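
For reference, the delay between the GC mark and the panic in the three crash cases can be computed from the timestamps above. This is just a small helper sketch. The gcToPanicDelay name is mine and has nothing to do with Nomad's code:

```go
package main

import (
	"fmt"
	"time"
)

// logLayout matches the timestamps in the log excerpts. When parsing, Go
// accepts a fractional-seconds field in the input even if the layout has none.
const logLayout = "2006-01-02T15:04:05-0700"

// gcToPanicDelay is a hypothetical helper that returns the time elapsed
// between the "marking allocation for GC" line and the panic.
func gcToPanicDelay(gcMarked, panicked string) (time.Duration, error) {
	start, err := time.Parse(logLayout, gcMarked)
	if err != nil {
		return 0, err
	}
	end, err := time.Parse(logLayout, panicked)
	if err != nil {
		return 0, err
	}
	return end.Sub(start), nil
}

func main() {
	// Timestamps copied from crash cases 1 to 3 in this report.
	cases := [][2]string{
		{"2026-06-24T10:00:11.268+0300", "2026-06-24T10:01:09+0300"},
		{"2026-06-24T11:00:12.958+0300", "2026-06-24T11:01:09+0300"},
		{"2026-06-24T13:00:09.706+0300", "2026-06-24T13:00:48+0300"},
	}
	for i, c := range cases {
		d, err := gcToPanicDelay(c[0], c[1])
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("crash case %d: %v between GC mark and panic\n", i+1, d)
	}
}
```

The delays come out at roughly 58 s, 56 s, and 38 s. The panic timestamps only have one-second resolution, so these figures are approximate.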

Counter-example: the same task completed many other times without panic, so this looks like a race:

2026-06-23 01:00:05 notification-service-delete-data terminated and was GC-marked; no panic
2026-06-23 02:00:04 notification-service-delete-data terminated and was GC-marked; no panic
2026-06-24 01:00:04 notification-service-delete-data terminated and was GC-marked; no panic
2026-06-24 02:00:04 notification-service-delete-data terminated and was GC-marked; no panic
2026-06-24 03:00:04 notification-service-delete-data terminated and was GC-marked; no panic

Additional notes:

  • Running the same job manually/on-demand away from the timer window did not reproduce the panic.
  • I do not run nomad system gc on an external schedule.
  • I checked current allocations after the crash: all current App1 allocations returned HTTP 200 and had embedded Job objects, so I do not see persistent state corruption after the fact.
  • The problematic state may exist only transiently inside the allocation watch / allocation fetch path.
