Nomad version
Observed on: Nomad v2.0.2, Nomad v2.0.3. Upgrade from 2.0.2 to 2.0.3 did not fix the issue.
Operating system and Environment details
Ubuntu 24.04 LTS.
Cluster:
- 3 Nomad servers, all voters
- 3 Nomad clients
- One affected node is both server and client
- Docker task driver
- Consul integration enabled
The affected client runs many Docker allocations and has high allocation churn.
Issue
The Nomad client process panics with a nil pointer dereference in the allocation watch path.
The panic happens shortly after a short-lived periodic batch allocation completes successfully and is locally marked for client GC. It is not a task failure: the task exits with code 0.
The stack points to:
nomad/structs.(*Allocation).Canonicalize
client.(*Client).watchAllocations
From reading the v2.0.3 source, the relevant code path appears to be:
client/client.go: Client.watchAllocations
nomad/alloc_endpoint.go: Alloc.GetAllocs
nomad/structs/alloc.go: Allocation.Canonicalize
My suspicion is that Alloc.GetAllocs can return a response slice containing a nil allocation entry in a race where one requested allocation is gone or has no embedded Job, while another requested allocation satisfies the blocking query index. The client then calls Canonicalize() without a nil guard and panics.
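A minimal sketch of the suspected failure mode, assuming the response is a slice of allocation pointers that can contain a nil entry. The Job and Allocation types, the Canonicalize body, and the canonicalizeAll helper are simplified stand-ins written for illustration, not Nomad's actual code:

```go
package main

import "fmt"

// Job and Allocation are hypothetical stand-ins, not Nomad's real structs.
type Job struct{ Name string }

type Allocation struct {
	ID  string
	Job *Job
}

// Canonicalize mimics a method that reads receiver fields without
// first checking whether the receiver itself is nil.
func (a *Allocation) Canonicalize() {
	if a.Job == nil { // reading a.Job panics when a is nil
		return
	}
	if a.Job.Name == "" {
		a.Job.Name = "unnamed"
	}
}

// canonicalizeAll calls Canonicalize on every entry and reports whether
// the loop panicked. With guard set, nil entries are skipped.
func canonicalizeAll(allocs []*Allocation, guard bool) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	for _, a := range allocs {
		if guard && a == nil {
			continue
		}
		a.Canonicalize()
	}
	return false
}

func main() {
	// One valid allocation and one nil entry, as in the suspected race.
	resp := []*Allocation{{ID: "a1", Job: &Job{Name: "x"}}, nil}
	fmt.Println("unguarded panicked:", canonicalizeAll(resp, false))
	fmt.Println("guarded panicked:", canonicalizeAll(resp, true))
}
```

In this sketch the unguarded loop panics on the nil entry and the guarded loop does not. Whether the real response can actually contain a nil entry is still only my suspicion.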
Important observations:
- The same job can complete many times without panic.
- Manual/on-demand execution away from the timer boundary did not reproduce the crash.
- The crash is seen only around periodic timer executions on a busy client.
- This suggests a race, not deterministic failure of the job itself.
So the sequence probably looks like this:
- Some stale/missing/terminal allocation ID exists in the client/server allocation-watch path.
- notification-service-delete-data creates a short-lived periodic allocation.
- It finishes and is marked for GC.
- Another allocation update wakes Client.watchAllocations.
- Client asks Alloc.GetAllocs for a batch of allocation IDs.
- Response contains a nil/missing allocation entry.
- Client calls Allocation.Canonicalize and panics.
Reproduction steps
I do not yet have a minimal standalone reproducer, but this is reproducible in my cluster.
The job is a periodic batch job. It creates a short-lived Docker task which exits successfully.
Simplified shape:
job "notification-service-daily" {
  type = "batch"

  periodic {
    cron = "0 0 * * * * *"
    prohibit_overlap = true
  }

  group "notification-service-delete-data" {
    task "notification-service-delete-data" {
      driver = "docker"

      config {
        image = "..."
        entrypoint = ["php", "artisan", "data-delete"]
      }

      restart {
        attempts = 0
        mode = "fail"
      }

      template {
        destination = "local/.env"
        change_mode = "noop"
      }

      template {
        destination = "local/svctl-healthcheck.sh"
        change_mode = "restart"
      }
    }
  }
}
On the affected client the task usually goes through these events:
- Received
- Task Setup
- Started
- Terminated: Exit Code 0
- not restarting task: Policy allows no restarts
- client.gc: marking allocation for GC
Then, in some timer windows, Nomad panics 30-60 seconds later.
Expected Result
A successful short-lived periodic batch allocation should complete and be garbage-collected without crashing the Nomad client.
Even if an allocation disappears between Node.GetClientAllocs and Alloc.GetAllocs, the client should retry, ignore the missing allocation, or return an error, not panic.
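One possible shape for the "ignore the missing allocation" or "retry" option, as a sketch only. The Allocation type and the splitResponse helper are hypothetical and not taken from the Nomad source. The idea is to drop nil entries and report which requested IDs did not come back, so the caller can decide whether to retry or skip them:

```go
package main

import "fmt"

// Allocation is a hypothetical stand-in, not Nomad's real struct.
type Allocation struct{ ID string }

// splitResponse drops nil entries from resp and returns the IDs that were
// requested but are absent from the response.
func splitResponse(requested []string, resp []*Allocation) (found []*Allocation, missing []string) {
	seen := make(map[string]bool, len(resp))
	for _, a := range resp {
		if a == nil {
			continue // tolerate a nil entry instead of dereferencing it
		}
		seen[a.ID] = true
		found = append(found, a)
	}
	for _, id := range requested {
		if !seen[id] {
			missing = append(missing, id)
		}
	}
	return found, missing
}

func main() {
	requested := []string{"a1", "a2", "a3"}
	resp := []*Allocation{{ID: "a1"}, nil, {ID: "a3"}}

	found, missing := splitResponse(requested, resp)
	fmt.Println("found:", len(found), "missing:", missing)
}
```

Matching by ID rather than by position avoids assuming that the response preserves the order of the requested IDs.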
Actual Result
The Nomad client panics:
panic: runtime error: invalid memory address or nil pointer dereference
github.com/hashicorp/nomad/nomad/structs.(*Allocation).Canonicalize
github.com/hashicorp/nomad/client.(*Client).watchAllocations
created by github.com/hashicorp/nomad/client.(*Client).run
The systemd service exits with: status=2/INVALIDARGUMENT
Because this node is also a Nomad server and runs many services, the crash causes service deregistration / DNS disruption until the agent restarts and re-syncs services.
Job file (if appropriate)
The affected job is a periodic batch job. The task is a Docker task that runs a command and exits successfully.
Relevant properties:
- Job type: batch
- Periodic: enabled
- ProhibitOverlap: true
- Task exits quickly with exit code 0
- Restart policy: attempts=0, mode=fail
- Docker driver
- Templates rendered into local/.env and local/svctl-healthcheck.sh
I can provide the full sanitized job spec if needed.
Nomad Server logs (if appropriate)
The affected process is both server and client. Around the crashes, the service exits with a panic:
Jun 24 10:01:09 App1 nomad[...] panic: runtime error: invalid memory address or nil pointer dereference
Jun 24 10:01:10 App1 systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jun 24 11:01:09 App1 nomad[...] panic: runtime error: invalid memory address or nil pointer dereference
Jun 24 11:01:10 App1 systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jun 24 13:00:48 App1 nomad[...] panic: runtime error: invalid memory address or nil pointer dereference
Jun 24 13:00:48 App1 systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Nomad Client logs (if appropriate)
Crash case 1:
2026-06-24T10:00:00.248+0300 [INFO] client.alloc_runner.task_runner:
Task event: alloc_id=6cb542ab-e4af-1e5c-9191-95cf15076825
task=notification-service-delete-data
type=Received
msg="Task received by client"
2026-06-24T10:00:04.831+0300 [INFO] client.alloc_runner.task_runner:
Task event: alloc_id=6cb542ab-e4af-1e5c-9191-95cf15076825
task=notification-service-delete-data
type=Started
msg="Task started by client"
2026-06-24T10:00:11.187+0300 [INFO] client.alloc_runner.task_runner:
Task event: alloc_id=6cb542ab-e4af-1e5c-9191-95cf15076825
task=notification-service-delete-data
type=Terminated
msg="Exit Code: 0"
2026-06-24T10:00:11.242+0300 [INFO] client.alloc_runner.task_runner:
not restarting task:
alloc_id=6cb542ab-e4af-1e5c-9191-95cf15076825
task=notification-service-delete-data
reason="Policy allows no restarts"
2026-06-24T10:00:11.268+0300 [INFO] client.gc:
marking allocation for GC:
alloc_id=6cb542ab-e4af-1e5c-9191-95cf15076825
2026-06-24T10:01:09+0300
panic in Allocation.Canonicalize from Client.watchAllocations
Crash case 2:
2026-06-24T11:00:00.180+0300 [INFO] client.alloc_runner.task_runner:
Task event: alloc_id=734033aa-2b08-c669-99c3-f5ca0f3f7968
task=notification-service-delete-data
type=Received
msg="Task received by client"
2026-06-24T11:00:05.329+0300 [INFO] client.alloc_runner.task_runner:
Task event: alloc_id=734033aa-2b08-c669-99c3-f5ca0f3f7968
task=notification-service-delete-data
type=Started
msg="Task started by client"
2026-06-24T11:00:12.886+0300 [INFO] client.alloc_runner.task_runner:
Task event: alloc_id=734033aa-2b08-c669-99c3-f5ca0f3f7968
task=notification-service-delete-data
type=Terminated
msg="Exit Code: 0"
2026-06-24T11:00:12.958+0300 [INFO] client.gc:
marking allocation for GC:
alloc_id=734033aa-2b08-c669-99c3-f5ca0f3f7968
2026-06-24T11:01:09+0300
panic in Allocation.Canonicalize from Client.watchAllocations
Crash case 3, after upgrading to Nomad 2.0.3:
2026-06-24T13:00:00.237+0300 [INFO] client.alloc_runner.task_runner:
Task event: alloc_id=3fed2b0c-f8d5-f8fb-56fd-88821f85dbaa
task=notification-service-delete-data
type=Received
msg="Task received by client"
2026-06-24T13:00:04.889+0300 [INFO] client.alloc_runner.task_runner:
Task event: alloc_id=3fed2b0c-f8d5-f8fb-56fd-88821f85dbaa
task=notification-service-delete-data
type=Started
msg="Task started by client"
2026-06-24T13:00:09.643+0300 [INFO] client.alloc_runner.task_runner:
Task event: alloc_id=3fed2b0c-f8d5-f8fb-56fd-88821f85dbaa
task=notification-service-delete-data
type=Terminated
msg="Exit Code: 0"
2026-06-24T13:00:09.706+0300 [INFO] client.gc:
marking allocation for GC:
alloc_id=3fed2b0c-f8d5-f8fb-56fd-88821f85dbaa
2026-06-24T13:00:48+0300
panic in Allocation.Canonicalize from Client.watchAllocations
Counter-example: the same task completed many other times without panic, so this looks like a race:
2026-06-23 01:00:05 notification-service-delete-data terminated and was GC-marked; no panic
2026-06-23 02:00:04 notification-service-delete-data terminated and was GC-marked; no panic
2026-06-24 01:00:04 notification-service-delete-data terminated and was GC-marked; no panic
2026-06-24 02:00:04 notification-service-delete-data terminated and was GC-marked; no panic
2026-06-24 03:00:04 notification-service-delete-data terminated and was GC-marked; no panic
Additional notes:
- Running the same job manually/on-demand away from the timer window did not reproduce the panic.
- I do not run nomad system gc on an external schedule.
- I checked current allocations after the crash: all current App1 allocations returned HTTP 200 and had embedded Job objects, so I do not see persistent state corruption after the fact.
- The problematic state may exist only transiently inside the allocation watch / allocation fetch path.