Nomad client panic in `Allocation.Canonicalize` after short-lived periodic batch allocation completes

### Nomad version

Observed on: Nomad v2.0.2, Nomad v2.0.3. Upgrading from 2.0.2 to 2.0.3 did not fix the issue.

### Operating system and Environment details

Ubuntu 24.04 LTS.

Cluster:
- 3 Nomad servers, all voters
- 3 Nomad clients
- One affected node is both server and client
- Docker task driver
- Consul integration enabled

The affected client runs many Docker allocations and has high allocation churn.

### Issue

The Nomad client process panics with a nil pointer dereference in the allocation watch path.

The panic happens shortly after a short-lived periodic batch allocation completes successfully and is locally marked for client GC. It is not a task failure: the task exits with code 0.

The stack points to:
```text
nomad/structs.(*Allocation).Canonicalize
client.(*Client).watchAllocations
```

From reading the v2.0.3 source, the relevant code path appears to be:
```text
client/client.go: Client.watchAllocations
nomad/alloc_endpoint.go: Alloc.GetAllocs
nomad/structs/alloc.go: Allocation.Canonicalize
```

My suspicion is that `Alloc.GetAllocs` can return a response slice containing a nil allocation entry in a race where one requested allocation is gone or has no embedded Job, while another requested allocation satisfies the blocking query index. The client then calls `Canonicalize()` without a nil guard and panics.
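
To illustrate why a nil entry would be fatal, here is a minimal sketch using simplified stand-in types (`Allocation`, `Job`, and `canonicalizeAll` are hypothetical and are not the real Nomad code). It only shows that a method which reads a field through a nil pointer receiver panics with the same runtime error as in the trace:

```go
package main

import "fmt"

// Job and Allocation are simplified stand-ins, not the real nomad/structs types.
type Job struct {
	Name string
}

type Allocation struct {
	ID  string
	Job *Job
}

// Canonicalize mimics a method that reads a field through its receiver
// without first checking whether the receiver itself is nil.
func (a *Allocation) Canonicalize() {
	if a.Job == nil { // dereferences a; panics when a is nil
		a.Job = &Job{}
	}
}

// canonicalizeAll calls Canonicalize on every entry. It recovers any panic
// so the demo can report it instead of crashing.
func canonicalizeAll(allocs []*Allocation) (panicked bool, msg string) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
			msg = fmt.Sprint(r)
		}
	}()
	for _, a := range allocs {
		a.Canonicalize()
	}
	return false, ""
}

func main() {
	clean := []*Allocation{{ID: "a1", Job: &Job{Name: "web"}}}
	withNil := []*Allocation{{ID: "a1", Job: &Job{Name: "web"}}, nil}

	fmt.Println(canonicalizeAll(clean))
	fmt.Println(canonicalizeAll(withNil))
}
```

In this stand-in, an allocation with a nil `Job` is harmless; only a nil slice entry triggers the panic. Whether the real `Canonicalize` behaves the same way for a missing Job is something I have not verified.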

Important observations:
- The same job can complete many times without panic.
- Manual/on-demand execution away from the timer boundary did not reproduce the crash.
- The crash is seen only around periodic timer executions on a busy client.
- This suggests a race, not a deterministic failure of the job itself.

So the condition probably looks like:
1. Some stale/missing/terminal allocation ID exists in the client/server allocation-watch path.
2. `notification-service-delete-data` creates a short-lived periodic allocation.
3. It finishes and is marked for GC.
4. Another allocation update wakes `Client.watchAllocations`.
5. Client asks `Alloc.GetAllocs` for a batch of allocation IDs.
6. Response contains a nil/missing allocation entry.
7. Client calls `Allocation.Canonicalize` and panics.
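
The sequence above can be modelled with a small simulation. Everything in it is an assumption made for illustration: `store`, `getAllocs`, and the allocation IDs are hypothetical, and I have not confirmed that the real `Alloc.GetAllocs` fills missing IDs with nil. The sketch only shows how a batch lookup could return a nil slot while a different allocation still advances the index:

```go
package main

import "fmt"

// Allocation is a simplified stand-in for the real struct.
type Allocation struct {
	ID          string
	ModifyIndex uint64
}

// store is a hypothetical in-memory stand-in for the server state store.
type store map[string]*Allocation

// getAllocs models the suspected behaviour: every requested ID gets a slot
// in the reply, and an ID that is no longer in the store yields a nil slot.
// It also returns the highest ModifyIndex seen among the entries found.
func getAllocs(s store, ids []string) ([]*Allocation, uint64) {
	out := make([]*Allocation, 0, len(ids))
	var maxIndex uint64
	for _, id := range ids {
		a := s[id] // nil when the ID is gone
		if a != nil && a.ModifyIndex > maxIndex {
			maxIndex = a.ModifyIndex
		}
		out = append(out, a)
	}
	return out, maxIndex
}

// countNil reports how many entries in the reply are nil.
func countNil(allocs []*Allocation) int {
	n := 0
	for _, a := range allocs {
		if a == nil {
			n++
		}
	}
	return n
}

func main() {
	s := store{
		"periodic-alloc": {ID: "periodic-alloc", ModifyIndex: 100},
		"other-alloc":    {ID: "other-alloc", ModifyIndex: 90},
	}
	// The client decided earlier which IDs to fetch.
	ids := []string{"periodic-alloc", "other-alloc"}

	// Steps 3-4: the periodic allocation disappears, another one is updated.
	delete(s, "periodic-alloc")
	s["other-alloc"].ModifyIndex = 120

	// Steps 5-6: the batch fetch succeeds on index but contains a nil slot.
	allocs, index := getAllocs(s, ids)
	fmt.Printf("index=%d entries=%d nil=%d\n", index, len(allocs), countNil(allocs))
	// prints "index=120 entries=2 nil=1"
}
```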

### Reproduction steps

I do not yet have a minimal standalone reproducer, but this is reproducible in my cluster.

The job is a periodic batch job. It creates a short-lived Docker task which exits successfully.

Simplified shape:

```hcl
job "notification-service-daily" {
  type = "batch"

  periodic {
    cron             = "0 0 * * * * *"
    prohibit_overlap = true
  }

  group "notification-service-delete-data" {
    task "notification-service-delete-data" {
      driver = "docker"

      config {
        image      = "..."
        entrypoint = ["php", "artisan", "data-delete"]
      }

      restart {
        attempts = 0
        mode     = "fail"
      }

      template {
        destination = "local/.env"
        change_mode = "noop"
      }

      template {
        destination = "local/svctl-healthcheck.sh"
        change_mode = "restart"
      }
    }
  }
}
```

On the affected client the task usually goes through these events:
- Received
- Task Setup
- Started
- Terminated: Exit Code 0
- not restarting task: Policy allows no restarts
- client.gc: marking allocation for GC

Then, in some timer windows, Nomad panics 30-60 seconds later.

### Expected Result

A successful short-lived periodic batch allocation should complete and be garbage-collected without crashing the Nomad client.

Even if an allocation disappears between `Node.GetClientAllocs` and `Alloc.GetAllocs`, the client should retry, ignore the missing allocation, or return an error, not panic.
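
As a rough illustration of the "ignore" and "return an error" options, here is a sketch with a stand-in `Allocation` type. The helpers `dropNil` and `validate` are hypothetical names, not existing Nomad functions, and the retry option is not shown:

```go
package main

import (
	"errors"
	"fmt"
)

// Allocation is a simplified stand-in for the real struct.
type Allocation struct {
	ID string
}

// errNilAlloc is a hypothetical sentinel error for a nil entry in a reply.
var errNilAlloc = errors.New("allocation response contained a nil entry")

// dropNil implements the "ignore" option: nil entries are skipped and the
// remaining allocations are kept in their original order.
func dropNil(allocs []*Allocation) []*Allocation {
	kept := make([]*Allocation, 0, len(allocs))
	for _, a := range allocs {
		if a != nil {
			kept = append(kept, a)
		}
	}
	return kept
}

// validate implements the "return an error" option: the first nil entry
// produces an error that the caller can log before retrying the query.
func validate(allocs []*Allocation) error {
	for i, a := range allocs {
		if a == nil {
			return fmt.Errorf("entry %d: %w", i, errNilAlloc)
		}
	}
	return nil
}

func main() {
	resp := []*Allocation{{ID: "a1"}, nil, {ID: "a3"}}

	fmt.Println("kept:", len(dropNil(resp)))
	if err := validate(resp); errors.Is(err, errNilAlloc) {
		fmt.Println("error:", err)
	}
}
```

Either guard would run before any per-allocation method is called, so a transient nil entry would cost one watch iteration rather than the whole agent.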

### Actual Result

The Nomad client panics:

```text
panic: runtime error: invalid memory address or nil pointer dereference

github.com/hashicorp/nomad/nomad/structs.(*Allocation).Canonicalize
github.com/hashicorp/nomad/client.(*Client).watchAllocations
created by github.com/hashicorp/nomad/client.(*Client).run
```

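The systemd service exits with `status=2/INVALIDARGUMENT`. Exit code 2 is what the Go runtime uses for an unrecovered panic; `INVALIDARGUMENT` is only systemd's generic label for that exit code.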

Because this node is also a Nomad server and runs many services, the crash causes service deregistration / DNS disruption until the agent restarts and re-syncs services.

### Job file (if appropriate)

The affected job is a periodic batch job. The task is a Docker task that runs a command and exits successfully.

Relevant properties:
- Job type: batch
- Periodic: enabled
- ProhibitOverlap: true
- Task exits quickly with exit code 0
- Restart policy: attempts=0, mode=fail
- Docker driver
- Templates rendered into local/.env and local/svctl-healthcheck.sh

I can provide the full sanitized job spec if needed.

### Nomad Server logs (if appropriate)

The affected process is both server and client. Around the crashes, the service exits with panic:

```text
Jun 24 10:01:09 App1 nomad[...] panic: runtime error: invalid memory address or nil pointer dereference
Jun 24 10:01:10 App1 systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

Jun 24 11:01:09 App1 nomad[...] panic: runtime error: invalid memory address or nil pointer dereference
Jun 24 11:01:10 App1 systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

Jun 24 13:00:48 App1 nomad[...] panic: runtime error: invalid memory address or nil pointer dereference
Jun 24 13:00:48 App1 systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
```

### Nomad Client logs (if appropriate)

Crash case 1:

```text
2026-06-24T10:00:00.248+0300 [INFO] client.alloc_runner.task_runner:
  Task event: alloc_id=6cb542ab-e4af-1e5c-9191-95cf15076825
  task=notification-service-delete-data
  type=Received
  msg="Task received by client"

2026-06-24T10:00:04.831+0300 [INFO] client.alloc_runner.task_runner:
  Task event: alloc_id=6cb542ab-e4af-1e5c-9191-95cf15076825
  task=notification-service-delete-data
  type=Started
  msg="Task started by client"

2026-06-24T10:00:11.187+0300 [INFO] client.alloc_runner.task_runner:
  Task event: alloc_id=6cb542ab-e4af-1e5c-9191-95cf15076825
  task=notification-service-delete-data
  type=Terminated
  msg="Exit Code: 0"

2026-06-24T10:00:11.242+0300 [INFO] client.alloc_runner.task_runner:
  not restarting task:
  alloc_id=6cb542ab-e4af-1e5c-9191-95cf15076825
  task=notification-service-delete-data
  reason="Policy allows no restarts"

2026-06-24T10:00:11.268+0300 [INFO] client.gc:
  marking allocation for GC:
  alloc_id=6cb542ab-e4af-1e5c-9191-95cf15076825

2026-06-24T10:01:09+0300
  panic in Allocation.Canonicalize from Client.watchAllocations
```

Crash case 2:

```text
2026-06-24T11:00:00.180+0300 [INFO] client.alloc_runner.task_runner:
  Task event: alloc_id=734033aa-2b08-c669-99c3-f5ca0f3f7968
  task=notification-service-delete-data
  type=Received
  msg="Task received by client"

2026-06-24T11:00:05.329+0300 [INFO] client.alloc_runner.task_runner:
  Task event: alloc_id=734033aa-2b08-c669-99c3-f5ca0f3f7968
  task=notification-service-delete-data
  type=Started
  msg="Task started by client"

2026-06-24T11:00:12.886+0300 [INFO] client.alloc_runner.task_runner:
  Task event: alloc_id=734033aa-2b08-c669-99c3-f5ca0f3f7968
  task=notification-service-delete-data
  type=Terminated
  msg="Exit Code: 0"

2026-06-24T11:00:12.958+0300 [INFO] client.gc:
  marking allocation for GC:
  alloc_id=734033aa-2b08-c669-99c3-f5ca0f3f7968

2026-06-24T11:01:09+0300
  panic in Allocation.Canonicalize from Client.watchAllocations
```

Crash case 3, after upgrading to Nomad 2.0.3:

```text
2026-06-24T13:00:00.237+0300 [INFO] client.alloc_runner.task_runner:
  Task event: alloc_id=3fed2b0c-f8d5-f8fb-56fd-88821f85dbaa
  task=notification-service-delete-data
  type=Received
  msg="Task received by client"

2026-06-24T13:00:04.889+0300 [INFO] client.alloc_runner.task_runner:
  Task event: alloc_id=3fed2b0c-f8d5-f8fb-56fd-88821f85dbaa
  task=notification-service-delete-data
  type=Started
  msg="Task started by client"

2026-06-24T13:00:09.643+0300 [INFO] client.alloc_runner.task_runner:
  Task event: alloc_id=3fed2b0c-f8d5-f8fb-56fd-88821f85dbaa
  task=notification-service-delete-data
  type=Terminated
  msg="Exit Code: 0"

2026-06-24T13:00:09.706+0300 [INFO] client.gc:
  marking allocation for GC:
  alloc_id=3fed2b0c-f8d5-f8fb-56fd-88821f85dbaa

2026-06-24T13:00:48+0300
  panic in Allocation.Canonicalize from Client.watchAllocations
```
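
For reference, the gap between the GC mark and the panic in the three crash cases can be computed from the timestamps above. This is a small helper sketch (`gapMillis` is a hypothetical name); the panic timestamps only have one-second precision, so the results are approximate:

```go
package main

import (
	"fmt"
	"time"
)

const (
	// Layout of the client.gc log lines (millisecond precision).
	gcLayout = "2006-01-02T15:04:05.000-0700"
	// Layout of the panic lines (second precision).
	panicLayout = "2006-01-02T15:04:05-0700"
)

// gapMillis returns the time between the GC mark and the panic in milliseconds.
func gapMillis(gcMark, panicAt string) (int64, error) {
	g, err := time.Parse(gcLayout, gcMark)
	if err != nil {
		return 0, err
	}
	p, err := time.Parse(panicLayout, panicAt)
	if err != nil {
		return 0, err
	}
	return p.Sub(g).Milliseconds(), nil
}

func main() {
	// Timestamps copied from crash cases 1-3 in this report.
	cases := []struct{ gcMark, panicAt string }{
		{"2026-06-24T10:00:11.268+0300", "2026-06-24T10:01:09+0300"},
		{"2026-06-24T11:00:12.958+0300", "2026-06-24T11:01:09+0300"},
		{"2026-06-24T13:00:09.706+0300", "2026-06-24T13:00:48+0300"},
	}
	for i, c := range cases {
		ms, err := gapMillis(c.gcMark, c.panicAt)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("crash case %d: %.1fs between GC mark and panic\n", i+1, float64(ms)/1000)
	}
}
```

This gives roughly 57.7 s, 56.0 s, and 38.3 s, which matches the 30-60 second window described under Reproduction steps.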

Counter-example: the same task completed many other times without panic, so this looks like a race:
```text
2026-06-23 01:00:05 notification-service-delete-data terminated and was GC-marked; no panic
2026-06-23 02:00:04 notification-service-delete-data terminated and was GC-marked; no panic
2026-06-24 01:00:04 notification-service-delete-data terminated and was GC-marked; no panic
2026-06-24 02:00:04 notification-service-delete-data terminated and was GC-marked; no panic
2026-06-24 03:00:04 notification-service-delete-data terminated and was GC-marked; no panic
```

Additional notes:
- Running the same job manually/on-demand away from the timer window did not reproduce the panic.
- I do not run an externally scheduled `nomad system gc`.
- I checked current allocations after the crash: all current App1 allocations returned HTTP 200 and had embedded Job objects, so I do not see persistent state corruption after the fact.
- The problematic state may exist only transiently inside the allocation watch / allocation fetch path.
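
The embedded-Job check mentioned above can be expressed as a small helper. This is a sketch, assuming the input is the JSON body returned by the allocation read endpoint (`/v1/allocation/<alloc_id>`); `hasEmbeddedJob` and the sample payloads are made up for illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// allocView decodes only the fields this check needs; all other fields in
// the payload are ignored.
type allocView struct {
	ID  string
	Job map[string]any
}

// hasEmbeddedJob reports whether the allocation JSON carries a non-null Job
// object. A missing or null "Job" key leaves the map nil.
func hasEmbeddedJob(body []byte) (bool, error) {
	var a allocView
	if err := json.Unmarshal(body, &a); err != nil {
		return false, err
	}
	return a.Job != nil, nil
}

func main() {
	// Made-up sample payloads, not captured from the affected cluster.
	samples := []string{
		`{"ID": "example-1", "Job": {"ID": "notification-service-daily"}}`,
		`{"ID": "example-2", "Job": null}`,
	}
	for _, s := range samples {
		ok, err := hasEmbeddedJob([]byte(s))
		fmt.Println(ok, err)
	}
}
```

A check like this only sees state that persists after the crash, so it cannot rule out a transient nil entry during the watch.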

Nomad client panic in `Allocation.Canonicalize` after short-lived periodic batch allocation completes #28172