Adjust the minimum reschedule delay for failed allocations #28185

Description

@arodd

When a task process exits or its health check fails, we attempt to restart the task up to the configured maximum number of restart attempts. Once those attempts are exhausted, the reschedule block takes over and reschedules the entire allocation onto a new client node. The lower limit for the reschedule delay is currently 5 seconds, and the delay backs off (increases) for subsequent attempts according to the configured delay function. https://developer.hashicorp.com/nomad/docs/job-specification/reschedule#delay

When rescheduling attempts are unlimited and the task has an underlying issue that prevents it from starting even on a different client, the delay/delay_function settings in the reschedule block let you avoid thrashing through rapid, repeated rescheduling attempts. In certain scenarios, however, the 5 second minimum adds an unwanted delay to the initial reschedule.

With additional documentation calling out the risks of runaway rescheduling events, we should allow adjusting this minimum lower than 5 seconds to reduce the recovery window for tasks that would otherwise start successfully on new clients.

Currently this is controlled via an internal constant:

https://github.com/hashicorp/nomad/blob/v2.0.3/nomad/structs/structs.go#L6606
https://github.com/hashicorp/nomad/blob/v2.0.3/nomad/structs/structs.go#L6671
https://github.com/hashicorp/nomad/blob/v2.0.3/nomad/structs/structs.go#L4732

We should consider allowing this to be set as low as 1 second while including warnings around unbounded rescheduling attempts.