Retry interleaving: a failed retry job can remove_node the node a reconciled job just claimed #21

Description

@simons-plugins

Surfaced during the PR #20 review (timeout/reconcile analysis).

Scenario

  1. Job A starts commissioning and times out, ending in commissioning_timeout. It deliberately does not call remove_node (per commission_jobs.py:383-398).
  2. The user retries with the same or a new setup code, so job B starts.
  3. The original node finally joins and node_added fires. A's reconcile claims the node (reconcile_node_added, commission_jobs.py:226) and creates Indigo devices.
  4. If job B has meanwhile obtained a node_id from its own commission_with_code and then fails a later step (descriptor read, device create), B's _fail (commission_jobs.py:409-415) calls remove_node(node_id). That can tear a live, just-claimed node (with its Indigo devices) off the fabric.

Whether B's node_id can equal A's node depends on matter-server's behaviour when re-commissioning an already-joining device, but nothing in the job table prevents the destructive overlap.
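
To make the ordering concrete, here is a toy model of the interleaving. It is only a sketch: `Job`, `FakeFabric`, `fail_unguarded`, and the node_id value 7 are hypothetical stand-ins, not the real code in commission_jobs.py, and it simply assumes that B ends up holding the same node_id as A, which is the open question above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Job:
    """Hypothetical stand-in for a row in the job table."""
    name: str
    state: str = "running"
    node_id: Optional[int] = None


@dataclass
class FakeFabric:
    """Hypothetical stand-in for matter-server: live nodes plus a removal log."""
    nodes: set = field(default_factory=set)
    removed: list = field(default_factory=list)

    def remove_node(self, node_id: int) -> None:
        self.nodes.discard(node_id)
        self.removed.append(node_id)


def fail_unguarded(job: Job, fabric: FakeFabric) -> None:
    """Models the behaviour described for _fail: always remove the job's node."""
    job.state = "failed"
    if job.node_id is not None:
        fabric.remove_node(job.node_id)


def run_interleaving() -> tuple:
    fabric = FakeFabric()
    job_a = Job("A")
    job_b = Job("B")

    # Step 1: A times out; no remove_node is issued.
    job_a.state = "commissioning_timeout"

    # Step 3: the original node joins late and A's reconcile claims it.
    fabric.nodes.add(7)
    job_a.node_id = 7
    job_a.state = "reconciled"

    # Step 4: ASSUMPTION, B's commission_with_code returned the same node_id,
    # and B then fails a later step.
    job_b.node_id = 7
    fail_unguarded(job_b, fabric)

    return job_a, job_b, fabric
```

In this model job A still reports `reconciled` after the run, while node 7 is gone from `fabric.nodes`. That is the destructive overlap a characterization test would need to pin.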

Suggested work

  • At minimum, add a characterization test that pins the current behaviour for this interleaving.
  • Possible fix: _fail should skip remove_node if any other job (terminal-success or reconciled) owns that node_id, or if Indigo devices already exist for it (device_sync lookup).
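
One way the possible fix could look, as a rough sketch rather than a patch: a pure predicate that _fail could consult before calling remove_node. The function name `should_remove_node`, the job-table shape (a mapping of job id to a dict with `state` and `node_id`), the state names in `OWNING_STATES`, and the `has_indigo_devices` callback standing in for the device_sync lookup are all assumptions made for illustration.

```python
from typing import Callable, Mapping, Optional

# ASSUMPTION: hypothetical names for the "terminal-success or reconciled" states.
OWNING_STATES = frozenset({"success", "reconciled"})


def should_remove_node(
    failing_job_id: str,
    node_id: Optional[int],
    jobs: Mapping[str, Mapping[str, object]],
    has_indigo_devices: Callable[[int], bool],
) -> bool:
    """Return True only if removing node_id cannot hurt another owner."""
    if node_id is None:
        # The failing job never obtained a node, so there is nothing to remove.
        return False

    for job_id, job in jobs.items():
        if job_id == failing_job_id:
            continue
        # Another job that succeeded or reconciled already owns this node.
        if job.get("node_id") == node_id and job.get("state") in OWNING_STATES:
            return False

    # Indigo devices already exist for the node (hypothetical device_sync lookup).
    if has_indigo_devices(node_id):
        return False

    return True
```

Keeping the check as a side-effect-free function would let the characterization test and the guard share the same job-table fixtures. Whether the check and the remove_node call also need to happen under a common lock depends on how jobs are scheduled, which this issue does not establish.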

Origin: analysis in #20

🤖 Generated with Claude Code

Labels

bug (Something isn't working)
