bgpd: cancel LLGR stale timer on peer AF delete #21947

Open
Z-Yivon wants to merge 1 commit into FRRouting:master from Z-Yivon:bgpd-llgr-stale-timer-af-delete

Conversation

@Z-Yivon Z-Yivon commented May 14, 2026

Fixes #21939

This PR fixes a stale timer lifetime bug in bgpd LLGR helper handling.

When a peer enters Graceful Restart helper mode with Long-lived Graceful Restart
enabled, bgp_graceful_restart_timer_expire() starts an LLGR stale timer for
the affected AFI/SAFI. That timer is stored in
peer->t_llgr_stale[afi][safi], but its callback argument is the corresponding
struct peer_af *.

If the same address family is deactivated before the LLGR stale timer expires,
peer_af_delete() frees that struct peer_af. Unless the pending LLGR stale timer
is cancelled first, it can later fire with a dangling callback argument, a
use-after-free that can crash bgpd.

Root Cause

The LLGR stale timer is armed with a struct peer_af * callback argument:

/* bgpd/bgp_fsm.c:bgp_graceful_restart_timer_expire() */
paf = peer_af_find(peer, afi, safi);
if (!paf)
	continue;

event_add_timer(bm->master, bgp_llgr_stale_timer_expire,
		paf, peer->llgr[afi][safi].stale_time,
		&peer->t_llgr_stale[afi][safi]);

The timer callback dereferences that pointer:

/* bgpd/bgp_fsm.c:bgp_llgr_stale_timer_expire() */
paf = EVENT_ARG(event);

peer = paf->peer;
afi = paf->afi;
safi = paf->safi;

However, address-family deactivation can delete the same peer_af while the
timer is still pending:

/* bgpd/bgpd.c:peer_af_delete() */
peer->peer_af_array[afid] = NULL;
XFREE(MTYPE_BGP_PEER_AF, af);

That leaves peer->t_llgr_stale[afi][safi] holding a callback argument whose
lifetime has already ended.
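
The ownership mismatch described above can be sketched in isolation. This is a minimal model, not FRR code: every `toy_*` name is hypothetical, and the timer is reduced to a pending flag plus a borrowed pointer. It only assumes what the description states, namely that the timer lives in the peer while its argument is a separately freed per-AF object.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; none of these are FRR types. */
struct toy_af {
	int afi;
	int safi;
};

struct toy_timer {
	struct toy_af *arg; /* borrowed: the timer does not own this object */
	bool pending;
};

struct toy_peer {
	struct toy_af *af;      /* owned by the peer, freed on AF delete */
	struct toy_timer stale; /* lives in the peer, not in af */
};

/* Arm the timer with the per-AF object as its callback argument. */
static void toy_arm(struct toy_peer *peer)
{
	peer->stale.arg = peer->af;
	peer->stale.pending = true;
}

/* Models the pre-fix delete path: af is freed, the timer stays armed. */
static void toy_af_delete_unfixed(struct toy_peer *peer)
{
	free(peer->af);
	peer->af = NULL;
}

/* A pending timer whose argument has no live owner is the hazard. */
static bool toy_timer_is_dangling(const struct toy_peer *peer)
{
	return peer->stale.pending && peer->af == NULL;
}

/* Returns true when the unfixed delete leaves a dangling timer behind. */
bool toy_demo_unfixed_delete(void)
{
	struct toy_peer peer = { 0 };

	peer.af = calloc(1, sizeof(*peer.af));
	if (!peer.af)
		abort();
	toy_arm(&peer);
	toy_af_delete_unfixed(&peer);
	/* Never dereference peer.stale.arg here: that would be the bug itself. */
	return toy_timer_is_dangling(&peer);
}
```

The sketch detects the hazard through the owner pointer rather than by touching the freed object, so it stays well-defined while still showing why the timer and its argument need a shared teardown point.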

Fix

Cancel the LLGR stale timer while deleting the corresponding peer AF:

 bgp_soft_reconfig_table_task_cancel(bgp, bgp->rib[afi][safi], peer);

 bgp_stop_announce_route_timer(af);
+event_cancel(&peer->t_llgr_stale[afi][safi]);

 if (PAF_SUBGRP(af)) {
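
The cancel-before-free ordering can be checked with a small simulated clock. The following is a sketch under stated assumptions, not FRR's event API: all `sim_*` names are hypothetical, time is an integer number of seconds, and the 10-second deadline is only chosen to echo the stale time used in the test configs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model for illustration; not FRR's event API. */
struct sim_af {
	int afi;
	int safi;
};

struct sim_timer {
	struct sim_af *arg; /* borrowed callback argument */
	unsigned deadline;  /* simulated expiry time, in seconds */
	bool pending;
};

struct sim_peer {
	struct sim_af *af;      /* owned by the peer, freed on AF delete */
	struct sim_timer stale; /* stored in the peer, so it outlives af */
	unsigned fired;         /* how many times the callback ran */
};

/* Disarm the timer; harmless when nothing is pending. */
static void sim_timer_cancel(struct sim_timer *t)
{
	t->pending = false;
	t->arg = NULL;
}

/* Fixed delete path: cancel the timer first, then free its argument. */
static void sim_af_delete(struct sim_peer *peer)
{
	sim_timer_cancel(&peer->stale);
	free(peer->af);
	peer->af = NULL;
}

/* Advance the simulated clock and run the callback if it is due. */
static void sim_run_until(struct sim_peer *peer, unsigned now)
{
	if (!peer->stale.pending || now < peer->stale.deadline)
		return;
	peer->stale.pending = false;
	/* The callback dereferences its argument, so it must still be live. */
	if (peer->stale.arg->afi >= 0)
		peer->fired++;
}

/* Arm at t=0 for 10 s, optionally delete the AF at t=3, run to t=11. */
unsigned sim_demo(bool delete_af)
{
	struct sim_peer peer = { 0 };

	peer.af = calloc(1, sizeof(*peer.af));
	if (!peer.af)
		abort();
	peer.stale.arg = peer.af;
	peer.stale.deadline = 10;
	peer.stale.pending = true;

	sim_run_until(&peer, 3); /* not due yet */
	if (delete_af)
		sim_af_delete(&peer); /* AF removed while the timer is pending */
	sim_run_until(&peer, 11); /* past the original deadline */
	sim_af_delete(&peer); /* cleanup; repeating is safe since free(NULL) is defined */
	return peer.fired;
}
```

In this model `sim_demo(true)` reports zero callback runs and `sim_demo(false)` reports one, which is the same shape as the regression test's check: remove the AF before expiry, then wait past the original deadline.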

Copilot AI review requested due to automatic review settings May 14, 2026 06:17
@frrbot frrbot Bot added the bgp label May 14, 2026

Copilot AI left a comment

Pull request overview

Fixes a use-after-free in bgpd where the LLGR stale timer (peer->t_llgr_stale[afi][safi]) is armed with a struct peer_af * callback argument that can be freed by peer_af_delete() before the timer fires. The fix cancels the pending timer when the peer AF is deleted.

Changes:

  • Cancel peer->t_llgr_stale[afi][safi] in peer_af_delete() to prevent dereferencing a freed peer_af.
  • Add a new topotest reproducing the crash scenario by deactivating an AF while the LLGR stale timer is pending.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| bgpd/bgpd.c | Cancels the pending LLGR stale timer in peer_af_delete() to prevent stale callback argument use. |
| tests/topotests/bgp_llgr_stale_timer_af_delete/test_bgp_llgr_stale_timer_af_delete.py | New topotest that arms the LLGR stale timer, deletes the peer AF, and verifies bgpd survives. |
| tests/topotests/bgp_llgr_stale_timer_af_delete/r1/frr.conf | R1 (restarting peer) BGP config with GR and LLGR stale time. |
| tests/topotests/bgp_llgr_stale_timer_af_delete/r2/frr.conf | R2 (helper peer) BGP config with GR and LLGR stale time. |

greptile-apps Bot commented May 14, 2026

Greptile Summary

This PR fixes a use-after-free crash in bgpd where the LLGR stale timer could fire with a freed struct peer_af * callback argument after the corresponding address family was deactivated via no neighbor ... activate.

  • bgpd/bgpd.c: Adds a single event_cancel(&peer->t_llgr_stale[afi][safi]) call inside peer_af_delete(), immediately after stopping the announce-route timer and before the peer_af struct is freed, ensuring the pending timer is disarmed before its callback argument's lifetime ends.
  • New topotest (bgp_llgr_stale_timer_af_delete): Regression test that arms the LLGR stale timer by killing R1's bgpd, then deactivates the IPv4 peer AF on R2, and verifies R2's bgpd remains alive past the original stale-timer deadline. The test directory is missing __init__.py, which is present in every other FRR topotest.

Confidence Score: 4/5

The one-line C fix is minimal, targeted, and correct; the only gap is the missing __init__.py in the new topotest directory.

The C change is a single event_cancel call that mirrors the pattern already used in bgp_clear_peer_on_afi_safi and bgp_fsm.c. The new topotest exercises exactly the failing scenario and is well-structured, but every other FRR topotest ships with __init__.py and this one does not, which could cause the regression test to be silently skipped by the test runner.

tests/topotests/bgp_llgr_stale_timer_af_delete/ — needs an empty __init__.py to match the convention of all other FRR topotest directories.

Important Files Changed

| Filename | Overview |
| --- | --- |
| bgpd/bgpd.c | Adds event_cancel(&peer->t_llgr_stale[afi][safi]) in peer_af_delete() before the peer_af is freed, correctly preventing use-after-free when the LLGR stale timer fires after AF deactivation. |
| tests/topotests/bgp_llgr_stale_timer_af_delete/test_bgp_llgr_stale_timer_af_delete.py | New regression topotest: establishes BGP, kills r1 bgpd to arm the LLGR stale timer, deactivates the peer AF on r2, and verifies bgpd survives past the original stale-timer deadline. Missing __init__.py may affect test discovery. |
| tests/topotests/bgp_llgr_stale_timer_af_delete/r1/frr.conf | R1 FRR config enabling GR and LLGR with a 10-second stale time and fast timers (1s/3s) to accelerate test execution. |
| tests/topotests/bgp_llgr_stale_timer_af_delete/r2/frr.conf | R2 FRR config mirroring the GR/LLGR setup; the helper router whose bgpd is monitored for the crash. |

Sequence Diagram

sequenceDiagram
    participant R1 as R1 bgpd
    participant R2 as R2 bgpd (helper)
    participant Timer as LLGR Stale Timer
    participant PAF as peer_af (IPv4)

    R1->>R2: BGP session established
    R2->>PAF: peer_af_create(IPv4 unicast)
    R1-->>R2: R1 bgpd stops (crash/kill)
    R2->>Timer: event_add_timer(bgp_llgr_stale_timer_expire, paf)
    Note over Timer,PAF: timer callback arg = struct peer_af*

    alt BEFORE fix (bug)
        R2->>PAF: no neighbor activate → peer_af_delete() → XFREE(paf)
        Note over PAF: peer_af freed, timer still pending
        Timer-->>PAF: timer fires, dereferences freed paf → CRASH
    else AFTER fix (this PR)
        R2->>PAF: no neighbor activate → peer_af_delete()
        R2->>Timer: "event_cancel(&peer->t_llgr_stale[afi][safi])"
        R2->>PAF: XFREE(paf)
        Note over Timer: timer cancelled, never fires with stale pointer
    end
Reviews (1): Last reviewed commit: "bgpd: cancel LLGR stale timer on peer AF..."

@@ -0,0 +1,189 @@
#!/usr/bin/env python

P2 Missing __init__.py in new topotest directory

Every other topotest directory in the FRR tree includes an __init__.py (e.g. bgp_evpn_gr, bgp_llgr, bgp_addpath_llgr, etc.). Without it, FRR's test-runner infrastructure may fail to discover this test as part of the package, causing it to be silently skipped rather than executed.

Member

Please fix this, @Z-Yivon.

When BGP GR helper mode arms an LLGR stale timer, the event callback argument is the struct peer_af for that AFI/SAFI. Deactivating the AF with no neighbor ... activate frees the peer_af in peer_af_delete(), but the stale timer could remain queued and later dereference the freed callback argument.

Cancel peer->t_llgr_stale[afi][safi] while deleting the peer AF so no stale LLGR callback can run after the peer_af lifetime ends. Add a topotest that arms the timer, removes the AF before expiry, and verifies bgpd survives past the original timer deadline.

Fixes FRRouting#21939

Signed-off-by: Z-Yivon <652025330042@smail.nju.edu.cn>
@Z-Yivon Z-Yivon force-pushed the bgpd-llgr-stale-timer-af-delete branch from f01f0d2 to b3ff6ed Compare May 14, 2026 06:22
@ton31337
Member

@Mergifyio backport stable/10.6 stable/10.5 stable/10.4 stable/10.3

mergify Bot commented May 15, 2026

backport stable/10.6 stable/10.5 stable/10.4 stable/10.3

🟠 Waiting for conditions to match

Details
  • merged [📌 backport requirement]

Development

Successfully merging this pull request may close these issues.

bgpd: stale LLGR timer can crash after address-family deletion

3 participants