bgpd: cancel LLGR stale timer on peer AF delete#21947
Conversation
There was a problem hiding this comment.
Pull request overview
Fixes a use-after-free in bgpd where the LLGR stale timer (peer->t_llgr_stale[afi][safi]) is armed with a struct peer_af * callback argument that can be freed by peer_af_delete() before the timer fires. The fix cancels the pending timer when the peer AF is deleted.
Changes:
- Cancel
peer->t_llgr_stale[afi][safi]inpeer_af_delete()to prevent dereferencing a freedpeer_af. - Add a new topotest reproducing the crash scenario by deactivating an AF while the LLGR stale timer is pending.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| bgpd/bgpd.c | Cancels the pending LLGR stale timer in peer_af_delete() to prevent stale callback argument use. |
| tests/topotests/bgp_llgr_stale_timer_af_delete/test_bgp_llgr_stale_timer_af_delete.py | New topotest that arms the LLGR stale timer, deletes the peer AF, and verifies bgpd survives. |
| tests/topotests/bgp_llgr_stale_timer_af_delete/r1/frr.conf | R1 (restarting peer) BGP config with GR and LLGR stale time. |
| tests/topotests/bgp_llgr_stale_timer_af_delete/r2/frr.conf | R2 (helper peer) BGP config with GR and LLGR stale time. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Greptile SummaryThis PR fixes a use-after-free crash in
Confidence Score: 4/5The one-line C fix is minimal, targeted, and correct; the only gap is the missing The C change is a single tests/topotests/bgp_llgr_stale_timer_af_delete/ — needs an empty Important Files Changed
Sequence DiagramsequenceDiagram
participant R1 as R1 bgpd
participant R2 as R2 bgpd (helper)
participant Timer as LLGR Stale Timer
participant PAF as peer_af (IPv4)
R1->>R2: BGP session established
R2->>PAF: peer_af_create(IPv4 unicast)
R1-->>R2: R1 bgpd stops (crash/kill)
R2->>Timer: event_add_timer(bgp_llgr_stale_timer_expire, paf)
Note over Timer,PAF: timer callback arg = struct peer_af*
alt BEFORE fix (bug)
R2->>PAF: no neighbor activate → peer_af_delete() → XFREE(paf)
Note over PAF: peer_af freed, timer still pending
Timer-->>PAF: timer fires, dereferences freed paf → CRASH
else AFTER fix (this PR)
R2->>PAF: no neighbor activate → peer_af_delete()
R2->>Timer: "event_cancel(&peer->t_llgr_stale[afi][safi])"
R2->>PAF: XFREE(paf)
Note over Timer: timer cancelled, never fires with stale pointer
end
Prompt To Fix All With AIFix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
tests/topotests/bgp_llgr_stale_timer_af_delete/test_bgp_llgr_stale_timer_af_delete.py:1
**Missing `__init__.py` in new topotest directory**
Every other topotest directory in the FRR tree includes an `__init__.py` (e.g. `bgp_evpn_gr`, `bgp_llgr`, `bgp_addpath_llgr`, etc.). Without it, FRR's test-runner infrastructure may fail to discover this test as part of the package, causing it to be silently skipped rather than executed.
Reviews (1): Last reviewed commit: "bgpd: cancel LLGR stale timer on peer AF..." | Re-trigger Greptile |
| @@ -0,0 +1,189 @@ | |||
| #!/usr/bin/env python | |||
There was a problem hiding this comment.
Missing
__init__.py in new topotest directory
Every other topotest directory in the FRR tree includes an __init__.py (e.g. bgp_evpn_gr, bgp_llgr, bgp_addpath_llgr, etc.). Without it, FRR's test-runner infrastructure may fail to discover this test as part of the package, causing it to be silently skipped rather than executed.
Prompt To Fix With AI
This is a comment left during a code review.
Path: tests/topotests/bgp_llgr_stale_timer_af_delete/test_bgp_llgr_stale_timer_af_delete.py
Line: 1
Comment:
**Missing `__init__.py` in new topotest directory**
Every other topotest directory in the FRR tree includes an `__init__.py` (e.g. `bgp_evpn_gr`, `bgp_llgr`, `bgp_addpath_llgr`, etc.). Without it, FRR's test-runner infrastructure may fail to discover this test as part of the package, causing it to be silently skipped rather than executed.
How can I resolve this? If you propose a fix, please make it concise.When BGP GR helper mode arms an LLGR stale timer, the event callback argument is the struct peer_af for that AFI/SAFI. Deactivating the AF with no neighbor ... activate frees the peer_af in peer_af_delete(), but the stale timer could remain queued and later dereference the freed callback argument. Cancel peer->t_llgr_stale[afi][safi] while deleting the peer AF so no stale LLGR callback can run after the peer_af lifetime ends. Add a topotest that arms the timer, removes the AF before expiry, and verifies bgpd survives past the original timer deadline. Fixes FRRouting#21939 Signed-off-by: Z-Yivon <652025330042@smail.nju.edu.cn>
f01f0d2 to
b3ff6ed
Compare
|
@Mergifyio backport stable/10.6 stable/10.5 stable/10.4 stable/10.3 |
🟠 Waiting for conditions to matchDetails
|
Fixes #21939
This PR fixes a stale timer lifetime bug in
bgpdLLGR helper handling.When a peer enters Graceful Restart helper mode with Long-lived Graceful Restart
enabled,
bgp_graceful_restart_timer_expire()starts an LLGR stale timer forthe affected AFI/SAFI. That timer is stored in
peer->t_llgr_stale[afi][safi], but its callback argument is the correspondingstruct peer_af *.If the same address family is deactivated before the LLGR stale timer expires,
peer_af_delete()frees thatstruct peer_af. Without cancelling the pendingLLGR stale timer, the timer can later fire with a stale callback argument and
bgpdcan crash.Root Cause
The LLGR stale timer is armed with a
struct peer_af *callback argument:The timer callback dereferences that pointer:
However, address-family deactivation can delete the same
peer_afwhile thetimer is still pending:
That leaves
peer->t_llgr_stale[afi][safi]holding a callback argument whoselifetime has already ended.
Fix
Cancel the LLGR stale timer while deleting the corresponding peer AF:
bgp_soft_reconfig_table_task_cancel(bgp, bgp->rib[afi][safi], peer); bgp_stop_announce_route_timer(af); +event_cancel(&peer->t_llgr_stale[afi][safi]); if (PAF_SUBGRP(af)) {