fix: QBFT by iamquang95 · Pull Request #417 · NethermindEth/pluto

iamquang95 · 2026-05-15T05:59:50Z

This PR

Hardens the QBFT tests and brings them to parity with Charon; previously, some parameters were misconfigured.
Adds fuzz tests for QBFT that match Charon's.
Adds more unit tests that help make the QBFT implementation stricter.
Adds more chain-split tests.
Fixes the existing QBFT implementation so that it passes the tests.

iamquang95 · 2026-05-18T03:43:48Z

@claude review the changes

claude · 2026-05-18T03:44:02Z

Claude finished @iamquang95's task in 7m 6s

Claude PR Review — fix: QBFT

Read PR diff and understand the scope of changes
Review crates/core/src/qbft/mod.rs (core QBFT logic changes)
Review crates/core/src/qbft/fake_clock.rs (test clock changes)
Review crates/core/src/qbft/internal_test.rs (tests/fuzz)
Cross-check with Charon Go source for parity
Post review feedback

Summary

Solid hardening PR. Three real fixes (FakeClock Drop, thread::scope deadlock in compare, more correct value-source preservation), plus much stronger test coverage (compare semantics, chain-split, fuzzing, idle cancellation). I have no blocking concerns. A handful of questions and suggestions below — most are nits or things worth documenting.

Strengths

Drop for FakeClock removal is a real bug fix. Previously, because FakeClock is Clone with shared Arc<Mutex<…>>, Drop::drop ran on every clone going out of scope and indiscriminately cancelled all timers via inner.clients.clear(). Closures or threads that captured a clone would silently kill production timers in tests. Tests now always do explicit clock.cancel() at termination/panic sites, which is the right shape.
compare() now uses thread::spawn + a child CancellationTokenSource instead of thread::scope. The previous thread::scope blocked the entire run loop until the callback returned — so a buggy/blocking Compare could deadlock the algorithm even when the round timer fired. The new design lets the timer arm win and propagates cancellation to the callback via compare_cts. Good parity with Charon's context.WithCancel(ctx) pattern. (Tested directly in compare_timeout_does_not_wait_for_blocked_callback and compare_parent_cancel_cancels_callback_token.)
compare() drain-on-err is more correct than Charon. When compare_err_rx fires with Ok(()), the new code drains any pending compare_value_rx first (crates/core/src/qbft/mod.rs:660-662). Charon's select is non-deterministic and can return the old inputValueSource when the goroutine wrote both channels before the parent picked. The new fixture compare_success_error_cached_value_source_and_timeout at internal_test.rs:1442-1461 exercises this exact race. Worth a one-line code comment noting this is a deliberate divergence so the next reader doesn't "fix" it back.
panic!("bug: expected only …") → Err(QbftError::UnexpectedCompareError) (mod.rs:515). Returning an error instead of panicking is the right call inside a hot loop running under thread::scope. Good.
New fixtures fill real gaps: idle_run_returns_when_cancelled, the chain-split halt case (zz_chain_split_no_consensus_halt), the fuzzer tests, and the compare_* matrix all match Charon's test intent.
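
To illustrate the first point, here is a minimal sketch of why `Drop` on a `Clone` type with shared state is hazardous. `SharedClock` and `DroppingClock` are hypothetical stand-ins written for this example, not Pluto's `FakeClock`; the only assumption carried over from the review is that clones share one `Arc<Mutex<…>>`.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a clonable test clock; not Pluto's FakeClock.
#[derive(Clone)]
struct SharedClock {
    timers: Arc<Mutex<Vec<u64>>>,
}

impl SharedClock {
    fn new() -> Self {
        Self { timers: Arc::new(Mutex::new(Vec::new())) }
    }

    fn add_timer(&self, id: u64) {
        self.timers.lock().unwrap().push(id);
    }

    fn timer_count(&self) -> usize {
        self.timers.lock().unwrap().len()
    }

    /// Explicit cleanup, meant to be called once at a termination site.
    fn cancel(&self) {
        self.timers.lock().unwrap().clear();
    }
}

/// Same shared state, but with the problematic `Drop` that clears it.
#[derive(Clone)]
struct DroppingClock {
    inner: SharedClock,
}

impl Drop for DroppingClock {
    fn drop(&mut self) {
        // Runs for every clone, not only for the last one.
        self.inner.cancel();
    }
}

/// Returns (timers left with Drop-based cleanup, timers left with explicit cleanup)
/// after a single clone has gone out of scope.
fn timers_left_after_clone_drop() -> (usize, usize) {
    let dropping = DroppingClock { inner: SharedClock::new() };
    dropping.inner.add_timer(1);
    {
        let _clone = dropping.clone(); // dropped at the end of this block
    }
    let with_drop = dropping.inner.timer_count();

    let explicit = SharedClock::new();
    explicit.add_timer(1);
    {
        let _clone = explicit.clone();
    }
    let without_drop = explicit.timer_count();

    (with_drop, without_drop)
}
```

With the `Drop` impl, the timer registered through the original handle disappears as soon as any clone is dropped; without it, the timer survives until someone calls `cancel()` explicitly.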

Issues & Questions

1. `compare()` busy-cancels after `ct.is_canceled()` (minor)

crates/core/src/qbft/mod.rs:687-691 — once the parent token is observed cancelled, the loop calls compare_cts.cancel() on every 1 ms tick until the callback finally drains. Idempotent but wasteful and a little noisy in profiling. Consider tracking a bool so cancel runs once, and dropping the default-arm to mpmc::never()-equivalent behavior afterwards. Not a correctness issue.
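
A minimal sketch of the one-shot idea, assuming a polling loop on a 1 ms tick. `CountingToken` and `wait_with_one_shot_cancel` are hypothetical names for this example, and `std::sync::mpsc` stands in for whatever channel type the real code uses. The sketch only guards the `cancel` call with a flag; it does not switch the wait to a never-firing arm afterwards.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::mpsc::{self, RecvTimeoutError};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Hypothetical child token that counts how often `cancel` is called.
#[derive(Default)]
struct CountingToken {
    cancelled: AtomicBool,
    cancel_calls: AtomicUsize,
}

impl CountingToken {
    fn cancel(&self) {
        self.cancel_calls.fetch_add(1, Ordering::SeqCst);
        self.cancelled.store(true, Ordering::SeqCst);
    }
}

/// Polls `done_rx` on a 1 ms tick and forwards parent cancellation to the child once.
fn wait_with_one_shot_cancel(
    parent_cancelled: &AtomicBool,
    child: &CountingToken,
    done_rx: &mpsc::Receiver<()>,
) {
    let mut forwarded = false;
    loop {
        match done_rx.recv_timeout(Duration::from_millis(1)) {
            Ok(()) | Err(RecvTimeoutError::Disconnected) => return,
            Err(RecvTimeoutError::Timeout) => {
                if !forwarded && parent_cancelled.load(Ordering::SeqCst) {
                    child.cancel();
                    forwarded = true; // later ticks skip the cancel call
                }
            }
        }
    }
}

/// Runs the loop against a callback that needs ~20 ms to finish after cancellation.
fn cancel_calls_for_slow_callback() -> usize {
    let parent = AtomicBool::new(true); // parent is already cancelled
    let child = Arc::new(CountingToken::default());
    let (done_tx, done_rx) = mpsc::channel();

    let worker_child = Arc::clone(&child);
    let worker = thread::spawn(move || {
        while !worker_child.cancelled.load(Ordering::SeqCst) {
            thread::sleep(Duration::from_millis(1));
        }
        thread::sleep(Duration::from_millis(20)); // slow to drain
        let _ = done_tx.send(());
    });

    wait_with_one_shot_cancel(&parent, &child, &done_rx);
    worker.join().unwrap();
    child.cancel_calls.load(Ordering::SeqCst)
}
```

Even though the loop keeps ticking while the callback drains, the child token sees a single `cancel` call.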

2. `qbft::run` returns `Ok(())` on cancellation, Charon returns `ctx.Err()` (parity question)

mod.rs:591-595 — the cancellation path simply breaks, then Ok(()). Charon's equivalent is case <-ctx.Done(): return ctx.Err() (charon/core/qbft/qbft.go:428). idle_run_returns_when_cancelled codifies the Pluto behaviour. Is the divergence intentional? Callers downstream might rely on observing the cancellation. If intentional, a one-line comment at the break would help.
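
For comparison, a sketch of what surfacing the cancellation to the caller could look like. `RunError` and `run_until_decided` are hypothetical; Pluto's `QbftError` variants and run-loop signature may differ, and treating every received message as a decided value is a simplification for the example.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{self, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq)]
enum RunError {
    Cancelled,
    InputClosed,
}

/// Waits for a decided value, checking the cancellation flag on every iteration.
fn run_until_decided(
    cancelled: &AtomicBool,
    rx: &mpsc::Receiver<u32>,
) -> Result<u32, RunError> {
    loop {
        if cancelled.load(Ordering::SeqCst) {
            // Analogous to Go's `return ctx.Err()`, instead of `break` followed by `Ok(())`.
            return Err(RunError::Cancelled);
        }
        match rx.recv_timeout(Duration::from_millis(1)) {
            Ok(value) => return Ok(value),
            Err(RecvTimeoutError::Timeout) => continue,
            Err(RecvTimeoutError::Disconnected) => return Err(RunError::InputClosed),
        }
    }
}
```

With this shape, a caller can tell "decided" apart from "stopped because the token was cancelled" by matching on the result.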

3. `RUN_CANCELLATION_POLL_INTERVAL` is used by both `run` and `compare` (naming)

mod.rs:36 — the name says "run" but the constant is the poll cadence in both functions. Either generalize the name (e.g. CANCELLATION_POLL_INTERVAL) or define a separate constant for compare.

4. `Definition::compare` is `Arc<…>` while every other callback is `Box<dyn Fn …>` (API asymmetry)

mod.rs:121 — the asymmetry is justified (compare must clone into a spawned thread), but it's a footgun for users who already constructed a Box. Worth a short doc comment noting "Arc, because the callback is shared with a spawned worker thread."
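
A small sketch of why the two pointer types end up different. The `Callbacks` struct and its signatures are hypothetical, not Pluto's `Definition`; the point is only that a callback moved into a spawned thread must be cheaply clonable, and that a caller who already holds a `Box` can convert it with `Arc::from`.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical callback signatures for this sketch.
type CompareFn = Arc<dyn Fn(u64, u64) -> bool + Send + Sync>;
type DecideFn = Box<dyn Fn(u64) + Send>;

struct Callbacks {
    /// `Arc`, because the callback is shared with a spawned worker thread.
    compare: CompareFn,
    /// `Box` is enough for a callback that is only invoked on the owning thread.
    decide: DecideFn,
}

/// Runs `compare` on a worker thread, then calls `decide` locally on success.
fn compare_then_decide(cb: &Callbacks, proposed: u64, local: u64) -> bool {
    let compare = Arc::clone(&cb.compare); // pointer clone, moved into the thread
    let ok = thread::spawn(move || compare(proposed, local))
        .join()
        .expect("compare callback panicked");
    if ok {
        (cb.decide)(proposed);
    }
    ok
}

/// Converts an already-boxed callback into the `Arc` form.
fn from_boxed(f: Box<dyn Fn(u64, u64) -> bool + Send + Sync>) -> CompareFn {
    Arc::from(f)
}
```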

5. `UnexpectedCompareError` is opaque

mod.rs:57-58 — it always means "compare returned a ChannelError or some other surprise." Consider UnexpectedCompareError(QbftError) so debugging logs carry the inner error. Optional.
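
One way this could look, sketched with a hypothetical `SketchError` enum rather than the real `QbftError`. Note that a variant wrapping its own enum has to box the inner value, since a type cannot contain itself by value.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical subset of an error enum; not Pluto's actual `QbftError`.
#[derive(Debug)]
enum SketchError {
    Channel(String),
    /// Boxed because an enum variant cannot hold its own enum by value.
    UnexpectedCompare(Box<SketchError>),
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::Channel(what) => write!(f, "channel error: {}", what),
            SketchError::UnexpectedCompare(inner) => {
                write!(f, "unexpected compare error: {}", inner)
            }
        }
    }
}

impl Error for SketchError {
    /// Exposes the wrapped error so logging layers can walk the chain.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            SketchError::UnexpectedCompare(inner) => Some(&**inner),
            SketchError::Channel(_) => None,
        }
    }
}
```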

6. Leader function leaves rounds with no leader at `round % N == 0`

internal_test.rs:600 — (instance + round) % n == process with processes 1..=N means round 4, 8, 12… have leader 0 (no one). For should_halt test this just wastes timeouts, but it's surprising. Charon's IsLeader test helper uses process == round%n with process 0-indexed; Pluto uses 1-indexed processes. If you want exact parity, switch to (instance + round - 1) % n + 1.
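
The arithmetic can be checked with a short sketch. The function names are made up for this example, and it assumes `instance = 0` and `n = 4` to reproduce the "round 4, 8, 12" pattern; `round` is assumed to start at 1, so `instance + round - 1` does not underflow.

```rust
/// Leader predicate as described above: 1-indexed processes, plain modulo.
fn is_leader_plain(instance: u64, round: u64, process: u64, n: u64) -> bool {
    (instance + round) % n == process
}

/// Suggested 1-indexed variant: maps every round to a process in `1..=n`.
fn leader_one_indexed(instance: u64, round: u64, n: u64) -> u64 {
    (instance + round - 1) % n + 1
}

/// Rounds in `1..=max_round` for which no process in `1..=n` is the leader.
fn leaderless_rounds(instance: u64, n: u64, max_round: u64) -> Vec<u64> {
    (1..=max_round)
        .filter(|&round| !(1..=n).any(|p| is_leader_plain(instance, round, p, n)))
        .collect()
}
```

With the plain modulo, process `n` can never be elected, because `x % n` is never equal to `n`; the shifted formula closes that gap.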

7. CodeQL alerts on `0x5142…` and `0x4348…` seeds

internal_test.rs:16, 586 — the latest commit ("removed hard coded salt in tests") suggests these were addressed, but both constants are still present in the file. CodeQL appears to be flagging the literal pattern. Two options: (a) mark the alerts as false-positives in the GitHub UI (they're test-only PRNG seeds, not crypto material), or (b) derive them at runtime (e.g. via std::process::id() or env var with deterministic override) so the literal pattern goes away. (a) is fine if the team accepts.
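
A sketch of option (b), assuming the seed is derived from a descriptive string with an optional environment override. `seed_from_label`, `test_seed`, and the `QBFT_TEST_SEED` variable are hypothetical names. `DefaultHasher::new()` gives the same result on every run for a given toolchain, but its algorithm is not guaranteed to stay the same across Rust releases, so seeds derived this way should not be treated as stable fixtures. Whether this silences the scanner is untested here.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derives a PRNG seed from a human-readable label instead of a hex literal.
fn seed_from_label(label: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    label.hash(&mut hasher);
    hasher.finish()
}

/// Uses `QBFT_TEST_SEED` (hypothetical env var) when it parses as a `u64`,
/// otherwise falls back to the label-derived seed.
fn test_seed(label: &str) -> u64 {
    std::env::var("QBFT_TEST_SEED")
        .ok()
        .and_then(|raw| raw.parse::<u64>().ok())
        .unwrap_or_else(|| seed_from_label(label))
}
```

The override makes it possible to replay a failing fuzz run with a specific seed without editing the test source.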

8. Spawned `compare` callback can leak the thread on a misbehaving callback

mod.rs:638-647 — if d.compare ignores the token and never writes to compare_err/compare_value, the spawned thread outlives compare(). Matches Charon's "leaked goroutine" semantics, so this is parity, but worth a doc note since thread::spawn doesn't carry the same intuition as a goroutine.
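
The pattern in question, reduced to a sketch: the callback runs on a detached thread and the caller waits no longer than the round timeout. `CompareOutcome` and `compare_with_timeout` are hypothetical names, an `AtomicBool` stands in for the cancellation token, and `std::sync::mpsc` stands in for the real channels.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{self, RecvTimeoutError};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum CompareOutcome {
    Finished(bool),
    TimedOut,
    CallbackGone,
}

/// Runs `callback` on a detached thread and waits at most `round_timeout` for it.
fn compare_with_timeout<F>(callback: F, round_timeout: Duration) -> CompareOutcome
where
    F: FnOnce(Arc<AtomicBool>) -> bool + Send + 'static,
{
    let cancelled = Arc::new(AtomicBool::new(false));
    let (result_tx, result_rx) = mpsc::channel();

    let token = Arc::clone(&cancelled);
    // Detached: the JoinHandle is dropped, so nothing ever joins this thread.
    thread::spawn(move || {
        let _ = result_tx.send(callback(token));
    });

    match result_rx.recv_timeout(round_timeout) {
        Ok(ok) => CompareOutcome::Finished(ok),
        Err(RecvTimeoutError::Timeout) => {
            // Ask the callback to stop. A callback that ignores the flag keeps
            // its OS thread alive after this function has returned.
            cancelled.store(true, Ordering::SeqCst);
            CompareOutcome::TimedOut
        }
        Err(RecvTimeoutError::Disconnected) => CompareOutcome::CallbackGone,
    }
}
```

Unlike a goroutine, each leaked callback here pins a full OS thread and its stack, which is the reason a doc note on the real function would be useful.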

9. Test infra: `result_chan_tx` bounded N then re-sent N times

internal_test.rs:91 — capacity is N (4). With consensus, exactly 4 sends happen, so it fits. If a future change adds extra decide callbacks (e.g. retry/duplicate), this becomes a silent deadlock waiting for the receiver. Consider unbounded since this is a test-only channel.
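
The capacity concern can be demonstrated with `std::sync::mpsc` as a stand-in for the test's channel type (an assumption; the real test may use a different channel implementation). The helper names are hypothetical, and `try_send` is used so the sketch reports the full channel instead of blocking.

```rust
use std::sync::mpsc::{self, TrySendError};

/// Sends `sends` values into a channel of the given capacity while nothing drains it;
/// returns how many were accepted before the channel reported it was full.
fn accepted_before_full(capacity: usize, sends: usize) -> usize {
    let (tx, _rx) = mpsc::sync_channel::<usize>(capacity);
    let mut accepted = 0;
    for value in 0..sends {
        match tx.try_send(value) {
            Ok(()) => accepted += 1,
            // A blocking `send` would park here until the receiver drains.
            Err(TrySendError::Full(_)) => break,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    accepted
}

/// An unbounded channel accepts every send regardless of receiver progress.
fn accepted_unbounded(sends: usize) -> usize {
    let (tx, _rx) = mpsc::channel::<usize>();
    (0..sends).filter(|&v| tx.send(v).is_ok()).count()
}
```

With capacity 4, the fifth send is the one that would block; an unbounded channel removes that failure mode at the cost of unbounded buffering, which is usually acceptable in a test.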

10. Nit: `assert_eq!(vec![true, true], done_rx.try_iter().collect::<Vec<_>>())`

fake_clock.rs:126, 153 — ordering between the two sender threads isn't guaranteed; sorting before compare would be safer if the assertion ever expanded to non-uniform values. For true, true it's fine today.
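
A sketch of the order-independent form, using a hypothetical `collect_sorted` helper and `std::sync::mpsc` in place of the test's channel: each sender thread contributes one value, and the collected values are sorted before being compared.

```rust
use std::sync::mpsc;
use std::thread;

/// Collects one value from each sender thread and sorts the result, so thread
/// scheduling cannot change what the assertion sees.
fn collect_sorted(values: Vec<u32>) -> Vec<u32> {
    let (tx, rx) = mpsc::channel();
    let handles: Vec<_> = values
        .into_iter()
        .map(|v| {
            let tx = tx.clone();
            thread::spawn(move || tx.send(v).unwrap())
        })
        .collect();
    drop(tx);
    for handle in handles {
        handle.join().unwrap();
    }
    let mut got: Vec<u32> = rx.try_iter().collect();
    got.sort_unstable();
    got
}
```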

Verdict

LGTM after addressing the cancellation-return-value parity question (#2) and the CodeQL alerts (#7). Everything else is optional polish. Nice incremental hardening of QBFT and a meaningful test surface improvement.
· Branch: iamquang95/qbft

varex83agent

Solid hardening pass on QBFT — the chain-split tests, fuzz cases, and Byzantine-input filters meaningfully tighten the implementation, and the compare refactor finally lets the round timer win over a blocked callback. A few items deserve attention before this fully matches Charon's robustness:

Minor (worth addressing):

compare callback exiting/panicking without sending on the err channel kills the QBFT instance (UnexpectedCompareError path). Charon Go quietly waits for the round timer instead — Pluto is strictly less resilient.
Each UPON_JUSTIFIED_PRE_PREPARE now spawns a detached OS thread. A callback that ignores the cancellation token leaks a thread per invocation; over hours this becomes a slow-burn DoS. (Documented as Charon parity, but OS threads aren't goroutines.)
thread::sleep(Duration::from_millis(10)) is used as test synchronization in compare_parent_cancel_cancels_callback_token — replace with a compare_started_rx signal like the sister test already does.
make_is_leader reshapes Go's modulo so process 4 wins at round 4 where Charon would elect no leader. Hidden test-only parity gap that may mask bugs in chain-split scenarios.
The new valid_round_change_prepared_round Byzantine check rejects messages Charon currently accepts — coordinate with the Charon team to add the same check upstream, or log a warning when it fires.
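
The first item can be sketched as follows: when the callback's sender is dropped without a result, the waiter keeps waiting for the round deadline instead of returning an error. `Wait` and `wait_for_compare` are hypothetical names, and `std::sync::mpsc` stands in for the real channels; this is one possible shape of the "wait for the round timer" behaviour, not the actual Pluto or Charon code.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum Wait {
    Result(bool),
    RoundTimeout,
}

/// Waits for a compare result. If the callback exits without sending (its sender
/// is dropped), sleeps out the rest of the round instead of failing the instance.
fn wait_for_compare(result_rx: &mpsc::Receiver<bool>, round_timeout: Duration) -> Wait {
    let deadline = Instant::now() + round_timeout;
    match result_rx.recv_timeout(round_timeout) {
        Ok(ok) => Wait::Result(ok),
        Err(RecvTimeoutError::Timeout) => Wait::RoundTimeout,
        Err(RecvTimeoutError::Disconnected) => {
            // Callback is gone: behave as if the round timer fired at its usual time.
            thread::sleep(deadline.saturating_duration_since(Instant::now()));
            Wait::RoundTimeout
        }
    }
}
```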

Plus several nits (naming, a dead filter, missing docs, missing boundary tests). Verdict: COMMENT — none of these block merge, but the compare-callback robustness deserves a follow-up.

varex83agent · 2026-05-18T12:45:11Z

@@ -98,56 +98,86 @@ impl FakeClock {
    }
 }


nit: the previous impl Drop for FakeClock auto-cancelled timers when the clock went out of scope. Removing it is correct for the new test pattern (FakeClock is Clone; threads hold their own clones, so one clone being dropped must not cancel the others), but no comment explains the design. Add a one-line comment near cancel documenting the explicit-cleanup expectation so a future contributor doesn't re-introduce Drop — and consider a test-helper macro that wraps the cts.cancel(); clock.cancel(); panic!(...) boilerplate, since every panic path in test_qbft / test_qbft_chain_split must remember to cancel both before panicking.
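
A possible shape for such a helper, as a sketch. `cancel_and_panic!`, `Cancellable`, and `check_decided` are hypothetical names; the macro only assumes that both arguments expose a `cancel()` method.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for anything with an idempotent `cancel()`.
#[derive(Default)]
struct Cancellable {
    cancelled: AtomicBool,
}

impl Cancellable {
    fn cancel(&self) {
        self.cancelled.store(true, Ordering::SeqCst);
    }

    fn is_cancelled(&self) -> bool {
        self.cancelled.load(Ordering::SeqCst)
    }
}

/// Cancels both handles, then panics with the formatted message.
macro_rules! cancel_and_panic {
    ($cts:expr, $clock:expr, $($arg:tt)+) => {{
        $cts.cancel();
        $clock.cancel();
        panic!($($arg)+)
    }};
}

/// Example test-side check that fails through the macro.
fn check_decided(cts: &Cancellable, clock: &Cancellable, decided: Option<u64>) -> u64 {
    match decided {
        Some(value) => value,
        None => cancel_and_panic!(cts, clock, "no value decided"),
    }
}
```

Because the cleanup lives in one place, a new panic path only has to use the macro to get both cancellations.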

test: update qbft test

606d5bc

iamquang95 changed the title ~~fix: qbft~~ fix: QBFT May 15, 2026

github-advanced-security AI found potential problems May 15, 2026

iamquang95 added 2 commits May 18, 2026 10:22

fix: compare run on retached thread

0ec38de

fix: removed hard coded salt in tests

b752caa

iamquang95 added 7 commits May 18, 2026 12:52

fix: return error on Context cancelled

ed912ee

fix: hash from string, not magic number

e01f64f

fix: one shot cancel when parent is cancelled

b22fb41

fix: fix make_is_leader test

081d8a4

fix: minors naming and comments

1a0a337

Merge remote-tracking branch 'origin/main' into iamquang95/qbft

183f48e

fix: linter

44b9a13

iamquang95 marked this pull request as ready for review May 18, 2026 07:54

iamquang95 added 5 commits May 18, 2026 18:28

fix: early cancel on the loop

cc59baf

fix: context cancel in compare

ed82b6e

fix: validate definition

7911c23

fix: add check pr < r

829ba5f

fix: test and document on run

d834463

varex83agent reviewed May 18, 2026

iamquang95 added 9 commits May 19, 2026 11:04

fix: compare callback failed should timeout

739e3b3

fix: test use timeout channel instead of sleep

27744cf

fix: add comment on cancellation poll interval

0bd6a4f

fix: use enum for invalid defnintion error

e988c35

fix: add more test on check valid round

6c65b32

fix: using test-case

b30e616

fix: small fixes

4a364df

fix: linter

e759a81

Merge remote-tracking branch 'origin/main' into iamquang95/qbft

2d9ef3c

fix: remove unnecessary filter

998b9ce

iamquang95 requested review from emlautarom1, mskrzypkows, therustmonk and varex83 May 19, 2026 09:52

therustmonk approved these changes May 19, 2026

iamquang95 added 2 commits May 19, 2026 23:42

fix: make_is_leader now 0-based

0451f20

fix: add cancelled state for fake clock

e0e51f7

5 participants
