feat(payment): fire trust events on provably-bad quote bindings (node-side audit) #97
grumbach wants to merge 1 commit into
…-side audit)

Wires the existing detection paths in PaymentVerifier to saorsa-core's TrustEvent system so provably-bad quote behaviour feeds Mick's lazy swap-out (saorsa-core PR WithAutonomi#65) instead of being detected, rejected, then forgotten.

Why
===

Prod fleet (24 ant-node v0.11.3 hosts, 3h47m window 2026-05-14 23:15 → 2026-05-15 03:02 UTC): 4 distinct nodes (do-ams3-node-2, do-fra1-node-1, hz-hel1-2, hz-nbg1-3) reported BLAKE3 quote-binding mismatches.

The 2026-05-06 attack: 5 distinct peer-IDs all signed quotes with one of 3 shared private keys (worst case: 3 peer-IDs share BLAKE3 key 5bd35c6c...), captured 5 of 15 close-K slots, and forced quorum failure.

Today validate_peer_bindings (verifier.rs:590) and the signature-verify loop (verifier.rs:466) both detect the bad behaviour and return Err, but neither calls report_trust_event, so the same offender reappears in the next chunk's close-K and the cycle continues.

Per Mick's #05_client-side direction (2026-05-13): node-side quote audit, not client-side trust reporting. PRs #114/WithAutonomi#90/WithAutonomi#77 (client side) were closed accordingly. This PR adds the node-side wiring without introducing a new audit protocol; the audit happens implicitly during payment verification.

What
====

Three trust-event sites, each with its own weight constant:

- QUOTE_BINDING_MISMATCH_WEIGHT = 5.0 (max). BLAKE3 binding mismatch is deterministic and non-spoofable: the payer cannot fabricate this against an innocent peer. Weight crosses BLOCK_THRESHOLD in one event so the offender becomes immediately ineligible for new admissions.
- QUOTE_SIGNATURE_FAILURE_WEIGHT = 1.0 (moderate). After binding has passed, signature failure proves only that the quote bytes do not verify under pub_key; it does NOT prove the peer misbehaved. A malicious payer could flip a bit in quote.price after taking a valid quote, and the verifier would otherwise penalise the innocent peer at zero attacker cost. Lower weight degrades reputation under sustained patterns without single-event blocking.
- MERKLE_CANDIDATE_SIGNATURE_FAILURE_WEIGHT = 5.0 (max). Self-signed by the candidate node: no payer in between, attribution sound.

Per-proof dedup: a peer appearing in multiple slots produces at most one trust event per proof, capping trust-engine write-lock cost at O(distinct_offenders) regardless of attacker-controlled proof shape.

The 'Invalid ML-DSA public key' branch (undecodable bytes) deliberately fires NO trust event. Corrupt bytes cannot be attributed to any real peer, and penalising a random peer-ID derived from BLAKE3 of garbage would attack a non-existent identity.

Adversarial review (3 high-severity findings, all addressed)
============================================================

1. Test did not assert multi-offender iteration. Rewrote validate_peer_bindings_does_not_short_circuit_on_first_valid_quote to put a CORRECTLY-bound quote at position 0; if the loop short-circuits the test fails. Asserts the err names the first mismatched peer-id (proving iteration past position 0).
2. DoS amplification on multi-offender proofs. Added per-proof dedup so 16 occurrences of the same offender produce 1 trust event, not 16.
3. Signature-failure attribution can punish innocents (a payer bit-flips a valid quote in transit). Split the weight constant: signature failures use 1.0, binding mismatches use 5.0.

Tests
=====

3 new tests, 463 lib tests pass, fmt clean, cargo clippy --all-features --all-targets -- -D warnings clean, cargo clippy --all-features --lib -- -D clippy::panic -D clippy::unwrap_used -D clippy::expect_used -D warnings clean. Existing test_wrong_peer_binding_rejected continues to pass: same Err semantics, trust events fire on the way to the same error.

Tooling notes
=============

- validate_peer_bindings is now async on &self (was static fn). Single caller updated.
- Signature-verify spawn_blocking now collects (peer_id, valid) results instead of bailing on first failure. Order preserved; first error returned matches pre-patch behaviour.
- Same shape applied to verify_merkle_payment's candidate-signature loop.
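The collect-instead-of-bail shape from the tooling notes could look roughly like the sketch below. It is synchronous and simplified: `PeerIdBytes` and `collect_failures` are hypothetical names, not identifiers from verifier.rs, and the real code runs the verification inside `spawn_blocking` and reports asynchronously.

```rust
/// Hypothetical stand-in for the 32-byte peer id carried in a proof.
type PeerIdBytes = [u8; 32];

/// Walks every (peer, valid) result instead of returning at the first failure.
/// Returns the error for the first failing entry (so callers still see the
/// same error as before) plus every failing peer in original order.
fn collect_failures(results: &[(PeerIdBytes, bool)]) -> (Option<String>, Vec<PeerIdBytes>) {
    let mut first_error: Option<String> = None;
    let mut failed = Vec::new();
    for (peer, valid) in results {
        if *valid {
            continue;
        }
        failed.push(*peer);
        // Only the first failure becomes the returned error.
        if first_error.is_none() {
            first_error = Some(format!(
                "signature verification failed for peer {:02x}{:02x}..",
                peer[0], peer[1]
            ));
        }
    }
    (first_error, failed)
}
```

The point of the shape is that the error surfaced to the caller is unchanged while the loop still visits every entry, which is what makes per-offender reporting possible at all.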
Pull request overview
Wires three node-side trust-event reports into PaymentVerifier so peers that ship provably-bad quotes (BLAKE3 binding mismatch, ML-DSA signature failure on a quote, or merkle-candidate signature failure) feed saorsa-core's lazy swap-out machinery instead of being silently rejected. Adds per-proof dedup, makes validate_peer_bindings async on &self, refactors the spawn_blocking signature path to collect per-quote results so the loop no longer short-circuits, and adds three unit tests around the binding path.
Changes:
- New weight constants (`QUOTE_BINDING_MISMATCH_WEIGHT = 5.0`, `QUOTE_SIGNATURE_FAILURE_WEIGHT = 1.0`, `MERKLE_CANDIDATE_SIGNATURE_FAILURE_WEIGHT = 5.0`) and a `report_peer_failure` helper that calls `P2PNode::report_trust_event` when attached.
- `validate_peer_bindings` becomes `async fn(&self, …)`, walks the full proof, dedups offenders, and reports binding mismatches; the signature-verify path collects `(EncodedPeerId, bool)` and reports failures with dedup; `verify_merkle_payment` does the same for candidate-signature failures.
- Three new `#[tokio::test]`s covering the no-short-circuit shape, no-panic when `P2PNode` is unattached, and the all-valid pass-through case.
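For readers unfamiliar with the "report when attached" pattern in the first bullet, here is a minimal synchronous sketch. Everything in it is a stand-in: `MockNode`, `Verifier`, and `attach_node` are hypothetical, and the real helper is async and calls into saorsa-core rather than recording events in a `Vec`.

```rust
use std::sync::{Arc, Mutex, RwLock};

/// Hypothetical stand-in for the P2P node: records events instead of
/// updating a real trust engine.
#[derive(Default)]
struct MockNode {
    events: Mutex<Vec<(String, f64)>>,
}

impl MockNode {
    fn report_trust_event(&self, peer_hex: &str, weight: f64) {
        if let Ok(mut events) = self.events.lock() {
            events.push((peer_hex.to_string(), weight));
        }
    }
}

/// Hypothetical verifier holding an optional node handle behind a lock.
#[derive(Default)]
struct Verifier {
    node: RwLock<Option<Arc<MockNode>>>,
}

impl Verifier {
    fn attach_node(&self, node: Arc<MockNode>) {
        if let Ok(mut slot) = self.node.write() {
            *slot = Some(node);
        }
    }

    /// Reports a failure if a node is attached; returns whether it was delivered.
    /// With no node attached this is a silent no-op rather than a panic.
    fn report_peer_failure(&self, peer_hex: &str, weight: f64) -> bool {
        // Clone the Arc so the read lock is released before calling the node.
        let node = match self.node.read() {
            Ok(guard) => guard.clone(),
            Err(_) => None,
        };
        match node {
            Some(n) => {
                n.report_trust_event(peer_hex, weight);
                true
            }
            None => false,
        }
    }
}
```

Cloning the `Arc` out of the lock before reporting keeps the lock hold time short, which matters if the reporting call itself takes other locks.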
```diff
 if expected_peer_id.as_bytes() != encoded_peer_id.as_bytes() {
     let expected_hex = expected_peer_id.to_hex();
     let actual_hex = hex::encode(encoded_peer_id.as_bytes());
-    return Err(Error::Payment(format!(
-        "Quote pub_key does not belong to claimed peer {encoded_peer_id:?}: \
-         BLAKE3(pub_key) = {expected_hex}, peer_id = {actual_hex}"
-    )));
+    // Provably bad behaviour: penalise the peer who claimed
+    // this binding. Use the EncodedPeerId from the proof —
+    // that is the identity routing-table lookups will hit.
+    let offender_bytes = *encoded_peer_id.as_bytes();
+    if reported.insert(offender_bytes) {
+        let offender = PeerId::from_bytes(offender_bytes);
+        self.report_peer_failure(
+            &offender,
+            QUOTE_BINDING_MISMATCH_WEIGHT,
+            "BLAKE3 quote-binding mismatch",
+        )
+        .await;
+    }
+    if first_error.is_none() {
+        first_error = Some(Error::Payment(format!(
+            "Quote pub_key does not belong to claimed peer {encoded_peer_id:?}: \
+             BLAKE3(pub_key) = {expected_hex}, peer_id = {actual_hex}"
+        )));
+    }
 }
```
```rust
// produces at most one trust event. See validate_peer_bindings
// for the same rationale (cap write-lock cost on the trust
// engine regardless of attacker-controlled proof shape).
//
// Signature failures use a lower weight than binding mismatches
// (see [`QUOTE_SIGNATURE_FAILURE_WEIGHT`] doc): a malicious
// payer can flip a bit in `quote.price` after taking a valid
// quote from an honest peer, and the verifier would otherwise
// penalise the innocent peer at zero attacker cost.
let mut sig_error: Option<Error> = None;
let mut sig_reported: std::collections::HashSet<[u8; 32]> =
    std::collections::HashSet::new();
for (encoded_peer_id, valid) in sig_results {
    if !valid {
        let offender_bytes = *encoded_peer_id.as_bytes();
        if sig_reported.insert(offender_bytes) {
            let offender = PeerId::from_bytes(offender_bytes);
            self.report_peer_failure(
                &offender,
                QUOTE_SIGNATURE_FAILURE_WEIGHT,
                "ML-DSA-65 signature verification failed",
            )
            .await;
        }
        if sig_error.is_none() {
            sig_error = Some(Error::Payment(format!(
                "Quote ML-DSA-65 signature verification failed for peer {encoded_peer_id:?}"
            )));
        }
```
```diff
 if !crate::payment::verify_merkle_candidate_signature(candidate) {
-    return Err(Error::Payment(format!(
-        "Invalid ML-DSA-65 signature on merkle candidate node (reward: {})",
-        candidate.reward_address
-    )));
+    if let Ok(offender) = peer_id_from_public_key_bytes(&candidate.pub_key) {
+        if merkle_reported.insert(*offender.as_bytes()) {
+            self.report_peer_failure(
+                &offender,
+                MERKLE_CANDIDATE_SIGNATURE_FAILURE_WEIGHT,
+                "merkle candidate ML-DSA-65 signature verification failed",
+            )
+            .await;
+        }
+    }
```
Closing: adversarial review found a design flaw. validate_peer_bindings attributes blame to encoded_peer_id, which is attacker-controlled proof bytes: a junk proof naming any honest peer plus a random pub_key fires ApplicationFailure(5.0) against that peer at zero cost (no payment, no victim signature), before payment verification. This is a remote trust-poisoning primitive (downscore any peer ID below the swap threshold). Restarting from a new design: attribute only to bindings the quoted peer itself signed, and fire trust events only after on-chain payment verification.
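To make the flaw concrete, here is a small sketch of the attribution problem. It is not the verifier's code: `derive_peer_id` uses the standard library's `DefaultHasher` purely as a stand-in for BLAKE3(pub_key), peer ids are shortened to `u64`, and `Binding` and `blamed_on_mismatch` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in for BLAKE3(pub_key); the real derivation is not reproduced here.
fn derive_peer_id(pub_key: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    pub_key.hash(&mut hasher);
    hasher.finish()
}

/// One (claimed id, public key) pair as it appears in an unverified proof.
/// Both fields are chosen by whoever built the proof.
struct Binding {
    claimed_peer_id: u64,
    pub_key: Vec<u8>,
}

/// Flawed attribution: on mismatch, blame whoever the proof names.
fn blamed_on_mismatch(binding: &Binding) -> Option<u64> {
    if derive_peer_id(&binding.pub_key) != binding.claimed_peer_id {
        Some(binding.claimed_peer_id)
    } else {
        None
    }
}
```

Because the sender picks both fields, a proof that pairs an honest peer's id with arbitrary key bytes always mismatches, and the blame lands on the named peer even though that peer signed nothing.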
Why

Prod fleet measurement (24 ant-node v0.11.3 hosts, 3h47m window 2026-05-14 23:15 → 2026-05-15 03:02 UTC): 4 distinct nodes (`do-ams3-node-2`, `do-fra1-node-1`, `hz-hel1-2`, `hz-nbg1-3`) reported BLAKE3 quote-binding mismatches.

The 2026-05-06 attack on `world_trade.JPG`: 5 distinct peer-IDs all signed quotes with one of 3 shared private keys (worst case: peer-IDs `5b45e9d7`, `5b14976c`, `5be5095b` share BLAKE3 key `5bd35c6c...`). Those 5 captured 5 of 15 close-K slots and forced quorum failure. Today the same operator IP (75.48.86.24) is being dialed 35,592 times across the prod fleet in the same 3h47m window.

Today `validate_peer_bindings` (verifier.rs:590) and the signature-verify loop (verifier.rs:466) detect the bad behaviour and return Err but neither calls `report_trust_event`. Result: detected, rejected, then forgotten. The same offender reappears in the next chunk's close-K and the cycle continues.

Per Mick's #05_client-side direction (2026-05-13): node-side quote audit, not client-side trust reporting. Anselme's PRs #114/#90/#77 (client side) were closed accordingly. This PR adds the node-side wiring without introducing a new audit protocol; the audit happens implicitly during payment verification.
What

Three trust-event sites, each with its own weight constant calibrated to the soundness of the attribution:

| Site | Constant | Weight | Rationale |
| --- | --- | --- | --- |
| `validate_peer_bindings` | `QUOTE_BINDING_MISMATCH_WEIGHT` | 5.0 (max) | BLAKE3 binding mismatch is deterministic and non-spoofable; the payer cannot fabricate this against an innocent peer. Weight crosses `BLOCK_THRESHOLD` in one event so the offender becomes immediately ineligible for new admissions. |
| `verify_evm_payment` | `QUOTE_SIGNATURE_FAILURE_WEIGHT` | 1.0 (moderate) | After binding has passed, signature failure proves only that the quote bytes do not verify under `pub_key`; it does NOT prove the peer misbehaved. A malicious payer could flip a bit in `quote.price` after taking a valid quote, and the verifier would otherwise penalise the innocent peer at zero attacker cost. Lower weight degrades reputation under sustained patterns without single-event blocking. |
| `verify_merkle_payment` | `MERKLE_CANDIDATE_SIGNATURE_FAILURE_WEIGHT` | 5.0 (max) | Self-signed by the candidate node: no payer in between, attribution sound. |

Per-proof dedup: a peer appearing in multiple slots of the same proof produces at most one trust event per proof, capping trust-engine write-lock cost at `O(distinct_offenders)` regardless of attacker-controlled proof shape.

The "Invalid ML-DSA public key" branch (undecodable bytes) deliberately fires NO trust event. Corrupt bytes cannot be attributed to any real peer; penalising a random peer-ID derived from BLAKE3 of garbage would attack a non-existent identity.
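A rough sketch of the per-proof dedup and the "no event for undecodable keys" rule, under simplifying assumptions: `SlotCheck` and `offenders_to_report` are hypothetical names, and the real code interleaves this with async reporting rather than returning a list.

```rust
use std::collections::HashSet;

/// Outcome of checking one slot of a proof (hypothetical classification).
enum SlotCheck {
    Ok,
    /// A failure attributed to this 32-byte peer id.
    Mismatch([u8; 32]),
    /// Public key bytes could not be decoded, so there is nobody to blame.
    UndecodableKey,
}

/// Returns the peers that should receive exactly one trust event for this
/// proof, in first-seen order.
fn offenders_to_report(slots: &[SlotCheck]) -> Vec<[u8; 32]> {
    let mut reported = HashSet::new();
    let mut out = Vec::new();
    for slot in slots {
        // Ok and UndecodableKey slots never produce a report.
        if let SlotCheck::Mismatch(peer) = slot {
            // `insert` returns false for a repeat, so each peer appears once.
            if reported.insert(*peer) {
                out.push(*peer);
            }
        }
    }
    out
}
```

With this shape, a proof that repeats one offender in 16 slots yields a single entry, so the number of trust-engine writes depends on distinct offenders and not on how the proof was padded.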
Hooks into existing infrastructure

No new code paths, no new audit protocol, no wire-format changes:

- `PaymentVerifier` already holds `p2p_node: RwLock<Option<Arc<P2PNode>>>` (verifier.rs:141)
- `attach_p2p_node()` is already called at startup (verifier.rs:270)
- `P2PNode::report_trust_event(&PeerId, TrustEvent)` is async and exists in saorsa-core (src/network.rs:1001)
- `TrustEvent::ApplicationFailure(f64)` exists, weight clamped to `MAX_CONSUMER_WEIGHT = 5.0`
- Trust events feed the lazy swap-out from saorsa-core PR WithAutonomi#65 (feat!: replace binary peer blocking with lazy trust-based swap-out): no immediate eviction, but the next time a better candidate competes for the same routing-table slot the bad peer is replaced.

API change

`validate_peer_bindings` is now an `async fn(&self, ...) -> Result<()>` (was static `fn`). Single caller updated.

The `spawn_blocking` signature-verify path now collects per-quote `(EncodedPeerId, bool)` results instead of bailing on first failure. Order preserved; first error returned matches pre-patch behaviour. The async caller iterates and fires one report per failure before returning.

Adversarial review
Spawned a hostile reviewer subagent before push. 3 high-severity findings, all addressed:

1. Test did not assert multi-offender iteration. Rewrote `validate_peer_bindings_does_not_short_circuit_on_first_valid_quote` to put a CORRECTLY-bound quote at position 0; if the loop short-circuits on the first OK quote the test fails. Asserts the err names the first mismatched peer-id (proving iteration past position 0).
2. DoS amplification on multi-offender proofs. Originally a 16-mismatch proof would fire 16 sequential `report_trust_event` calls (each a brief write-lock on the trust engine). Added per-proof dedup so 16 occurrences of the same offender produce 1 trust event, not 16.
3. Signature-failure attribution could punish innocents. Originally used the max weight (5.0) for signature failures; a malicious payer could mutate a valid quote in transit and burn an innocent peer's trust at zero cost. Split the weight constant: signature failures use 1.0, binding mismatches keep 5.0.

Other lower-severity findings were noted but are not addressed in this PR (e.g. failed-proof negative cache to dedup across proofs, splitting log levels for the report). They're additive and can be follow-ups without re-architecting this PR.
Tests

- `test_wrong_peer_binding_rejected` continues to pass
- `cargo test --lib`: 463/463 pass
- `cargo fmt --all -- --check`: clean
- `cargo clippy --all-features --all-targets -- -D warnings`: clean
- `cargo clippy --all-features --lib -- -D clippy::panic -D clippy::unwrap_used -D clippy::expect_used`: clean (matches CLAUDE.md's strict lint set)

What this does NOT do