feat(payment): fire trust events on provably-bad quote bindings (node-side audit)#97

Closed
grumbach wants to merge 1 commit into
WithAutonomi:mainfrom
grumbach:grumbach/node-side-quote-audit

@grumbach
Collaborator

Why

Prod fleet measurement (24 ant-node v0.11.3 hosts, 3h47m window 2026-05-14 23:15 → 2026-05-15 03:02 UTC): 4 distinct nodes (do-ams3-node-2, do-fra1-node-1, hz-hel1-2, hz-nbg1-3) reported BLAKE3 quote-binding mismatches.

The 2026-05-06 attack on world_trade.JPG: 5 distinct peer-IDs all signed quotes with one of 3 shared private keys (worst case: peer-IDs 5b45e9d7, 5b14976c, 5be5095b share BLAKE3 key 5bd35c6c...). Those 5 peers captured 5 of 15 close-K slots and forced a quorum failure. Today the same operator IP (75.48.86.24) is being dialed 35,592 times across the prod fleet in the same 3h47m window.

Today validate_peer_bindings (verifier.rs:590) and the signature-verify loop (verifier.rs:466) detect the bad behaviour and return Err, but neither calls report_trust_event. Result: detected, rejected, then forgotten. The same offender reappears in the next chunk's close-K and the cycle continues.

Per Mick's #05_client-side direction (2026-05-13): node-side quote audit, not client-side trust reporting. Anselme's PRs #114/#90/#77 (client side) were closed accordingly. This PR adds the node-side wiring without introducing a new audit protocol — the audit happens implicitly during payment verification.

What

Three trust-event sites, each with its own weight constant calibrated to the soundness of the attribution:

| Site | Weight | Constant | Rationale |
| --- | --- | --- | --- |
| BLAKE3 binding mismatch in validate_peer_bindings | 5.0 | QUOTE_BINDING_MISMATCH_WEIGHT | Deterministic, non-spoofable. Payer cannot fabricate this against an innocent peer. Crosses BLOCK_THRESHOLD in one event so the offender becomes immediately ineligible for new admissions. |
| ML-DSA signature failure in verify_evm_payment | 1.0 | QUOTE_SIGNATURE_FAILURE_WEIGHT | After binding has passed, signature failure proves only that the quote bytes do not verify under pub_key; it does NOT prove the peer misbehaved. A malicious payer could flip a bit in quote.price after taking a valid quote, and the verifier would otherwise penalise the innocent peer at zero attacker cost. Lower weight degrades reputation under sustained patterns without single-event blocking. |
| Merkle candidate signature failure in verify_merkle_payment | 5.0 | MERKLE_CANDIDATE_SIGNATURE_FAILURE_WEIGHT | Self-signed by the candidate node: no payer in between, so attribution is sound. Same rationale as binding mismatch. |

Per-proof dedup: a peer appearing in multiple slots of the same proof produces at most one trust event per proof, capping trust-engine write-lock cost at O(distinct_offenders) regardless of attacker-controlled proof shape.
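
A minimal sketch of the per-proof dedup, assuming a simplified setting: `TrustLog` and `report_mismatches` are hypothetical stand-ins (not the verifier's real types), peer IDs are plain 32-byte arrays, and reporting is synchronous. It only illustrates how a `HashSet` caps the number of trust events at the number of distinct offenders.

```rust
use std::collections::HashSet;

/// Weight taken from the table above.
const QUOTE_BINDING_MISMATCH_WEIGHT: f64 = 5.0;

/// Hypothetical stand-in for the trust engine; it only records events.
#[derive(Default)]
struct TrustLog {
    events: Vec<([u8; 32], f64)>,
}

impl TrustLog {
    fn report(&mut self, peer: [u8; 32], weight: f64) {
        self.events.push((peer, weight));
    }
}

/// Report each distinct offender at most once per proof.
/// `mismatched` holds the peer ID of every slot whose binding check failed.
/// Returns the number of distinct offenders reported.
fn report_mismatches(mismatched: &[[u8; 32]], log: &mut TrustLog) -> usize {
    let mut reported: HashSet<[u8; 32]> = HashSet::new();
    for peer in mismatched {
        // `insert` returns false when this peer was already reported for this proof.
        if reported.insert(*peer) {
            log.report(*peer, QUOTE_BINDING_MISMATCH_WEIGHT);
        }
    }
    reported.len()
}
```

With four failing slots of which three name the same peer, this sketch records two events rather than four.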

The "Invalid ML-DSA public key" branch (undecodable bytes) deliberately fires NO trust event. Corrupt bytes cannot be attributed to any real peer; penalising a random peer-ID derived from BLAKE3 of garbage would attack a non-existent identity.

Hooks into existing infrastructure

No new code paths, no new audit protocol, no wire-format changes:

  • PaymentVerifier already holds p2p_node: RwLock<Option<Arc<P2PNode>>> (verifier.rs:141)
  • attach_p2p_node() is already called at startup (verifier.rs:270)
  • P2PNode::report_trust_event(&PeerId, TrustEvent) is async and exists in saorsa-core (src/network.rs:1001)
  • TrustEvent::ApplicationFailure(f64) exists, weight clamped to MAX_CONSUMER_WEIGHT = 5.0
  • Trust scores feed Mick's lazy swap-out (saorsa-core PR #65, feat!: replace binary peer blocking with lazy trust-based swap-out): no immediate eviction, but the next time a better candidate competes for the same routing-table slot the bad peer is replaced.
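
The shape of the hook can be sketched as follows. This is an illustration under stated assumptions, not the code in verifier.rs: `MockNode` is a hypothetical stand-in for `P2PNode`, the helper is synchronous (the real `report_trust_event` is async), the clamp is done in the mock purely to show the cap, and `unwrap()` is used for brevity even though the repository's strict lint set forbids it.

```rust
use std::sync::{Arc, Mutex, RwLock};

/// Cap named in the list above.
const MAX_CONSUMER_WEIGHT: f64 = 5.0;

/// Hypothetical stand-in for `P2PNode`; records reports instead of scoring peers.
#[derive(Default)]
struct MockNode {
    reports: Mutex<Vec<([u8; 32], f64)>>,
}

/// Simplified verifier holding an optional node handle.
#[derive(Default)]
struct Verifier {
    p2p_node: RwLock<Option<Arc<MockNode>>>,
}

impl Verifier {
    /// Called once at startup in this sketch.
    fn attach_p2p_node(&self, node: Arc<MockNode>) {
        *self.p2p_node.write().unwrap() = Some(node);
    }

    /// Returns true if a report was delivered, false if no node is attached.
    fn report_peer_failure(&self, peer: [u8; 32], weight: f64) -> bool {
        // Clone the Arc so the read lock is released before reporting.
        let node = self.p2p_node.read().unwrap().clone();
        match node {
            Some(node) => {
                let clamped = weight.min(MAX_CONSUMER_WEIGHT);
                node.reports.lock().unwrap().push((peer, clamped));
                true
            }
            // Unattached (e.g. in unit tests): silently do nothing.
            None => false,
        }
    }
}
```

The point of the design is that verification keeps working when no node is attached; reporting is a best-effort side effect, not a precondition.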

API change

validate_peer_bindings is now an async fn(&self, ...) -> Result<()> (was static fn). Single caller updated.

The spawn_blocking signature-verify path now collects per-quote (EncodedPeerId, bool) results instead of bailing on first failure. Order preserved; first error returned matches pre-patch behaviour. The async caller iterates and fires one report per failure before returning.
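
As a rough sketch of the "collect, report every failure, return the first error" pattern: the function name, the `String` peer labels and the error text below are illustrative assumptions, and the real code works on `EncodedPeerId` values and `Error::Payment`.

```rust
/// Walk per-quote verification results in order. Every failing peer is passed
/// to `report`; the error built for the first failure is what the caller returns.
fn first_error_after_reporting(
    results: &[(String, bool)],
    mut report: impl FnMut(&str),
) -> Result<(), String> {
    let mut first_error: Option<String> = None;
    for (peer, valid) in results {
        if !*valid {
            report(peer.as_str());
            // Keep only the first error so the returned Err matches
            // what a bail-on-first-failure loop would have produced.
            if first_error.is_none() {
                first_error = Some(format!("signature verification failed for peer {peer}"));
            }
        }
    }
    match first_error {
        Some(err) => Err(err),
        None => Ok(()),
    }
}
```

Callers see the same `Err` as before; the only observable difference is that later failures in the same proof are also reported.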

Adversarial review

Spawned a hostile reviewer subagent before push. 3 high-severity findings, all addressed:

  1. Test did not assert multi-offender iteration. Rewrote validate_peer_bindings_does_not_short_circuit_on_first_valid_quote to put a CORRECTLY-bound quote at position 0; if the loop short-circuits on the first OK quote the test fails. Asserts the err names the first mismatched peer-id (proving iteration past position 0).

  2. DoS amplification on multi-offender proofs. Originally a 16-mismatch proof would fire 16 sequential report_trust_event calls (each a brief write-lock on the trust engine). Added per-proof dedup so 16 occurrences of the same offender produce 1 trust event, not 16.

  3. Signature-failure attribution could punish innocents. Originally used the max weight (5.0) for signature failures; a malicious payer could mutate a valid quote in transit and burn an innocent peer's trust at zero cost. Split the weight constant: signature failures use 1.0, binding mismatches keep 5.0.

Other lower-severity findings noted but not in this PR (e.g. failed-proof negative cache to dedup across proofs, splitting log levels for the report). They're additive — can be follow-ups without re-architecting this PR.

Tests

  • 3 new unit tests + existing test_wrong_peer_binding_rejected continues to pass
  • cargo test --lib — 463/463 pass
  • cargo fmt --all -- --check — clean
  • cargo clippy --all-features --all-targets -- -D warnings — clean
  • cargo clippy --all-features --lib -- -D clippy::panic -D clippy::unwrap_used -D clippy::expect_used — clean (matches CLAUDE.md's strict lint set)

What this does NOT do

  • No new wire protocol. Nothing changes for payers or for older nodes.
  • No immediate eviction. We feed the existing lazy-swap-out machinery; the offender stays in the routing table until a better candidate competes for the same slot.
  • No client-side reporting. Per Mick's direction, this is purely the node side.
  • Does not address the 35k/h dialing of 75.48.86.24 directly — that requires the ADD_ADDRESS reachability gate (separate PR in saorsa-transport, also in flight).

…-side audit)

Wires the existing detection paths in PaymentVerifier to saorsa-core's
TrustEvent system so provably-bad quote behaviour feeds Mick's lazy
swap-out (saorsa-core PR WithAutonomi#65) instead of being detected, rejected, then
forgotten.

Why
===

Prod fleet (24 ant-node v0.11.3 hosts, 3h47m window 2026-05-14 23:15 →
2026-05-15 03:02 UTC): 4 distinct nodes (do-ams3-node-2, do-fra1-node-1,
hz-hel1-2, hz-nbg1-3) reported BLAKE3 quote-binding mismatches. The
2026-05-06 attack: 5 distinct peer-IDs all signed quotes with one of 3
shared private keys (worst case: 3 peer-IDs share BLAKE3 key
5bd35c6c...), captured 5 of 15 close-K slots, forced quorum failure.

Today validate_peer_bindings (verifier.rs:590) and the signature-verify
loop (verifier.rs:466) both detect the bad behaviour and return Err but
neither calls report_trust_event, so the same offender reappears in the
next chunk's close-K and the cycle continues.

Per Mick's #05_client-side direction (2026-05-13): node-side quote audit,
not client-side trust reporting. PRs #114/WithAutonomi#90/WithAutonomi#77 (client side) were
closed accordingly. This PR adds the node-side wiring without introducing
a new audit protocol — the audit happens implicitly during payment
verification.

What
====

Three trust-event sites, each with its own weight constant:

- QUOTE_BINDING_MISMATCH_WEIGHT = 5.0 (max). BLAKE3 binding mismatch is
  deterministic and non-spoofable — payer cannot fabricate this against
  an innocent peer. Weight crosses BLOCK_THRESHOLD in one event so the
  offender becomes immediately ineligible for new admissions.

- QUOTE_SIGNATURE_FAILURE_WEIGHT = 1.0 (moderate). After binding has
  passed, signature failure proves only that the quote bytes do not
  verify under pub_key — does NOT prove the peer misbehaved. A malicious
  payer could flip a bit in quote.price after taking a valid quote,
  and the verifier would otherwise penalise the innocent peer at zero
  attacker cost. Lower weight degrades reputation under sustained
  patterns without single-event blocking.

- MERKLE_CANDIDATE_SIGNATURE_FAILURE_WEIGHT = 5.0 (max). Self-signed
  by the candidate node — no payer in between, attribution sound.

Per-proof dedup: a peer appearing in multiple slots produces at most one
trust event per proof, capping trust-engine write-lock cost at
O(distinct_offenders) regardless of attacker-controlled proof shape.

The 'Invalid ML-DSA public key' branch (undecodable bytes) deliberately
fires NO trust event — corrupt bytes cannot be attributed to any real
peer, and penalising a random peer-ID derived from BLAKE3 of garbage
would attack a non-existent identity.

Adversarial review (3 high-severity findings, all addressed)
============================================================

1. Test did not assert multi-offender iteration. Rewrote
   validate_peer_bindings_does_not_short_circuit_on_first_valid_quote
   to put a CORRECTLY-bound quote at position 0; if the loop short-
   circuits the test fails. Asserts the err names the first mismatched
   peer-id (proving iteration past position 0).

2. DoS amplification on multi-offender proofs. Added per-proof dedup
   so 16 occurrences of the same offender produce 1 trust event, not 16.

3. Signature-failure attribution can punish innocents (a payer can
   bit-flip a valid quote in transit). Split the weight constant: signature
   failures use 1.0, binding mismatches use 5.0.

Tests
=====

3 new tests, 463 lib tests pass, fmt clean,
cargo clippy --all-features --all-targets -- -D warnings clean,
cargo clippy --all-features --lib -- -D clippy::panic -D clippy::unwrap_used
  -D clippy::expect_used -D warnings clean.

Existing test_wrong_peer_binding_rejected continues to pass — same Err
semantics, trust events fire on the way to the same error.

Tooling notes
=============

- validate_peer_bindings is now async on &self (was static fn). Single
  caller updated.
- Signature-verify spawn_blocking now collects (peer_id, valid) results
  instead of bailing on first failure. Order preserved; first error
  returned matches pre-patch behaviour.
- Same shape applied to verify_merkle_payment's candidate-signature loop.
Copilot AI review requested due to automatic review settings May 15, 2026 06:42

Copilot AI left a comment

Pull request overview

Wires three node-side trust-event reports into PaymentVerifier so peers that ship provably-bad quotes (BLAKE3 binding mismatch, ML-DSA signature failure on a quote, or merkle-candidate signature failure) feed saorsa-core's lazy swap-out machinery instead of being silently rejected. Adds per-proof dedup, makes validate_peer_bindings async on &self, refactors the spawn_blocking signature path to collect per-quote results so the loop no longer short-circuits, and adds three unit tests around the binding path.

Changes:

  • New weight constants (QUOTE_BINDING_MISMATCH_WEIGHT = 5.0, QUOTE_SIGNATURE_FAILURE_WEIGHT = 1.0, MERKLE_CANDIDATE_SIGNATURE_FAILURE_WEIGHT = 5.0) and a report_peer_failure helper that calls P2PNode::report_trust_event when attached.
  • validate_peer_bindings becomes async fn(&self, …), walks the full proof, dedups offenders, and reports binding mismatches; the signature-verify path collects (EncodedPeerId, bool) and reports failures with dedup; verify_merkle_payment does the same for candidate-signature failures.
  • Three new #[tokio::test]s covering the no-short-circuit shape, no-panic when P2PNode is unattached, and the all-valid pass-through case.

Comment thread src/payment/verifier.rs
Comment on lines 700 to 722
  if expected_peer_id.as_bytes() != encoded_peer_id.as_bytes() {
      let expected_hex = expected_peer_id.to_hex();
      let actual_hex = hex::encode(encoded_peer_id.as_bytes());
-     return Err(Error::Payment(format!(
-         "Quote pub_key does not belong to claimed peer {encoded_peer_id:?}: \
-          BLAKE3(pub_key) = {expected_hex}, peer_id = {actual_hex}"
-     )));
+     // Provably bad behaviour: penalise the peer who claimed
+     // this binding. Use the EncodedPeerId from the proof —
+     // that is the identity routing-table lookups will hit.
+     let offender_bytes = *encoded_peer_id.as_bytes();
+     if reported.insert(offender_bytes) {
+         let offender = PeerId::from_bytes(offender_bytes);
+         self.report_peer_failure(
+             &offender,
+             QUOTE_BINDING_MISMATCH_WEIGHT,
+             "BLAKE3 quote-binding mismatch",
+         )
+         .await;
+     }
+     if first_error.is_none() {
+         first_error = Some(Error::Payment(format!(
+             "Quote pub_key does not belong to claimed peer {encoded_peer_id:?}: \
+              BLAKE3(pub_key) = {expected_hex}, peer_id = {actual_hex}"
+         )));
+     }
  }
Comment thread src/payment/verifier.rs
Comment on lines +518 to 546
    // produces at most one trust event. See validate_peer_bindings
    // for the same rationale (cap write-lock cost on the trust
    // engine regardless of attacker-controlled proof shape).
    //
    // Signature failures use a lower weight than binding mismatches
    // (see [`QUOTE_SIGNATURE_FAILURE_WEIGHT`] doc): a malicious
    // payer can flip a bit in `quote.price` after taking a valid
    // quote from an honest peer, and the verifier would otherwise
    // penalise the innocent peer at zero attacker cost.
    let mut sig_error: Option<Error> = None;
    let mut sig_reported: std::collections::HashSet<[u8; 32]> =
        std::collections::HashSet::new();
    for (encoded_peer_id, valid) in sig_results {
        if !valid {
            let offender_bytes = *encoded_peer_id.as_bytes();
            if sig_reported.insert(offender_bytes) {
                let offender = PeerId::from_bytes(offender_bytes);
                self.report_peer_failure(
                    &offender,
                    QUOTE_SIGNATURE_FAILURE_WEIGHT,
                    "ML-DSA-65 signature verification failed",
                )
                .await;
            }
            if sig_error.is_none() {
                sig_error = Some(Error::Payment(format!(
                    "Quote ML-DSA-65 signature verification failed for peer {encoded_peer_id:?}"
                )));
            }
Comment thread src/payment/verifier.rs
Comment on lines 1259 to +1269
  if !crate::payment::verify_merkle_candidate_signature(candidate) {
-     return Err(Error::Payment(format!(
-         "Invalid ML-DSA-65 signature on merkle candidate node (reward: {})",
-         candidate.reward_address
-     )));
+     if let Ok(offender) = peer_id_from_public_key_bytes(&candidate.pub_key) {
+         if merkle_reported.insert(*offender.as_bytes()) {
+             self.report_peer_failure(
+                 &offender,
+                 MERKLE_CANDIDATE_SIGNATURE_FAILURE_WEIGHT,
+                 "merkle candidate ML-DSA-65 signature verification failed",
+             )
+             .await;
+         }
+     }
@grumbach grumbach marked this pull request as draft May 15, 2026 09:35
@grumbach
Collaborator Author

Closing: adversarial review found a design flaw. validate_peer_bindings attributes blame to encoded_peer_id, which is attacker-controlled proof bytes — a junk proof naming any honest peer + a random pub_key fires ApplicationFailure(5.0) against that peer at zero cost (no payment, no victim signature), before payment verification. This is a remote trust-poisoning primitive (downscore any peer ID below the swap threshold). Restarting from a new design: attribute only to bindings the quoted peer itself signed, and fire trust events only after on-chain payment verification.
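
The flaw described above can be reproduced in a toy model. Everything here is an assumption made for illustration: `toy_peer_id` uses the standard library's `DefaultHasher` as a non-cryptographic stand-in for BLAKE3, peer IDs are `u64` values, and `ProofSlot` and `blame_claimed_peer` are hypothetical names that mirror only the attribution logic, not the real types.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy stand-in for BLAKE3(pub_key). NOT cryptographic; illustration only.
fn toy_peer_id(pub_key: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    pub_key.hash(&mut h);
    h.finish()
}

/// One slot of a proof as submitted by the payer.
/// Both fields are attacker-controlled bytes.
struct ProofSlot {
    claimed_peer_id: u64,
    pub_key: Vec<u8>,
}

/// The flawed attribution: on a binding mismatch, blame whoever the proof names.
/// Returns the peer that would be penalised, if any.
fn blame_claimed_peer(slot: &ProofSlot) -> Option<u64> {
    if toy_peer_id(&slot.pub_key) != slot.claimed_peer_id {
        Some(slot.claimed_peer_id)
    } else {
        None
    }
}
```

A slot that names an honest peer's ID next to an unrelated key makes `blame_claimed_peer` return the honest peer, even though that peer never produced or signed anything in the proof. That is why the restart attributes blame only to bindings the quoted peer itself signed.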

@grumbach grumbach closed this May 15, 2026