XDR opt experiment #5300
Draft
dmkozh wants to merge 50 commits into
Replace xdrSha256(success) with streaming SHA256 calculation to avoid XDR re-serialization of InvokeHostFunctionSuccessPreImage. The return value and events are already available as XDR-encoded bytes, so we can hash them directly without round-trip serialization.
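A minimal sketch of the streaming idea, assuming the pre-image encodes as the return value followed by a variable-length event array (which XDR writes as a 4-byte big-endian count and then the elements). `ToyStreamingHash` is a hypothetical FNV-1a stand-in used only to keep the example self-contained; it is not SHA-256, and the function names are illustrative rather than the ones in this branch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Stand-in incremental hasher (64-bit FNV-1a). NOT SHA-256; it only models
// the add()/finish() shape of a streaming hash.
class ToyStreamingHash
{
    std::uint64_t mState = 0xcbf29ce484222325ULL;

  public:
    void
    add(std::uint8_t const* data, std::size_t len)
    {
        for (std::size_t i = 0; i < len; ++i)
        {
            mState ^= data[i];
            mState *= 0x100000001b3ULL;
        }
    }
    void add(Bytes const& b) { add(b.data(), b.size()); }
    std::uint64_t finish() const { return mState; }
};

// Big-endian encoding of an XDR unsigned int (used for the array count).
inline void
encodeBe32(std::uint32_t v, std::uint8_t out[4])
{
    out[0] = static_cast<std::uint8_t>(v >> 24);
    out[1] = static_cast<std::uint8_t>(v >> 16);
    out[2] = static_cast<std::uint8_t>(v >> 8);
    out[3] = static_cast<std::uint8_t>(v);
}

// Streaming path: feed the already-encoded pieces in wire order.
inline std::uint64_t
hashPreImageStreaming(Bytes const& returnValueXdr,
                      std::vector<Bytes> const& eventsXdr)
{
    ToyStreamingHash h;
    h.add(returnValueXdr);
    std::uint8_t count[4];
    encodeBe32(static_cast<std::uint32_t>(eventsXdr.size()), count);
    h.add(count, 4);
    for (auto const& e : eventsXdr)
        h.add(e);
    return h.finish();
}

// Reference path: materialize the whole encoding first, then hash once.
inline std::uint64_t
hashPreImageMaterialized(Bytes const& returnValueXdr,
                         std::vector<Bytes> const& eventsXdr)
{
    Bytes all(returnValueXdr);
    std::uint8_t count[4];
    encodeBe32(static_cast<std::uint32_t>(eventsXdr.size()), count);
    all.insert(all.end(), count, count + 4);
    for (auto const& e : eventsXdr)
        all.insert(all.end(), e.begin(), e.end());
    ToyStreamingHash h;
    h.add(all);
    return h.finish();
}
```

Both paths consume the same byte sequence, so they produce the same digest; the streaming one simply never builds the intermediate buffer.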
Adds parallel processing to transaction set handling:
1. Parallel TxFrame creation: Creates TxFrames from XDR envelopes in parallel during transaction set deserialization. Uses work-stealing via std::async with even distribution across available threads.
2. Parallel transaction validation: Validates transactions in parallel in txsAreValid() when there are 2+ transactions.
3. Hash precomputation: Precomputes content and full hashes before parallel operations to avoid race conditions.
4. Test coverage: Adds StreamingShaTest for InvokeHostFunctionSuccessPreImage verification.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
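The parallel validation step could look roughly like the sketch below: an even, chunked fan-out over `std::async` that falls back to a sequential scan for fewer than two items. `allValidParallel` and its parameters are hypothetical names, the predicate is assumed to be safe to call concurrently, and the sketch does not attempt any work stealing.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Validate every item, splitting the range evenly across numThreads tasks.
// The predicate must be thread-safe (e.g. read-only over precomputed hashes).
template <typename T, typename Pred>
bool
allValidParallel(std::vector<T> const& items, Pred isValid,
                 std::size_t numThreads)
{
    if (items.size() < 2 || numThreads < 2)
    {
        return std::all_of(items.begin(), items.end(), isValid);
    }
    std::size_t const chunk = (items.size() + numThreads - 1) / numThreads;
    std::vector<std::future<bool>> futures;
    for (std::size_t begin = 0; begin < items.size(); begin += chunk)
    {
        std::size_t const end = std::min(begin + chunk, items.size());
        futures.push_back(std::async(
            std::launch::async, [&items, &isValid, begin, end] {
                return std::all_of(items.begin() + begin,
                                   items.begin() + end, isValid);
            }));
    }
    // Wait for every task before returning, even after the first failure,
    // because the tasks hold references to locals of this function.
    bool ok = true;
    for (auto& f : futures)
    {
        ok = f.get() && ok;
    }
    return ok;
}
```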
Add sizeBytes field to ContractDataMapEntryT to cache the XDR serialized size of ledger entries. This avoids repeated xdr_size() calls during state updates, reducing CPU overhead in the hot path. Also adds Tracy zone to updateState() for profiling visibility. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
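As an illustration of the size-caching idea, the sketch below stores a size next to each entry so that replacing or erasing an entry adjusts a running total without re-measuring the old value. `Entry`, `MapEntry`, `SizeTrackingState` and `expensiveSize` are hypothetical stand-ins (the last one for `xdr_size()`), not the types in this branch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>

struct Entry
{
    std::string key;
    std::string value;
};

// Stand-in for xdr_size(): two length-prefixed, 4-byte padded strings.
inline std::size_t
expensiveSize(Entry const& e)
{
    auto pad4 = [](std::size_t n) { return (n + 3) / 4 * 4; };
    return 4 + pad4(e.key.size()) + 4 + pad4(e.value.size());
}

struct MapEntry
{
    Entry entry;
    std::size_t sizeBytes; // cached at insertion time
};

class SizeTrackingState
{
    std::unordered_map<std::string, MapEntry> mEntries;
    std::size_t mTotalBytes = 0;

  public:
    void
    upsert(Entry e)
    {
        std::size_t const newSize = expensiveSize(e); // once per write
        auto it = mEntries.find(e.key);
        if (it != mEntries.end())
        {
            // The old size comes from the cache, not from re-measuring.
            mTotalBytes -= it->second.sizeBytes;
            it->second = MapEntry{std::move(e), newSize};
        }
        else
        {
            std::string key = e.key;
            mEntries.emplace(std::move(key), MapEntry{std::move(e), newSize});
        }
        mTotalBytes += newSize;
    }

    void
    erase(std::string const& key)
    {
        auto it = mEntries.find(key);
        if (it != mEntries.end())
        {
            mTotalBytes -= it->second.sizeBytes;
            mEntries.erase(it);
        }
    }

    std::size_t totalBytes() const { return mTotalBytes; }
};
```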
During ledger close, three independent operations are now parallelized:
- addHotArchiveBatch (modifies mHotArchiveBucketList)
- addLiveBatch (modifies mLiveBucketList) - runs on main thread
- updateInMemorySorobanState (modifies mInMemorySorobanState)
These operations modify completely independent data structures and can safely run concurrently. Added getInMemorySorobanStateForUpdate() to allow direct access to mInMemorySorobanState during COMMITTING phase. This reduces ledger close latency by overlapping CPU-bound operations.
# Conflicts:
# src/ledger/LedgerManagerImpl.cpp
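The overlap described above can be sketched as two `std::async` tasks plus one operation on the calling thread. `LedgerCloseState` and `applyBatchesConcurrently` are hypothetical; the three vectors only stand in for the three independent structures, and the pattern is safe only because no task touches another task's data.

```cpp
#include <cassert>
#include <future>
#include <vector>

// Three containers standing in for the hot archive bucket list, the live
// bucket list and the in-memory Soroban state.
struct LedgerCloseState
{
    std::vector<int> hotArchive;
    std::vector<int> live;
    std::vector<int> inMemory;
};

inline void
applyBatchesConcurrently(LedgerCloseState& s, std::vector<int> const& hotBatch,
                         std::vector<int> const& liveBatch,
                         std::vector<int> const& sorobanBatch)
{
    auto hot = std::async(std::launch::async, [&s, &hotBatch] {
        s.hotArchive.insert(s.hotArchive.end(), hotBatch.begin(),
                            hotBatch.end());
    });
    auto mem = std::async(std::launch::async, [&s, &sorobanBatch] {
        s.inMemory.insert(s.inMemory.end(), sorobanBatch.begin(),
                          sorobanBatch.end());
    });
    // The live batch runs on the calling thread while the others proceed.
    s.live.insert(s.live.end(), liveBatch.begin(), liveBatch.end());
    // get() joins the workers and rethrows any exception they raised.
    hot.get();
    mem.get();
}
```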
-5ms for 6400 SAC transfers scenario
libsodium uses a portable C SHA256 implementation, missing SHA-NI hardware instructions available on Intel Xeon Platinum. OpenSSL automatically uses SHA-NI, providing 4.6x speedup for streaming add() (893ns->193ns/call) and 56% total SHA256 self-time reduction (3,744ms->1,659ms per 30s trace). Use opaque aligned storage for SHA256_CTX in the header to avoid naming conflict between OpenSSL's ::SHA256 function and stellar::SHA256 class. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…5ms/ledger) Run LiveBucketIndex construction on async worker thread in parallel with the put loop in mergeInMemory. Both read mergedEntries as const — fully independent. Tracy confirms full overlap: index future wait averages 2.2µs. finalizeLedgerTxnChanges drops from 164ms to 136ms per ledger. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When ledgerCloseMeta is null (meta tracking disabled), operate directly on the parent LTX in processFeesSeqNums and processPostTxSetApply instead of creating a child LTX per-transaction. The child LTX was only needed for getChanges() meta tracking. Saves ~41ms/ledger from eliminating ~10.6K child LTX create/commit cycles. Combined with experiment 011 (meta tracking), TPS improves from 10,688 to 12,736 (+19.2%). Also raises APPLY_LOAD_MAX_SAC_TPS_MAX_TPS from 12000 to 15000. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> # Conflicts: # docs/apply-load-max-sac-tps.cfg
In commitChangesToLedgerTxn, determining whether an entry is INIT (new) vs LIVE (existing) required calling mInMemorySorobanState.get() which computes sha256(xdr_to_opaque(key)) for every CONTRACT_DATA entry. With ~40K entries per ledger, this added ~16ms of SHA256 per ledger. Track existence via a bool mIsNew flag in ParallelApplyEntry, set when a TX creates an entry that didn't previously exist. This replaces the expensive SHA256-based existence check with a simple boolean. commitChangesToLedgerTxn: 72.6ms -> 44.2ms (-39%) TPS: 16,640 -> 16,960 (+1.9%) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> # Conflicts: # src/transactions/ParallelApplyUtils.cpp
Add move overloads for createWithoutLoading/updateWithoutLoading and ScopedLedgerEntryOpt::moveFromScope to eliminate two deep copies per entry when committing parallel apply state to LedgerTxn. Reduces commitChangesToLedgerTxn from 44ms to 39ms per ledger (-12.8%). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
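A small sketch of what a move overload buys, using a hypothetical `Store::create` in place of createWithoutLoading: the rvalue overload takes over the entry's heap buffer instead of duplicating it.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

struct Entry
{
    std::vector<std::uint8_t> payload;
};

class Store
{
    std::vector<Entry> mEntries;

  public:
    // Copy overload: deep-copies the payload.
    void create(Entry const& e) { mEntries.push_back(e); }
    // Move overload: steals the payload's buffer, no byte copy.
    void create(Entry&& e) { mEntries.push_back(std::move(e)); }

    Entry const& back() const { return mEntries.back(); }
};
```

Callers opt in with `store.create(std::move(entry))`; the entry must not be read afterwards.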
Pre-load Soroban read-only entries (contract instance, code, TTL) into the global parallel apply state during setup, so per-TX lookups hit thread-local maps instead of traversing to InMemorySorobanState. Also cache protocol version and skip Soroban merge tracking in processFeesSeqNums, and use std::move for mLatestTxResultSet. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> # Conflicts: # docs/success/049-skip-child-ltx-processFeesSeqNums.md
Use a bitset instead of maps and relax invariants slightly. The impact is significant: -10ms apply time for SAC, -20ms apply time for soroswap.
Pre-compute expected entry counts from footprint sizes and call reserve() on ParallelApplyEntryMap containers before they accumulate entries. Eliminates log2(N) rehash operations during parallel apply, yielding -26% commitChangesFromThread and -27% commitChangesToLedgerTxn self-time. +576 TPS (+3.1%): 18,368 → 18,944 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> # Conflicts: # src/transactions/ParallelApplyUtils.cpp
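The effect of reserving up front can be observed directly with `std::unordered_map`. The helpers below are hypothetical: `expectedEntryCount` sums per-transaction footprint sizes, and `countRehashes` counts how often the bucket array changes while inserting.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <unordered_map>
#include <vector>

// Upper bound on entries: the sum of all footprint sizes.
inline std::size_t
expectedEntryCount(std::vector<std::size_t> const& footprintSizes)
{
    return std::accumulate(footprintSizes.begin(), footprintSizes.end(),
                           std::size_t{0});
}

// Insert n keys and count how many times the bucket array is rebuilt.
inline std::size_t
countRehashes(std::size_t n, bool reserveFirst)
{
    std::unordered_map<int, int> m;
    if (reserveFirst)
    {
        m.reserve(n); // room for n elements without rehashing
    }
    std::size_t rehashes = 0;
    std::size_t buckets = m.bucket_count();
    for (std::size_t i = 0; i < n; ++i)
    {
        m.emplace(static_cast<int>(i), 0);
        if (m.bucket_count() != buckets)
        {
            ++rehashes;
            buckets = m.bucket_count();
        }
    }
    return rehashes;
}
```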
resolveBackgroundEvictionScan previously received an UnorderedSet<LedgerKey> built by getAllKeysWithoutSealing() containing ~128K entries (~20ms to build), but only performed ~10-100 lookups. Added isModifiedKey() to LedgerTxn for direct O(1) lookups in the existing EntryMap, eliminating the set construction. resolveEviction zone: 20ms -> 0.116ms per ledger (99.4% reduction). TPS: 18,944 -> 19,328 avg (+2.0%). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
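In outline, the change replaces "copy all keys into a set, then probe it a few times" with "probe the existing map". The sketch uses a hypothetical `EntryMap` alias and string keys; only `isModifiedKey` mirrors a name from the commit, and its signature here is an assumption.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>

using EntryMap = std::unordered_map<std::string, int>;

// Old shape: O(N) construction paid up front regardless of lookup count.
inline std::unordered_set<std::string>
allKeys(EntryMap const& m)
{
    std::unordered_set<std::string> keys;
    keys.reserve(m.size());
    for (auto const& kv : m)
    {
        keys.insert(kv.first);
    }
    return keys;
}

// New shape: average O(1) probe into the map that already exists.
inline bool
isModifiedKey(EntryMap const& m, std::string const& key)
{
    return m.find(key) != m.end();
}
```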
Replace single global mutex + RandomEvictionCache with 16 sharded caches, each with its own mutex. This eliminates contention when 4 parallel threads verify signatures simultaneously. Also use maybeGet() instead of exists()+get() double-lookup, fix ZoneText string heap allocations, make counters atomic, and remove unused liveSnapshot copy in applySorobanStageClustersInParallel.
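A sharded cache along these lines might look like the following sketch. `ShardedCache` is a hypothetical class; it omits eviction entirely (the real cache is a RandomEvictionCache) and only shows per-shard locking plus a single-lookup `maybeGet`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

template <typename K, typename V, std::size_t NumShards = 16>
class ShardedCache
{
    struct Shard
    {
        std::mutex mutex;
        std::unordered_map<K, V> map;
    };
    std::array<Shard, NumShards> mShards;

    // Keys are spread over shards by hash, so threads working on different
    // keys usually lock different mutexes.
    Shard&
    shardFor(K const& k)
    {
        return mShards[std::hash<K>{}(k) % NumShards];
    }

  public:
    void
    put(K const& k, V v)
    {
        Shard& s = shardFor(k);
        std::lock_guard<std::mutex> guard(s.mutex);
        s.map[k] = std::move(v);
    }

    // One lookup under one lock, instead of exists() followed by get().
    std::optional<V>
    maybeGet(K const& k)
    {
        Shard& s = shardFor(k);
        std::lock_guard<std::mutex> guard(s.mutex);
        auto it = s.map.find(k);
        if (it == s.map.end())
        {
            return std::nullopt;
        }
        return it->second;
    }
};
```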
Sort lightweight 24-byte EntryRef structs (type tag + pointer) instead of full BucketEntry objects (200-500 bytes) in convertToBucketEntry. Reduces sort swap cost by ~12x and materializes final vector in one cache-friendly sequential pass. Cuts convertToBucketEntry from 31.9ms to 25.4ms per ledger. Benchmark: 13,760 -> 14,144 TPS (+384 TPS, +2.8%)
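The sort-by-reference technique can be sketched as follows. `BigEntry` and this `EntryRef` layout are hypothetical (the real EntryRef is described only as a type tag plus pointer), and the ordering by type and then key is an assumption made for the example.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <string>
#include <vector>

struct BigEntry
{
    int type;
    std::string key;
    std::array<char, 256> payload; // makes swaps of whole entries costly
};

// Small handle that is cheap to swap during the sort.
struct EntryRef
{
    int type;
    BigEntry const* ptr;
};

inline std::vector<BigEntry>
sortedCopy(std::vector<BigEntry> const& in)
{
    std::vector<EntryRef> refs;
    refs.reserve(in.size());
    for (auto const& e : in)
    {
        refs.push_back(EntryRef{e.type, &e});
    }
    std::sort(refs.begin(), refs.end(),
              [](EntryRef const& a, EntryRef const& b) {
                  if (a.type != b.type)
                  {
                      return a.type < b.type;
                  }
                  return a.ptr->key < b.ptr->key;
              });
    // Materialize the result once, in final order.
    std::vector<BigEntry> out;
    out.reserve(in.size());
    for (auto const& r : refs)
    {
        out.push_back(*r.ptr);
    }
    return out;
}
```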
Skip building LedgerTxnDelta in setEffectsDeltaFromSuccessfulTx when INVARIANT_CHECKS is empty. The delta is consumed exclusively by checkOnOperationApply which iterates an empty list when no invariants are configured. This eliminates ~285ms of shared_ptr allocations and entry copies across 4 worker threads per ledger. Benchmark: 12,736 -> 13,760 TPS (+1,024 TPS, +8.0%)
LedgerSnapshot was renamed to CheckValidLedgerViewWrapper and ApplyLedgerStateSnapshot to ApplyLedgerView in upstream's LedgerState refactor. Branch's parallel pre-apply paths used the old names; rename to match. ApplyLedgerView privately inherits from ImmutableLedgerView, so use executeWithMaybeInnerSnapshot to derive a CheckValidLedgerViewWrapper from it for the read-only pre-apply paths.
The previous adaptation used ApplyLedgerView::executeWithMaybeInnerSnapshot to derive a CheckValidLedgerViewWrapper, but ImmutableLedgerView (and therefore ApplyLedgerView via using-declaration) explicitly throws on that call. Instead, add a narrow accessor that hands out the underlying ImmutableLedgerView and use the existing CheckValidLedgerViewWrapper(ImmutableLedgerView const&) constructor.
The branch's parallel TxFrame creation paths only checked XDRProvidesValidFee() but missed the getInclusionFee() <= 0 check that upstream added in the sequential equivalents. Restore parity so generalized tx sets with negative-fee txs are rejected during construction.
- BucketBase.cpp: use auto const& instead of auto when retrieving entries from input iterators in mergeCasesWithDefaultAcceptance, avoiding one full BucketEntry deep copy per entry in the dominant merge path.
- TxSetFrame.cpp: add missing std::move when initializing mStages from a named rvalue reference parameter, which silently copied the entire nested vector structure.
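Both pitfalls are easy to reproduce with a type that counts its copies. Everything below (`Counted`, `gCopies`, `StagesCopied`, `StagesMoved`, the two sum helpers) is hypothetical scaffolding for the demonstration, not code from BucketBase.cpp or TxSetFrame.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

static int gCopies = 0; // incremented by every Counted copy construction

struct Counted
{
    std::vector<int> data;
    Counted() = default;
    Counted(Counted const& o) : data(o.data) { ++gCopies; }
    Counted(Counted&&) noexcept = default;
    Counted& operator=(Counted const&) = default;
    Counted& operator=(Counted&&) noexcept = default;
};

// A named rvalue reference is an lvalue: without std::move this copies.
struct StagesCopied
{
    std::vector<Counted> mStages;
    explicit StagesCopied(std::vector<Counted>&& stages) : mStages(stages) {}
};

struct StagesMoved
{
    std::vector<Counted> mStages;
    explicit StagesMoved(std::vector<Counted>&& stages)
        : mStages(std::move(stages))
    {
    }
};

// `auto e = *it` copies each element; `auto const& e = *it` binds to it.
inline std::size_t
sumSizesByValue(std::vector<Counted> const& v)
{
    std::size_t n = 0;
    for (auto it = v.begin(); it != v.end(); ++it)
    {
        auto e = *it;
        n += e.data.size();
    }
    return n;
}

inline std::size_t
sumSizesByRef(std::vector<Counted> const& v)
{
    std::size_t n = 0;
    for (auto it = v.begin(); it != v.end(); ++it)
    {
        auto const& e = *it;
        n += e.data.size();
    }
    return n;
}
```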
- writeOne now attempts an optimistic single-pass serialize when the buffer is already large enough, skipping the xdr_size() pre-traversal. Falls back to the two-pass path (with 2x headroom growth) on overflow. After warm-up, eliminates one full XDR tree traversal per entry written.
- Add writeRaw() method that writes pre-serialized framed record bytes directly to the output stream, supporting SHA-256 hashing and byte counting. This enables raw byte passthrough in bucket merge.
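The optimistic path can be sketched with a toy length-prefixed encoder. `Record`, `tryEncode`, `encodedSize` and `Writer` are hypothetical; a real encoder would detect overflow while writing, whereas the toy checks the bound up front for brevity. `sizePasses` counts how often the fallback size traversal runs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Record
{
    std::vector<std::uint8_t> payload;
};

// Stand-in for the xdr_size() pre-traversal.
inline std::size_t
encodedSize(Record const& r)
{
    return 4 + (r.payload.size() + 3) / 4 * 4;
}

// Encode as big-endian length + payload + zero padding to 4 bytes.
// Returns false without writing if the record does not fit in cap bytes.
inline bool
tryEncode(Record const& r, std::uint8_t* buf, std::size_t cap,
          std::size_t& written)
{
    std::size_t const len = r.payload.size();
    std::size_t const need = 4 + (len + 3) / 4 * 4;
    if (need > cap)
    {
        return false;
    }
    buf[0] = static_cast<std::uint8_t>(len >> 24);
    buf[1] = static_cast<std::uint8_t>(len >> 16);
    buf[2] = static_cast<std::uint8_t>(len >> 8);
    buf[3] = static_cast<std::uint8_t>(len);
    std::copy(r.payload.begin(), r.payload.end(), buf + 4);
    std::fill(buf + 4 + len, buf + need, std::uint8_t{0});
    written = need;
    return true;
}

class Writer
{
    std::vector<std::uint8_t> mBuf; // reused scratch buffer

  public:
    std::vector<std::uint8_t> out;  // stands in for the output stream
    std::size_t sizePasses = 0;     // fallback size traversals performed

    void
    writeOne(Record const& r)
    {
        std::size_t written = 0;
        // Optimistic: try the existing buffer first.
        if (!tryEncode(r, mBuf.data(), mBuf.size(), written))
        {
            // Fallback: measure, grow with 2x headroom, encode again.
            ++sizePasses;
            mBuf.resize(encodedSize(r) * 2);
            bool const ok = tryEncode(r, mBuf.data(), mBuf.size(), written);
            assert(ok);
            (void)ok;
        }
        out.insert(out.end(), mBuf.begin(), mBuf.begin() + written);
    }
};
```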
During bucket merges, entries that pass through unchanged (not merged with a counterpart from the other input) are read, deserialized, then re-serialized on output. This commit adds infrastructure to retain the raw framed bytes from XDRInputFileStream::readOne() and pass them directly to the output stream via writeRaw(), bypassing serialization.
Key changes:
- XDRInputFileStream captures raw bytes during readOne() via mRawBuf
- BucketInputIterator exposes moveRawBytes() to retrieve raw entry bytes
- BucketOutputIterator gains putWithRaw() that buffers raw bytes alongside the deserialized entry, using writeRaw() on flush instead of writeOne()
- BucketBase merge lambda uses address comparison to detect unmodified entries and routes them through the raw passthrough path
opaque_vec<N> is now xdr::shared_bytes<N> (a refcounted, copy-on-write
byte container) rather than std::vector<uint8_t>. A handful of call
sites assumed the std::vector representation and need adjustment:
- ByteSlice gains a templated ctor for shared_bytes<N>, matching the
existing opaque_array ctor.
- KeyFunctions<{PublicKey,SignerKey}>::getKeyValue construct std::vector
explicitly from the (now shared_bytes) xdr_to_opaque result.
- toCxxBuf builds the unique_ptr<vector<uint8_t>> from iterators since
the Rust bridge's CxxBuf type still wraps a std::vector.
- SharedModuleCacheCompiler stops using xvector<uint8_t> directly; switch
to opaque_vec<> so its API tracks the typedef.
- StreamingShaTest: same.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
readOne now reads each framed record into a per-record std::shared_ptr<uint8_t[]> buffer and decodes through the shared-buffer aware xdr_get archive. shared_bytes fields in the decoded value (opaque_vec, contract code, SCBytes payloads, etc.) become refcounted views into the read buffer instead of getting freshly allocated + memcpied from it.
For records with a large opaque<> payload (Wasm uploads, contract data entries, large LedgerCloseMeta segments) this cuts the decode cost from "1 alloc + payload memcpy per opaque<> field" to "1 refcount bump + 16 bytes of bookkeeping per field."
Buffer management:
- If no outstanding references remain (mReadBuf.use_count() == 1) and the existing capacity is sufficient, the buffer is reused in place.
- If the existing capacity is too small and the buffer is uniquely owned, it's reallocated with amortized doubling.
- If outstanding shared_bytes views still reference the previous buffer (use_count > 1), a fresh buffer is allocated sized exactly to this record's payload so the old buffer can be released as soon as its dependents drop, avoiding capacity-pinning of large prior records by accumulated small ones (e.g. bucket index build).
XDR-framed records are required to be 4-byte aligned by RFC 4506, so the xdr_get end-pointer alignment invariant is satisfied directly by \c sz and we can call \c g.done() like the previous implementation.
Verified: full [bucket] suite (47 cases, 1.79M assertions) and full [soroban] suite (113 cases, 3.54M assertions) pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
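The three buffer-management cases can be modeled with a small class keyed on `use_count()`. `RecordBuffer` and `acquire` are hypothetical names; the handle returned to the caller plays the role of the shared_bytes views that may outlive a read. The sketch assumes C++17 for `std::shared_ptr<T[]>`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

class RecordBuffer
{
    std::shared_ptr<std::uint8_t[]> mBuf;
    std::size_t mCap = 0;

  public:
    // Return a buffer with room for sz bytes. The caller's copy of the
    // shared_ptr models views that keep the buffer alive after the read.
    std::shared_ptr<std::uint8_t[]>
    acquire(std::size_t sz)
    {
        bool const uniquelyOwned = !mBuf || mBuf.use_count() == 1;
        if (mBuf && uniquelyOwned && mCap >= sz)
        {
            // Case 1: nobody else holds it and it is big enough; reuse.
        }
        else if (uniquelyOwned)
        {
            // Case 2: too small but unshared; grow with doubling.
            std::size_t const newCap = std::max(sz, mCap * 2);
            mBuf.reset(new std::uint8_t[newCap]);
            mCap = newCap;
        }
        else
        {
            // Case 3: still referenced elsewhere; allocate exactly sz so a
            // small record never pins a large old buffer.
            mBuf.reset(new std::uint8_t[sz]);
            mCap = sz;
        }
        return mBuf;
    }

    std::size_t capacity() const { return mCap; }
};
```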
Two changes on the Soroban host ↔ C++ XDR boundary:
1. toCxxBuf encodes directly into the destination std::vector<uint8_t> via the new xdr_to_vector helper. Previously the typedef flip from xvector<uint8_t> to xdr::shared_bytes left toCxxBuf paying an extra shared_bytes allocation + element-by-element copy on every encode, doubling the alloc/memcpy cost on the hot host-input encode path (LedgerEntry buffers for footprint, auth entries, host function arg, source ID, resources). The new path is a single resize + xdr encode pass into the vector that ends up in the CxxBuf.
2. The four xdr_from_opaque sites in InvokeHostFunctionOpFrame's output decoding (modified ledger entries, contract events, diagnostic events, host function return value) now go through xdrFromHostBytes, which wraps the rust::Vec<uint8_t> payload in a shared_ptr<uint8_t[]> once and decodes through the shared-buffer xdr_get. opaque<> fields in the decoded value alias into that single wrap buffer rather than each allocating their own. For LedgerEntries that include large Wasm/contract code blobs this collapses N per-field allocations into 1 wrap allocation.
xdrFromHostBytes lives in TransactionUtils.h alongside toCxxBuf.
Verified: full [bucket] (47 cases, 1.79M assertions) and full [soroban] (113 cases, 3.54M assertions) suites pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed_ptr alias)
xdrFromHostBytes used to allocate a fresh \c shared_ptr<uint8_t[]> and memcpy the host-returned \c rust::Vec<uint8_t> payload into it before handing the decoder a shared backing buffer. Since the host-side buffer is dropped immediately after decode, the copy is wasted; the helper now moves \c rust::Vec into a heap-allocated holder and hands the decoder a \c shared_ptr aliased to the holder's bytes via the aliasing constructor. The decoded value's \c shared_bytes fields share the holder. Holder (and the Rust allocation it owns) is freed when the last refcount drops.
For an N-byte host-returned payload this turns the wrap from "1 alloc + 1 memcpy of N" into "1 small alloc for the holder + control block (~48 B)" with no payload copy at all.
Call sites in InvokeHostFunctionOpFrame switch to non-const iteration and move-in:
- maybePopulateOutputDiagnosticEvents
- recordStorageChanges
- collectEvents
- finalizeSuccess
recordStorageChanges captures the encoded \c entrySize before moving \c buf.data into the decoder (it was previously read twice from buf.data.size(), once for the validation call and once for the metric counter; without the capture, both reads would happen after the move).
The streaming SHA256 over the InvokeHostFunctionSuccessPreImage encoding used to run in finalizeSuccess and read the original host-returned \c out.result_value.data + \c out.contract_events[].data bytes. Those bytes are now moved into the decoder by collectEvents/finalizeSuccess; we therefore hoist the hash up: it now runs in collectEvents before any moves and the result is stored in \c mSuccessPreImageHash for finalizeSuccess to consume. Same byte sequence, same hash, no extra buffer materialized.
Verified: [ledgerclosemeta] suite passes; full \c make check is otherwise green (the day-seed-flaky binary-fuse-error-retry test was confirmed pre-existing in an earlier bisect).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
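The aliasing-constructor trick is standard `std::shared_ptr` behavior and can be shown in isolation. In this sketch `Holder` and `wrapWithoutCopy` are hypothetical, and a `std::vector` stands in for the `rust::Vec` that the real helper moves into its holder.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Owns the moved-in payload; freed when the last alias goes away.
struct Holder
{
    std::vector<std::uint8_t> bytes;
};

// Wrap a payload in shared ownership without copying its bytes.
inline std::shared_ptr<std::uint8_t const>
wrapWithoutCopy(std::vector<std::uint8_t>&& payload)
{
    // Any size needed later must be read before this move.
    auto holder = std::make_shared<Holder>(Holder{std::move(payload)});
    // Aliasing constructor: shares holder's control block, but get()
    // returns the address of the payload bytes.
    return std::shared_ptr<std::uint8_t const>(holder, holder->bytes.data());
}
```

Every copy of the returned pointer keeps the holder alive, so views into the payload stay valid until the last one is dropped.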
No description provided.