XDR opt experiment by dmkozh · Pull Request #5300 · stellar/stellar-core

dmkozh · 2026-05-29T18:33:05Z

No description provided.

Replace xdrSha256(success) with streaming SHA256 calculation to avoid XDR re-serialization of InvokeHostFunctionSuccessPreImage. The return value and events are already available as XDR-encoded bytes, so we can hash them directly without round-trip serialization.
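The idea of hashing already-encoded pieces incrementally, rather than re-serializing a combined structure first, can be sketched as follows. This is a minimal illustration and not the code from this PR: `StreamingHasher`, `hashConcatenated` and `hashStreaming` are hypothetical names, and 64-bit FNV-1a stands in for SHA-256 so the example has no dependencies. It also ignores the XDR length prefixes and padding that a real pre-image encoding would have to feed to the hasher.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a streaming SHA-256: 64-bit FNV-1a.
class StreamingHasher
{
    std::uint64_t mState = 14695981039346656037ULL; // FNV offset basis

  public:
    void
    add(std::vector<std::uint8_t> const& bytes)
    {
        for (std::uint8_t b : bytes)
        {
            mState ^= b;
            mState *= 1099511628211ULL; // FNV prime
        }
    }

    std::uint64_t
    finish() const
    {
        return mState;
    }
};

// Round-trip style: materialize one combined buffer, then hash it.
std::uint64_t
hashConcatenated(std::vector<std::uint8_t> const& returnValue,
                 std::vector<std::vector<std::uint8_t>> const& events)
{
    std::vector<std::uint8_t> all(returnValue);
    for (auto const& e : events)
    {
        all.insert(all.end(), e.begin(), e.end());
    }
    StreamingHasher h;
    h.add(all);
    return h.finish();
}

// Streaming style: feed each already-encoded piece directly, no temp buffer.
std::uint64_t
hashStreaming(std::vector<std::uint8_t> const& returnValue,
              std::vector<std::vector<std::uint8_t>> const& events)
{
    StreamingHasher h;
    h.add(returnValue);
    for (auto const& e : events)
    {
        h.add(e);
    }
    return h.finish();
}
```

Both functions see the same byte sequence, so they produce the same digest; only the streaming variant avoids the intermediate allocation and copy.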

Adds parallel processing to transaction set handling:
1. Parallel TxFrame creation: Creates TxFrames from XDR envelopes in parallel during transaction set deserialization. Uses work-stealing via std::async with even distribution across available threads.
2. Parallel transaction validation: Validates transactions in parallel in txsAreValid() when there are 2+ transactions.
3. Hash precomputation: Precomputes content and full hashes before parallel operations to avoid race conditions.
4. Test coverage: Adds StreamingShaTest for InvokeHostFunctionSuccessPreImage verification.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
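A rough sketch of the "even distribution via std::async" pattern for building frames from envelopes is shown below. The types `Envelope` and `Frame` and the functions `makeFrame` and `makeFramesInParallel` are hypothetical stand-ins, not the stellar-core API. Each task writes only to its own disjoint slice of the output vector, which is what makes the unsynchronized writes safe in this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-ins for an XDR envelope and the frame built from it.
using Envelope = std::string;
struct Frame
{
    std::size_t size = 0;
};

Frame
makeFrame(Envelope const& e)
{
    return Frame{e.size()};
}

std::vector<Frame>
makeFramesInParallel(std::vector<Envelope> const& envs, std::size_t nThreads)
{
    std::vector<Frame> out(envs.size());
    if (envs.empty())
    {
        return out;
    }
    // Never start more tasks than there are items, and at least one.
    nThreads = std::max<std::size_t>(1, std::min(nThreads, envs.size()));
    std::size_t chunk = (envs.size() + nThreads - 1) / nThreads;

    std::vector<std::future<void>> futures;
    for (std::size_t begin = 0; begin < envs.size(); begin += chunk)
    {
        std::size_t end = std::min(begin + chunk, envs.size());
        futures.push_back(
            std::async(std::launch::async, [&envs, &out, begin, end] {
                // Each task owns the index range [begin, end) exclusively.
                for (std::size_t i = begin; i < end; ++i)
                {
                    out[i] = makeFrame(envs[i]);
                }
            }));
    }
    for (auto& f : futures)
    {
        f.get(); // propagates any exception thrown inside a task
    }
    return out;
}
```

Output order matches input order because results are written by index rather than appended as tasks finish.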

Add sizeBytes field to ContractDataMapEntryT to cache the XDR serialized size of ledger entries. This avoids repeated xdr_size() calls during state updates, reducing CPU overhead in the hot path. Also adds Tracy zone to updateState() for profiling visibility. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
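The size-caching idea can be illustrated with a small sketch. Everything here is hypothetical: `CachedEntry`, `StateSizeTracker` and `serializedSize` are illustrative names, and the size function simply models an XDR variable-length opaque (4-byte length plus payload padded to a multiple of 4) while counting how often it is called.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Counts calls to the "expensive" size computation, for demonstration only.
static int gSizeCalls = 0;

// Hypothetical stand-in for an XDR size traversal.
std::size_t
serializedSize(std::string const& payload)
{
    ++gSizeCalls;
    return 4 + (payload.size() + 3) / 4 * 4;
}

struct CachedEntry
{
    std::string payload;
    std::size_t sizeBytes; // computed once, when the entry is stored

    explicit CachedEntry(std::string p)
        : payload(std::move(p)), sizeBytes(serializedSize(payload))
    {
    }
};

// State updates read the cached field instead of re-measuring the entry.
struct StateSizeTracker
{
    std::size_t total = 0;

    void
    add(CachedEntry const& e)
    {
        total += e.sizeBytes;
    }

    void
    remove(CachedEntry const& e)
    {
        assert(total >= e.sizeBytes);
        total -= e.sizeBytes;
    }
};
```

The trade-off is one extra word per entry in exchange for removing a full traversal from every add and remove on the hot path.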

During ledger close, three independent operations are now parallelized:
- addHotArchiveBatch (modifies mHotArchiveBucketList)
- addLiveBatch (modifies mLiveBucketList), runs on main thread
- updateInMemorySorobanState (modifies mInMemorySorobanState)
These operations modify completely independent data structures and can safely run concurrently. Added getInMemorySorobanStateForUpdate() to allow direct access to mInMemorySorobanState during COMMITTING phase. This reduces ledger close latency by overlapping CPU-bound operations.
# Conflicts:
# src/ledger/LedgerManagerImpl.cpp

-5ms for 6400 SAC transfers scenario

libsodium uses a portable C SHA256 implementation, missing SHA-NI hardware instructions available on Intel Xeon Platinum. OpenSSL automatically uses SHA-NI, providing 4.6x speedup for streaming add() (893ns->193ns/call) and 56% total SHA256 self-time reduction (3,744ms->1,659ms per 30s trace). Use opaque aligned storage for SHA256_CTX in the header to avoid naming conflict between OpenSSL's ::SHA256 function and stellar::SHA256 class. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
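The "opaque aligned storage" technique mentioned above can be sketched in a single file as follows. This is an assumption-laden illustration, not the PR's code: `Hasher`, `thirdparty::Ctx`, the 128-byte bound and the toy hash are all made up, and no OpenSSL calls are used. The point is only that the header-side class declares raw bytes, so the third-party header and its global names stay confined to the source file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// "Header" part: exposes only raw storage, so no third-party header (and none
// of its global names) leaks into files that include this class.
class Hasher
{
    static constexpr std::size_t kStorageSize = 128; // assumed upper bound
    alignas(16) unsigned char mStorage[kStorageSize];

  public:
    Hasher();
    ~Hasher();
    Hasher(Hasher const&) = delete;
    Hasher& operator=(Hasher const&) = delete;
    void add(unsigned char const* data, std::size_t n);
    std::uint64_t finish() const;
};

// "Source" part: the only place that sees the concrete context type.
namespace thirdparty
{
struct Ctx // hypothetical stand-in for a library context type
{
    std::uint64_t state = 0;
};
}

Hasher::Hasher()
{
    static_assert(sizeof(thirdparty::Ctx) <= kStorageSize, "storage too small");
    static_assert(alignof(thirdparty::Ctx) <= 16, "storage under-aligned");
    new (mStorage) thirdparty::Ctx(); // construct the context in place
}

Hasher::~Hasher()
{
    std::launder(reinterpret_cast<thirdparty::Ctx*>(mStorage))->~Ctx();
}

void
Hasher::add(unsigned char const* data, std::size_t n)
{
    auto* ctx = std::launder(reinterpret_cast<thirdparty::Ctx*>(mStorage));
    for (std::size_t i = 0; i < n; ++i)
    {
        ctx->state = ctx->state * 31 + data[i]; // toy hash, not SHA-256
    }
}

std::uint64_t
Hasher::finish() const
{
    return std::launder(reinterpret_cast<thirdparty::Ctx const*>(mStorage))
        ->state;
}
```

The static_asserts are what keep the size and alignment guesses honest: if the library's context type grows, the build fails instead of corrupting memory.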

…5ms/ledger) Run LiveBucketIndex construction on async worker thread in parallel with the put loop in mergeInMemory. Both read mergedEntries as const — fully independent. Tracy confirms full overlap: index future wait averages 2.2µs. finalizeLedgerTxnChanges drops from 164ms to 136ms per ledger. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When ledgerCloseMeta is null (meta tracking disabled), operate directly on the parent LTX in processFeesSeqNums and processPostTxSetApply instead of creating a child LTX per-transaction. The child LTX was only needed for getChanges() meta tracking. Saves ~41ms/ledger from eliminating ~10.6K child LTX create/commit cycles. Combined with experiment 011 (meta tracking), TPS improves from 10,688 to 12,736 (+19.2%). Also raises APPLY_LOAD_MAX_SAC_TPS_MAX_TPS from 12000 to 15000. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> # Conflicts: # docs/apply-load-max-sac-tps.cfg

In commitChangesToLedgerTxn, determining whether an entry is INIT (new) vs LIVE (existing) required calling mInMemorySorobanState.get() which computes sha256(xdr_to_opaque(key)) for every CONTRACT_DATA entry. With ~40K entries per ledger, this added ~16ms of SHA256 per ledger. Track existence via a bool mIsNew flag in ParallelApplyEntry, set when a TX creates an entry that didn't previously exist. This replaces the expensive SHA256-based existence check with a simple boolean.
commitChangesToLedgerTxn: 72.6ms -> 44.2ms (-39%)
TPS: 16,640 -> 16,960 (+1.9%)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
# src/transactions/ParallelApplyUtils.cpp

Add move overloads for createWithoutLoading/updateWithoutLoading and ScopedLedgerEntryOpt::moveFromScope to eliminate two deep copies per entry when committing parallel apply state to LedgerTxn. Reduces commitChangesToLedgerTxn from 44ms to 39ms per ledger (-12.8%). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Pre-load Soroban read-only entries (contract instance, code, TTL) into the global parallel apply state during setup, so per-TX lookups hit thread-local maps instead of traversing to InMemorySorobanState. Also cache protocol version and skip Soroban merge tracking in processFeesSeqNums, and use std::move for mLatestTxResultSet. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> # Conflicts: # docs/success/049-skip-child-ltx-processFeesSeqNums.md

Use bitset instead of maps and relax invariants a bit. This is pretty impactful - -10ms apply time for SAC, -20ms apply time for soroswap

Pre-compute expected entry counts from footprint sizes and call reserve() on ParallelApplyEntryMap containers before they accumulate entries. Eliminates log2(N) rehash operations during parallel apply, yielding -26% commitChangesFromThread and -27% commitChangesToLedgerTxn self-time. +576 TPS (+3.1%): 18,368 → 18,944 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> # Conflicts: # src/transactions/ParallelApplyUtils.cpp
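The effect of reserving from footprint sizes can be demonstrated on a plain `std::unordered_map`. The sketch below is illustrative only: `Footprint`, `expectedEntryCount` and `fillAndCountRehashes` are hypothetical names, and the real `ParallelApplyEntryMap` may be a different container. It counts rehashes by watching `bucket_count()` change during insertion.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified footprint: just lists of keys.
struct Footprint
{
    std::vector<std::string> readOnly;
    std::vector<std::string> readWrite;
};

// Upper bound on the number of entries the map will hold.
std::size_t
expectedEntryCount(std::vector<Footprint> const& fps)
{
    std::size_t n = 0;
    for (auto const& fp : fps)
    {
        n += fp.readOnly.size() + fp.readWrite.size();
    }
    return n;
}

// Inserts every footprint key and returns how many times the bucket array
// was rebuilt along the way.
std::size_t
fillAndCountRehashes(std::vector<Footprint> const& fps, bool reserveFirst)
{
    std::unordered_map<std::string, int> entries;
    if (reserveFirst)
    {
        entries.reserve(expectedEntryCount(fps));
    }
    std::size_t rehashes = 0;
    std::size_t buckets = entries.bucket_count();
    auto insertKey = [&](std::string const& key) {
        entries.emplace(key, 0);
        if (entries.bucket_count() != buckets)
        {
            ++rehashes;
            buckets = entries.bucket_count();
        }
    };
    for (auto const& fp : fps)
    {
        for (auto const& k : fp.readOnly)
        {
            insertKey(k);
        }
        for (auto const& k : fp.readWrite)
        {
            insertKey(k);
        }
    }
    return rehashes;
}
```

Because `reserve(n)` guarantees room for `n` elements without exceeding the maximum load factor, the reserved variant performs no rehash while inserting, even when some keys repeat across footprints.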

resolveBackgroundEvictionScan previously received an UnorderedSet<LedgerKey> built by getAllKeysWithoutSealing() containing ~128K entries (~20ms to build), but only performed ~10-100 lookups. Added isModifiedKey() to LedgerTxn for direct O(1) lookups in the existing EntryMap, eliminating the set construction. resolveEviction zone: 20ms -> 0.116ms per ledger (99.4% reduction). TPS: 18,944 -> 19,328 avg (+2.0%). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace single global mutex + RandomEvictionCache with 16 sharded caches, each with its own mutex. This eliminates contention when 4 parallel threads verify signatures simultaneously. Also use maybeGet() instead of exists()+get() double-lookup, fix ZoneText string heap allocations, make counters atomic, and remove unused liveSnapshot copy in applySorobanStageClustersInParallel.
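A minimal sketch of the sharding idea follows, assuming a plain hash map per shard. `ShardedCache` is a hypothetical name, and unlike the RandomEvictionCache mentioned above this sketch has no size limit or eviction. It shows the two points the commit message makes: one mutex per shard, and a single-lookup `maybeGet()` in place of `exists()` followed by `get()`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

template <typename K, typename V, std::size_t NumShards = 16>
class ShardedCache
{
    struct Shard
    {
        std::mutex mutex;
        std::unordered_map<K, V> map;
    };
    std::array<Shard, NumShards> mShards;

    // Keys are spread over shards by hash, so threads working on different
    // keys usually take different locks.
    Shard&
    shardFor(K const& k)
    {
        return mShards[std::hash<K>{}(k) % NumShards];
    }

  public:
    void
    put(K const& k, V v)
    {
        Shard& s = shardFor(k);
        std::lock_guard<std::mutex> guard(s.mutex);
        s.map[k] = std::move(v);
    }

    // One lookup under one lock, instead of exists() followed by get().
    std::optional<V>
    maybeGet(K const& k)
    {
        Shard& s = shardFor(k);
        std::lock_guard<std::mutex> guard(s.mutex);
        auto it = s.map.find(k);
        if (it == s.map.end())
        {
            return std::nullopt;
        }
        return it->second;
    }
};
```

Returning the value by copy keeps the lock scope short; for large values a shared pointer would be a more suitable mapped type.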

Sort lightweight 24-byte EntryRef structs (type tag + pointer) instead of full BucketEntry objects (200-500 bytes) in convertToBucketEntry. Reduces sort swap cost by ~12x and materializes final vector in one cache-friendly sequential pass. Cuts convertToBucketEntry from 31.9ms to 25.4ms per ledger. Benchmark: 13,760 -> 14,144 TPS (+384 TPS, +2.8%)
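The sort-by-reference pattern can be sketched as below. The types are simplified stand-ins: `BigEntry` plays the role of a large BucketEntry, `EntryKind` is a two-value tag, and the `EntryRef` here is smaller than the 24-byte struct described above. The sort moves only small tag-plus-pointer structs, and the full objects are copied exactly once, in final order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

enum class EntryKind : std::uint8_t
{
    Init,
    Live
};

// Hypothetical large entry: expensive to swap during a sort.
struct BigEntry
{
    std::uint32_t key;
    std::string payload;
};

// Lightweight handle that the sort actually moves around.
struct EntryRef
{
    EntryKind kind;
    BigEntry const* entry;
};

struct TaggedEntry
{
    EntryKind kind;
    BigEntry entry;
};

std::vector<TaggedEntry>
sortedEntries(std::vector<BigEntry> const& init,
              std::vector<BigEntry> const& live)
{
    std::vector<EntryRef> refs;
    refs.reserve(init.size() + live.size());
    for (auto const& e : init)
    {
        refs.push_back(EntryRef{EntryKind::Init, &e});
    }
    for (auto const& e : live)
    {
        refs.push_back(EntryRef{EntryKind::Live, &e});
    }

    // Swaps here touch only the small EntryRef structs.
    std::sort(refs.begin(), refs.end(),
              [](EntryRef const& a, EntryRef const& b) {
                  return a.entry->key < b.entry->key;
              });

    // Materialize the result in one sequential pass.
    std::vector<TaggedEntry> out;
    out.reserve(refs.size());
    for (auto const& r : refs)
    {
        out.push_back(TaggedEntry{r.kind, *r.entry});
    }
    return out;
}
```

The input vectors must outlive the call, since the refs point into them until the final pass completes.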

Skip building LedgerTxnDelta in setEffectsDeltaFromSuccessfulTx when INVARIANT_CHECKS is empty. The delta is consumed exclusively by checkOnOperationApply which iterates an empty list when no invariants are configured. This eliminates ~285ms of shared_ptr allocations and entry copies across 4 worker threads per ledger. Benchmark: 12,736 -> 13,760 TPS (+1,024 TPS, +8.0%)

…ol version

LedgerSnapshot was renamed to CheckValidLedgerViewWrapper and ApplyLedgerStateSnapshot to ApplyLedgerView in upstream's LedgerState refactor. Branch's parallel pre-apply paths used the old names; rename to match. ApplyLedgerView privately inherits from ImmutableLedgerView, so use executeWithMaybeInnerSnapshot to derive a CheckValidLedgerViewWrapper from it for the read-only pre-apply paths.

The previous adaptation used ApplyLedgerView::executeWithMaybeInnerSnapshot to derive a CheckValidLedgerViewWrapper, but ImmutableLedgerView (and therefore ApplyLedgerView via using-declaration) explicitly throws on that call. Instead, add a narrow accessor that hands out the underlying ImmutableLedgerView and use the existing CheckValidLedgerViewWrapper(ImmutableLedgerView const&) constructor.

The branch's parallel TxFrame creation paths only checked XDRProvidesValidFee() but missed the getInclusionFee() <= 0 check that upstream added in the sequential equivalents. Restore parity so generalized tx sets with negative-fee txs are rejected during construction.

- BucketBase.cpp: use auto const& instead of auto when retrieving entries from input iterators in mergeCasesWithDefaultAcceptance, avoiding one full BucketEntry deep copy per entry in the dominant merge path.
- TxSetFrame.cpp: add missing std::move when initializing mStages from a named rvalue reference parameter, which silently copied the entire nested vector structure.
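Both pitfalls are easy to reproduce with a type that counts its own copies. The sketch below uses hypothetical names (`Counted`, `CopyingHolder`, `MovingHolder`) and is not the stellar-core code. It shows that a named rvalue reference parameter is an lvalue inside the function, so initializing a member from it copies unless `std::move` is applied, and that `auto` copies an element where `auto const&` does not.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Counts copy operations so the hidden copies become observable.
struct Counted
{
    static int copies;
    Counted() = default;
    Counted(Counted const&) { ++copies; }
    Counted(Counted&&) noexcept {}
    Counted& operator=(Counted const&) { ++copies; return *this; }
    Counted& operator=(Counted&&) noexcept { return *this; }
};
int Counted::copies = 0;

// 'stages' has a name, so it is an lvalue here: this copies every element.
struct CopyingHolder
{
    std::vector<Counted> mStages;
    explicit CopyingHolder(std::vector<Counted>&& stages) : mStages(stages) {}
};

// std::move turns it back into an rvalue: the vector's buffer is stolen.
struct MovingHolder
{
    std::vector<Counted> mStages;
    explicit MovingHolder(std::vector<Counted>&& stages)
        : mStages(std::move(stages))
    {
    }
};

// 'auto' deduces a value type, so each iteration copies one element.
int
copiesWithAuto(std::vector<Counted> const& v)
{
    Counted::copies = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
    {
        auto e = v[i];
        (void)e;
    }
    return Counted::copies;
}

// 'auto const&' binds to the existing element without copying.
int
copiesWithAutoRef(std::vector<Counted> const& v)
{
    Counted::copies = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
    {
        auto const& e = v[i];
        (void)e;
    }
    return Counted::copies;
}
```

Neither mistake produces a compiler diagnostic by default, which is why they tend to surface only in profiles.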

- writeOne now attempts an optimistic single-pass serialize when the buffer is already large enough, skipping the xdr_size() pre-traversal. Falls back to the two-pass path (with 2x headroom growth) on overflow. After warm-up, eliminates one full XDR tree traversal per entry written.
- Add writeRaw() method that writes pre-serialized framed record bytes directly to the output stream, supporting SHA-256 hashing and byte counting. This enables raw byte passthrough in bucket merge.
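The optimistic-then-fallback strategy for writeOne can be sketched with a toy encoder. All names below (`BoundedWriter`, `encode`, `encodedSize`, `RecordWriter`) are hypothetical, and the "record" is just a string encoded as a 4-byte big-endian length plus zero-padded payload. The sketch assumes the encoder can detect overflow while writing, so the size pre-pass is needed only when the first attempt does not fit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Writes into a fixed-size buffer and records overflow instead of growing.
struct BoundedWriter
{
    std::vector<std::uint8_t>& buf;
    std::size_t pos = 0;
    bool overflow = false;

    void
    put(std::uint8_t b)
    {
        if (pos < buf.size())
        {
            buf[pos++] = b;
        }
        else
        {
            overflow = true;
        }
    }
};

// Toy encoding: big-endian length, payload, zero padding to 4 bytes.
void
encode(std::string const& rec, BoundedWriter& w)
{
    auto n = static_cast<std::uint32_t>(rec.size());
    for (int shift = 24; shift >= 0; shift -= 8)
    {
        w.put(static_cast<std::uint8_t>(n >> shift));
    }
    for (unsigned char c : rec)
    {
        w.put(c);
    }
    for (std::size_t i = rec.size(); i % 4 != 0; ++i)
    {
        w.put(0);
    }
}

// Stand-in for the separate size traversal.
std::size_t
encodedSize(std::string const& rec)
{
    return 4 + (rec.size() + 3) / 4 * 4;
}

struct RecordWriter
{
    std::vector<std::uint8_t> mBuf;
    int mSizePasses = 0; // how often the fallback pre-pass was needed

    // Returns the number of encoded bytes now at the front of mBuf.
    std::size_t
    writeOne(std::string const& rec)
    {
        if (!mBuf.empty())
        {
            BoundedWriter w{mBuf}; // optimistic single pass
            encode(rec, w);
            if (!w.overflow)
            {
                return w.pos;
            }
        }
        ++mSizePasses; // fallback: measure, grow with 2x headroom, retry
        mBuf.resize(encodedSize(rec) * 2);
        BoundedWriter w{mBuf};
        encode(rec, w);
        assert(!w.overflow);
        return w.pos;
    }
};
```

After the buffer has grown to fit the largest record seen so far, every later record of equal or smaller size is written in a single pass.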

During bucket merges, entries that pass through unchanged (not merged with a counterpart from the other input) are read, deserialized, then re-serialized on output. This commit adds infrastructure to retain the raw framed bytes from XDRInputFileStream::readOne() and pass them directly to the output stream via writeRaw(), bypassing serialization.
Key changes:
- XDRInputFileStream captures raw bytes during readOne() via mRawBuf
- BucketInputIterator exposes moveRawBytes() to retrieve raw entry bytes
- BucketOutputIterator gains putWithRaw() that buffers raw bytes alongside the deserialized entry, using writeRaw() on flush instead of writeOne()
- BucketBase merge lambda uses address comparison to detect unmodified entries and routes them through the raw passthrough path

opaque_vec<N> is now xdr::shared_bytes<N> (a refcounted, copy-on-write byte container) rather than std::vector<uint8_t>. A handful of call sites assumed the std::vector representation and need adjustment:
- ByteSlice gains a templated ctor for shared_bytes<N>, matching the existing opaque_array ctor.
- KeyFunctions<{PublicKey,SignerKey}>::getKeyValue construct std::vector explicitly from the (now shared_bytes) xdr_to_opaque result.
- toCxxBuf builds the unique_ptr<vector<uint8_t>> from iterators since the Rust bridge's CxxBuf type still wraps a std::vector.
- SharedModuleCacheCompiler stops using xvector<uint8_t> directly; switch to opaque_vec<> so its API tracks the typedef.
- StreamingShaTest: same.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

readOne now reads each framed record into a per-record std::shared_ptr<uint8_t[]> buffer and decodes through the shared-buffer aware xdr_get archive. shared_bytes fields in the decoded value (opaque_vec, contract code, SCBytes payloads, etc.) become refcounted views into the read buffer instead of getting freshly allocated + memcpied from it. For records with a large opaque<> payload (Wasm uploads, contract data entries, large LedgerCloseMeta segments) this cuts the decode cost from "1 alloc + payload memcpy per opaque<> field" to "1 refcount bump + 16 bytes of bookkeeping per field."
Buffer management:
- If no outstanding references remain (mReadBuf.use_count() == 1) and the existing capacity is sufficient, the buffer is reused in place.
- If the existing capacity is too small and the buffer is uniquely owned, it's reallocated with amortized doubling.
- If outstanding shared_bytes views still reference the previous buffer (use_count > 1), a fresh buffer is allocated sized exactly to this record's payload so the old buffer can be released as soon as its dependents drop, avoiding capacity-pinning of large prior records by accumulated small ones (e.g. bucket index build).
XDR-framed records are required to be 4-byte aligned by RFC 4506, so the xdr_get end-pointer alignment invariant is satisfied directly by \c sz and we can call \c g.done() like the previous implementation.
Verified: full [bucket] suite (47 cases, 1.79M assertions) and full [soroban] suite (113 cases, 3.54M assertions) pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
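The three buffer-management cases can be sketched with a small helper class. `ReadBuffer` and `prepare` are hypothetical names, and the sketch covers only the allocation policy, not reading or decoding. It relies on `use_count() == 1` meaning that no decoded value still holds a view into the current buffer, which is reliable only when no other thread is copying or dropping views concurrently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

class ReadBuffer
{
    std::shared_ptr<std::uint8_t[]> mBuf;
    std::size_t mCapacity = 0;

  public:
    // Returns a buffer with room for at least sz bytes. Callers that keep a
    // copy of the returned shared_ptr act as outstanding views.
    std::shared_ptr<std::uint8_t[]> const&
    prepare(std::size_t sz)
    {
        bool uniquelyOwned = mBuf && mBuf.use_count() == 1;
        if (uniquelyOwned && mCapacity >= sz)
        {
            // Case 1: no views and enough room, reuse in place.
        }
        else if (uniquelyOwned || !mBuf)
        {
            // Case 2: no views but too small, grow with amortized doubling.
            std::size_t cap = std::max(sz, mCapacity * 2);
            mBuf.reset(new std::uint8_t[cap]);
            mCapacity = cap;
        }
        else
        {
            // Case 3: views still reference the old buffer. Allocate exactly
            // sz so the old buffer is freed as soon as its views are dropped.
            mBuf.reset(new std::uint8_t[sz]);
            mCapacity = sz;
        }
        assert(mBuf.use_count() == 1);
        return mBuf;
    }

    std::size_t
    capacity() const
    {
        return mCapacity;
    }
};
```

In case 3 the exact-size allocation matters because a small record decoded into a large reused buffer would otherwise keep the whole large allocation alive for as long as any view into it exists.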

Two changes on the Soroban host ↔ C++ XDR boundary:
1. toCxxBuf encodes directly into the destination std::vector<uint8_t> via the new xdr_to_vector helper. Previously the typedef flip from xvector<uint8_t> to xdr::shared_bytes left toCxxBuf paying an extra shared_bytes allocation + element-by-element copy on every encode, doubling the alloc/memcpy cost on the hot host-input encode path (LedgerEntry buffers for footprint, auth entries, host function arg, source ID, resources). The new path is a single resize + xdr encode pass into the vector that ends up in the CxxBuf.
2. The four xdr_from_opaque sites in InvokeHostFunctionOpFrame's output decoding (modified ledger entries, contract events, diagnostic events, host function return value) now go through xdrFromHostBytes, which wraps the rust::Vec<uint8_t> payload in a shared_ptr<uint8_t[]> once and decodes through the shared-buffer xdr_get. opaque<> fields in the decoded value alias into that single wrap buffer rather than each allocating their own. For LedgerEntries that include large Wasm/contract code blobs this collapses N per-field allocations into 1 wrap allocation. xdrFromHostBytes lives in TransactionUtils.h alongside toCxxBuf.
Verified: full [bucket] (47 cases, 1.79M assertions) and full [soroban] (113 cases, 3.54M assertions) suites pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ed_ptr alias)
xdrFromHostBytes used to allocate a fresh \c shared_ptr<uint8_t[]> and memcpy the host-returned \c rust::Vec<uint8_t> payload into it before handing the decoder a shared backing buffer. Since the host-side buffer is dropped immediately after decode, the copy is wasted. The helper now moves \c rust::Vec into a heap-allocated holder and hands the decoder a \c shared_ptr aliased to the holder's bytes via the aliasing constructor. The decoded value's \c shared_bytes fields share the holder. Holder (and the Rust allocation it owns) is freed when the last refcount drops. For an N-byte host-returned payload this turns the wrap from "1 alloc + 1 memcpy of N" into "1 small alloc for the holder + control block (~48 B)" with no payload copy at all.
Call sites in InvokeHostFunctionOpFrame switch to non-const iteration and move-in:
- maybePopulateOutputDiagnosticEvents
- recordStorageChanges
- collectEvents
- finalizeSuccess
recordStorageChanges captures the encoded \c entrySize before moving \c buf.data into the decoder. It was previously read twice from buf.data.size(), once for the validation call and once for the metric counter; without the capture, both reads would happen after the move.
The streaming SHA256 over the InvokeHostFunctionSuccessPreImage encoding used to run in finalizeSuccess and reads the original host-returned \c out.result_value.data + \c out.contract_events[].data bytes. Those bytes are now moved into the decoder by collectEvents/finalizeSuccess; we therefore hoist the hash up: it now runs in collectEvents before any moves and the result is stored in \c mSuccessPreImageHash for finalizeSuccess to consume. Same byte sequence, same hash, no extra buffer materialized.
Verified: [ledgerclosemeta] suite passes; full \c make check is otherwise green (the day-seed-flaky binary-fuse-error-retry test was confirmed pre-existing in an earlier bisect).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
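The aliasing-constructor technique described above can be shown in isolation. In this sketch `std::vector<uint8_t>` stands in for `rust::Vec<uint8_t>`, and `Holder` and `wrapWithoutCopy` are hypothetical names rather than the helper from this PR. The returned pointer addresses the payload bytes but shares ownership of the holder, so the payload is neither copied nor freed early.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Owns the moved-in payload; std::vector stands in for rust::Vec<uint8_t>.
struct Holder
{
    std::vector<std::uint8_t> bytes;
};

std::shared_ptr<std::uint8_t const>
wrapWithoutCopy(std::vector<std::uint8_t>&& payload)
{
    // One small allocation for the holder and its control block. Moving the
    // vector transfers its heap buffer, so the payload is not copied.
    auto holder = std::make_shared<Holder>(Holder{std::move(payload)});

    // Aliasing constructor: the result points at the payload bytes but keeps
    // the holder (and therefore the payload) alive via the shared refcount.
    return std::shared_ptr<std::uint8_t const>(holder, holder->bytes.data());
}
```

When the last copy of the returned pointer is destroyed, the holder's destructor runs and releases the payload, matching the lifetime rule described in the commit message.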

dmkozh and others added 30 commits May 28, 2026 15:13

budget opt step 1

119e987

rollback env, update benchmark config

24c3ea3

disable test meta

083e0c8

validate txs in parallel, small improvement on some tests (?)

3e10875

Parallel pre-apply 5-20ms

cbe0cb5

profile flag for bench matrix

53ecfc4

Cache ledger info

87bb20e

add config flag for ledger close worker threads

eeaba98

Detailed apply stage breakdown

8e725ae

Optimize rescope using move.

1d2f2da

-5ms for 6400 SAC transfers scenario

add tracy support to bench matrix

80838cb

Optimize recordStorageChanges.

67f57bb

Use bitset instead of maps and relax invariants a bit. This is pretty impactful - -10ms apply time for SAC, -20ms apply time for soroswap

Remove extra lookup from upsert

690373f

update scenarios

d1e7c10

More robust path handling in apply load matrix script

9183b6b

dmkozh and others added 20 commits May 28, 2026 15:23

Cache LedgerKey hash in parallel apply data structures - ~-5ms

f92295e

Manual txset building instrumentation

8eb6ed4

storage opt

369444f

revert host module to p26

dc18b67

format

18a753e

fix a bug - in-memory state update shouldn't be conditioned on protoc…

1a1a8b0

…ol version

xdr opt

6c23ceb

use scratch buffer for XDR encode where applicable

6e37754

xdrpp xdr optimizations

c01b024

bump xdrpp: shared_bytes throws xdr_overflow on size-limit violation

c79c852

dmkozh wants to merge 50 commits into stellar:master from dmkozh:ai_opt_p27

2 participants
