
XDR opt experiment #5300

Draft
dmkozh wants to merge 50 commits into stellar:master from dmkozh:ai_opt_p27

Conversation

@dmkozh
Contributor

@dmkozh dmkozh commented May 29, 2026

No description provided.

dmkozh and others added 30 commits May 28, 2026 15:13
Replace xdrSha256(success) with streaming SHA256 calculation to avoid
XDR re-serialization of InvokeHostFunctionSuccessPreImage. The return
value and events are already available as XDR-encoded bytes, so we can
hash them directly without round-trip serialization.
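
The idea above can be sketched as follows. This is a minimal illustration, not the PR's code: `StreamingHash` is a hypothetical stand-in that uses 64-bit FNV-1a instead of SHA-256, and the assumed layout (return value bytes, then a 4-byte big-endian event count, then each event's bytes) follows the usual XDR encoding of a struct ending in a variable-length array. It shows that feeding already-encoded pieces to an incremental hasher gives the same digest as hashing a materialized buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<uint8_t>;

// Hypothetical incremental hasher (FNV-1a, 64-bit). NOT SHA-256; it only
// stands in for any hash with an add()/finish() streaming interface.
class StreamingHash
{
    uint64_t mState = 14695981039346656037ULL;

  public:
    void add(uint8_t const* data, size_t n)
    {
        for (size_t i = 0; i < n; ++i)
        {
            mState ^= data[i];
            mState *= 1099511628211ULL;
        }
    }
    void add(Bytes const& b) { add(b.data(), b.size()); }
    uint64_t finish() const { return mState; }
};

// Streaming path: hash the pre-encoded pieces in wire order, no big buffer.
uint64_t hashPreImageStreaming(Bytes const& returnValue,
                               std::vector<Bytes> const& events)
{
    assert(events.size() <= UINT32_MAX);
    auto const n = static_cast<uint32_t>(events.size());
    uint8_t const count[4] = {uint8_t(n >> 24), uint8_t(n >> 16),
                              uint8_t(n >> 8), uint8_t(n)};
    StreamingHash h;
    h.add(returnValue);
    h.add(count, 4); // XDR array length prefix (assumed layout)
    for (auto const& e : events)
        h.add(e);
    return h.finish();
}

// Reference path: build the whole encoding first, then hash it once.
uint64_t hashPreImageMaterialized(Bytes const& returnValue,
                                  std::vector<Bytes> const& events)
{
    auto const n = static_cast<uint32_t>(events.size());
    Bytes all(returnValue);
    for (int shift = 24; shift >= 0; shift -= 8)
        all.push_back(uint8_t(n >> shift));
    for (auto const& e : events)
        all.insert(all.end(), e.begin(), e.end());
    StreamingHash h;
    h.add(all);
    return h.finish();
}
```

Both functions consume the same byte sequence, so the digests match; the streaming version skips the intermediate allocation and copy.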
Adds parallel processing to transaction set handling:

1. Parallel TxFrame creation: Creates TxFrames from XDR envelopes in
   parallel during transaction set deserialization. Uses work-stealing
   via std::async with even distribution across available threads.

2. Parallel transaction validation: Validates transactions in parallel
   in txsAreValid() when there are 2+ transactions.

3. Hash precomputation: Precomputes content and full hashes before
   parallel operations to avoid race conditions.

4. Test coverage: Adds StreamingShaTest for InvokeHostFunctionSuccessPreImage
   verification.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
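
A rough sketch of the "even distribution via std::async" part of the commit above, under stated assumptions: `parallelMap` is a hypothetical helper, not a function from the PR, and it uses a static partition of the input (it does not implement work-stealing). Each task writes only to its own index range, so the output vector needs no locking.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical helper: apply fn to every element, splitting the input into
// contiguous chunks whose sizes differ by at most one.
// Out must not be bool (std::vector<bool> is not safe for concurrent writes).
template <typename Out, typename In, typename Fn>
std::vector<Out>
parallelMap(std::vector<In> const& in, size_t maxThreads, Fn fn)
{
    std::vector<Out> out(in.size());
    if (in.empty())
        return out;

    size_t const n = std::max<size_t>(1, std::min(maxThreads, in.size()));
    size_t const base = in.size() / n;
    size_t const extra = in.size() % n; // first `extra` chunks get one more

    std::vector<std::future<void>> futures;
    futures.reserve(n);
    size_t begin = 0;
    for (size_t t = 0; t < n; ++t)
    {
        size_t const end = begin + base + (t < extra ? 1 : 0);
        futures.push_back(std::async(
            std::launch::async, [&in, &out, &fn, begin, end] {
                for (size_t i = begin; i < end; ++i)
                    out[i] = fn(in[i]); // disjoint ranges: no data race
            }));
        begin = end;
    }
    assert(begin == in.size());
    for (auto& f : futures)
        f.get(); // joins and rethrows any exception from the task
    return out;
}
```

For example, `parallelMap<int>(values, 4, [](int x) { return x * x; })` squares each element on up to four tasks. Anything shared and mutable that `fn` touches (such as lazily computed hashes) has to be made safe beforehand, which is the role the hash precomputation step plays in the commit message.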
Add sizeBytes field to ContractDataMapEntryT to cache the XDR serialized
size of ledger entries. This avoids repeated xdr_size() calls during
state updates, reducing CPU overhead in the hot path.

Also adds Tracy zone to updateState() for profiling visibility.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
During ledger close, three independent operations are now parallelized:
- addHotArchiveBatch (modifies mHotArchiveBucketList)
- addLiveBatch (modifies mLiveBucketList) - runs on main thread
- updateInMemorySorobanState (modifies mInMemorySorobanState)

These operations modify completely independent data structures and can
safely run concurrently. Added getInMemorySorobanStateForUpdate() to
allow direct access to mInMemorySorobanState during COMMITTING phase.

This reduces ledger close latency by overlapping CPU-bound operations.

# Conflicts:
#	src/ledger/LedgerManagerImpl.cpp
-5ms for 6400 SAC transfers scenario
libsodium uses a portable C SHA256 implementation, missing SHA-NI hardware
instructions available on Intel Xeon Platinum. OpenSSL automatically uses
SHA-NI, providing 4.6x speedup for streaming add() (893ns->193ns/call) and
56% total SHA256 self-time reduction (3,744ms->1,659ms per 30s trace).

Use opaque aligned storage for SHA256_CTX in the header to avoid naming
conflict between OpenSSL's ::SHA256 function and stellar::SHA256 class.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…5ms/ledger)

Run LiveBucketIndex construction on async worker thread in parallel with
the put loop in mergeInMemory. Both read mergedEntries as const — fully
independent. Tracy confirms full overlap: index future wait averages 2.2µs.
finalizeLedgerTxnChanges drops from 164ms to 136ms per ledger.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When ledgerCloseMeta is null (meta tracking disabled), operate directly
on the parent LTX in processFeesSeqNums and processPostTxSetApply instead
of creating a child LTX per-transaction. The child LTX was only needed
for getChanges() meta tracking.

Saves ~41ms/ledger from eliminating ~10.6K child LTX create/commit
cycles. Combined with experiment 011 (meta tracking), TPS improves
from 10,688 to 12,736 (+19.2%).

Also raises APPLY_LOAD_MAX_SAC_TPS_MAX_TPS from 12000 to 15000.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

# Conflicts:
#	docs/apply-load-max-sac-tps.cfg
In commitChangesToLedgerTxn, determining whether an entry is INIT (new)
vs LIVE (existing) required calling mInMemorySorobanState.get() which
computes sha256(xdr_to_opaque(key)) for every CONTRACT_DATA entry.
With ~40K entries per ledger, this added ~16ms of SHA256 per ledger.

Track existence via a bool mIsNew flag in ParallelApplyEntry, set when
a TX creates an entry that didn't previously exist. This replaces the
expensive SHA256-based existence check with a simple boolean.

commitChangesToLedgerTxn: 72.6ms -> 44.2ms (-39%)
TPS: 16,640 -> 16,960 (+1.9%)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

# Conflicts:
#	src/transactions/ParallelApplyUtils.cpp
Add move overloads for createWithoutLoading/updateWithoutLoading and
ScopedLedgerEntryOpt::moveFromScope to eliminate two deep copies per
entry when committing parallel apply state to LedgerTxn. Reduces
commitChangesToLedgerTxn from 44ms to 39ms per ledger (-12.8%).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
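
The effect of a move overload can be sketched in isolation. The names below (`Entry`, `Store`) are hypothetical stand-ins, not the LedgerTxn types; the copy counter exists only to make the difference observable.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>

// Stand-in for a ledger entry; counts deep copies.
struct Entry
{
    std::string payload;
    static inline int copies = 0;

    explicit Entry(std::string p) : payload(std::move(p)) {}
    Entry(Entry const& o) : payload(o.payload) { ++copies; }
    Entry(Entry&&) noexcept = default;
};

// Stand-in for the store being committed to.
class Store
{
    std::unordered_map<std::string, Entry> mEntries;

  public:
    // Existing overload: the caller keeps its entry, so we must copy.
    void createWithoutLoading(std::string const& key, Entry const& e)
    {
        bool inserted = mEntries.emplace(key, e).second;
        assert(inserted);
        (void)inserted;
    }
    // Added overload: the caller gives the entry away, so we steal it.
    void createWithoutLoading(std::string const& key, Entry&& e)
    {
        bool inserted = mEntries.emplace(key, std::move(e)).second;
        assert(inserted);
        (void)inserted;
    }
    size_t size() const { return mEntries.size(); }
};
```

A caller that is done with its entry writes `store.createWithoutLoading(key, std::move(entry))` and pays for a move instead of a deep copy.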
Pre-load Soroban read-only entries (contract instance, code, TTL) into
the global parallel apply state during setup, so per-TX lookups hit
thread-local maps instead of traversing to InMemorySorobanState. Also
cache protocol version and skip Soroban merge tracking in
processFeesSeqNums, and use std::move for mLatestTxResultSet.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

# Conflicts:
#	docs/success/049-skip-child-ltx-processFeesSeqNums.md
Use bitset instead of maps and relax invariants a bit.

This is pretty impactful: -10ms apply time for SAC and -20ms apply time for soroswap.
Pre-compute expected entry counts from footprint sizes and call reserve()
on ParallelApplyEntryMap containers before they accumulate entries.
Eliminates log2(N) rehash operations during parallel apply, yielding
-26% commitChangesFromThread and -27% commitChangesToLedgerTxn self-time.

+576 TPS (+3.1%): 18,368 → 18,944

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

# Conflicts:
#	src/transactions/ParallelApplyUtils.cpp
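
A small sketch of why the up-front reserve() helps, assuming a plain `std::unordered_map` in place of ParallelApplyEntryMap. `countRehashes` is a hypothetical measuring helper: it counts how often the bucket array is rebuilt while inserting `n` keys.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical helper: insert n keys and count bucket-array rebuilds.
size_t countRehashes(size_t n, bool reserveFirst)
{
    std::unordered_map<uint64_t, uint64_t> m;
    if (reserveFirst)
        m.reserve(n); // sized so that n elements fit without rehashing

    size_t rehashes = 0;
    size_t buckets = m.bucket_count();
    for (uint64_t i = 0; i < n; ++i)
    {
        m.emplace(i, i);
        if (m.bucket_count() != buckets)
        {
            ++rehashes;
            buckets = m.bucket_count();
        }
    }
    assert(m.size() == n);
    return rehashes;
}
```

With `reserveFirst` set, the standard guarantees no rehash as long as the element count stays at or below the reserved amount; without it, the table is rebuilt repeatedly as it grows. In the commit, the reserved amount is derived from footprint sizes.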
resolveBackgroundEvictionScan previously received an UnorderedSet<LedgerKey>
built by getAllKeysWithoutSealing() containing ~128K entries (~20ms to build),
but only performed ~10-100 lookups. Added isModifiedKey() to LedgerTxn for
direct O(1) lookups in the existing EntryMap, eliminating the set construction.

resolveEviction zone: 20ms -> 0.116ms per ledger (99.4% reduction).
TPS: 18,944 -> 19,328 avg (+2.0%).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace single global mutex + RandomEvictionCache with 16 sharded caches,
each with its own mutex. This eliminates contention when 4 parallel threads
verify signatures simultaneously. Also use maybeGet() instead of exists()+get()
double-lookup, fix ZoneText string heap allocations, make counters atomic,
and remove unused liveSnapshot copy in applySorobanStageClustersInParallel.
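
The sharding scheme can be sketched like this. It is a simplified illustration, not the PR's cache: `ShardedCache` is a hypothetical name, the shards are plain unbounded maps (no random eviction), and keys are strings. Each key hashes to one shard, and only that shard's mutex is taken.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

template <size_t NumShards = 16>
class ShardedCache
{
    static_assert(NumShards > 0, "need at least one shard");

    struct Shard
    {
        std::mutex mMutex;
        std::unordered_map<std::string, bool> mMap;
    };
    std::array<Shard, NumShards> mShards;

    Shard& shardFor(std::string const& key)
    {
        return mShards[std::hash<std::string>{}(key) % NumShards];
    }

  public:
    void put(std::string const& key, bool value)
    {
        Shard& s = shardFor(key);
        std::lock_guard<std::mutex> guard(s.mMutex);
        s.mMap[key] = value;
    }

    // One lookup under one lock, instead of exists() followed by get().
    std::optional<bool> maybeGet(std::string const& key)
    {
        Shard& s = shardFor(key);
        std::lock_guard<std::mutex> guard(s.mMutex);
        auto it = s.mMap.find(key);
        if (it == s.mMap.end())
            return std::nullopt;
        return it->second;
    }
};
```

Threads that touch keys in different shards never contend on the same mutex, which is the point of replacing the single global lock.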
Sort lightweight 24-byte EntryRef structs (type tag + pointer) instead of
full BucketEntry objects (200-500 bytes) in convertToBucketEntry. Reduces
sort swap cost by ~12x and materializes final vector in one cache-friendly
sequential pass. Cuts convertToBucketEntry from 31.9ms to 25.4ms per ledger.

Benchmark: 13,760 -> 14,144 TPS (+384 TPS, +2.8%)
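
The sort-by-reference technique from the convertToBucketEntry commit above, sketched with hypothetical types. `BigEntry` stands in for a large BucketEntry and `EntryRef` for the lightweight handle; the exact fields and sizes are assumptions for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <string>
#include <vector>

// Stand-in for a large entry that is expensive to swap.
struct BigEntry
{
    int type;
    std::string key;
    std::array<char, 256> payload{};
};

// Small handle: a type tag plus a pointer to the real entry.
struct EntryRef
{
    int type;
    BigEntry const* entry;
};

std::vector<BigEntry> sortedCopy(std::vector<BigEntry> const& in)
{
    std::vector<EntryRef> refs;
    refs.reserve(in.size());
    for (auto const& e : in)
        refs.push_back({e.type, &e});

    // The sort only moves small EntryRef values around.
    std::sort(refs.begin(), refs.end(),
              [](EntryRef const& a, EntryRef const& b) {
                  if (a.type != b.type)
                      return a.type < b.type;
                  return a.entry->key < b.entry->key;
              });

    // Materialize the final vector in one sequential pass.
    std::vector<BigEntry> out;
    out.reserve(in.size());
    for (auto const& r : refs)
        out.push_back(*r.entry);
    assert(out.size() == in.size());
    return out;
}
```

Each large entry is copied exactly once, at its final position, rather than being swapped many times during the sort.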
Skip building LedgerTxnDelta in setEffectsDeltaFromSuccessfulTx when
INVARIANT_CHECKS is empty. The delta is consumed exclusively by
checkOnOperationApply which iterates an empty list when no invariants
are configured. This eliminates ~285ms of shared_ptr allocations and
entry copies across 4 worker threads per ledger.

Benchmark: 12,736 -> 13,760 TPS (+1,024 TPS, +8.0%)
dmkozh and others added 20 commits May 28, 2026 15:23
LedgerSnapshot was renamed to CheckValidLedgerViewWrapper and
ApplyLedgerStateSnapshot to ApplyLedgerView in upstream's LedgerState
refactor. Branch's parallel pre-apply paths used the old names; rename
to match. ApplyLedgerView privately inherits from ImmutableLedgerView,
so use executeWithMaybeInnerSnapshot to derive a
CheckValidLedgerViewWrapper from it for the read-only pre-apply paths.
The previous adaptation used ApplyLedgerView::executeWithMaybeInnerSnapshot
to derive a CheckValidLedgerViewWrapper, but ImmutableLedgerView (and
therefore ApplyLedgerView via using-declaration) explicitly throws on that
call. Instead, add a narrow accessor that hands out the underlying
ImmutableLedgerView and use the existing
CheckValidLedgerViewWrapper(ImmutableLedgerView const&) constructor.
The branch's parallel TxFrame creation paths only checked
XDRProvidesValidFee() but missed the getInclusionFee() <= 0 check that
upstream added in the sequential equivalents. Restore parity so
generalized tx sets with negative-fee txs are rejected during construction.
- BucketBase.cpp: use auto const& instead of auto when retrieving entries
  from input iterators in mergeCasesWithDefaultAcceptance, avoiding one
  full BucketEntry deep copy per entry in the dominant merge path.

- TxSetFrame.cpp: add missing std::move when initializing mStages from
  a named rvalue reference parameter, which silently copied the entire
  nested vector structure.
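
The TxSetFrame.cpp item is an instance of a general C++ rule: a named rvalue reference parameter is itself an lvalue, so initializing a member from it copies unless std::move is applied. A minimal sketch with hypothetical types (`Stages`, `FrameCopying`, `FrameMoving`) and a copy counter for visibility:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Stand-in for the nested stage structure; counts deep copies.
struct Stages
{
    std::vector<std::vector<int>> data;
    static inline int copies = 0;

    Stages() = default;
    Stages(Stages const& o) : data(o.data) { ++copies; }
    Stages(Stages&&) noexcept = default;
};

// Before: `s` has a name, so it is an lvalue here and is copied.
struct FrameCopying
{
    Stages mStages;
    explicit FrameCopying(Stages&& s) : mStages(s) {}
};

// After: std::move turns the named parameter back into an rvalue.
struct FrameMoving
{
    Stages mStages;
    explicit FrameMoving(Stages&& s) : mStages(std::move(s)) {}
};
```

The copying version compiles without warnings, which is why this kind of omission tends to go unnoticed.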
- writeOne now attempts an optimistic single-pass serialize when the
  buffer is already large enough, skipping the xdr_size() pre-traversal.
  Falls back to the two-pass path (with 2x headroom growth) on overflow.
  After warm-up, eliminates one full XDR tree traversal per entry written.

- Add writeRaw() method that writes pre-serialized framed record bytes
  directly to the output stream, supporting SHA-256 hashing and byte
  counting. This enables raw byte passthrough in bucket merge.
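
A sketch of the optimistic single-pass idea in writeOne, under stated assumptions: `Record`, `encodedSize`, `tryEncode`, and `Writer` are hypothetical, and the toy encoding (4-byte big-endian length, body, zero padding to a multiple of 4) only stands in for real XDR serialization and framing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

struct Record
{
    std::vector<uint8_t> body;
};

// Stand-in for the size pre-traversal: a full walk of the value.
size_t encodedSize(Record const& r)
{
    return 4 + ((r.body.size() + 3) & ~size_t(3));
}

// Bounds-checked encode; returns bytes written, or nullopt on overflow.
std::optional<size_t> tryEncode(Record const& r, uint8_t* buf, size_t cap)
{
    size_t pos = 0;
    auto put = [&](uint8_t b) {
        if (pos == cap)
            return false;
        buf[pos++] = b;
        return true;
    };
    auto const len = static_cast<uint32_t>(r.body.size());
    for (int shift = 24; shift >= 0; shift -= 8)
        if (!put(uint8_t(len >> shift)))
            return std::nullopt;
    for (uint8_t b : r.body)
        if (!put(b))
            return std::nullopt;
    while (pos % 4 != 0)
        if (!put(0))
            return std::nullopt;
    return pos;
}

class Writer
{
    std::vector<uint8_t> mBuf; // reusable scratch buffer
    std::vector<uint8_t> mOut; // stand-in for the output stream
    size_t mSizePasses = 0;

  public:
    void writeOne(Record const& r)
    {
        // Optimistic path: encode straight into the existing buffer.
        auto n = tryEncode(r, mBuf.data(), mBuf.size());
        if (!n)
        {
            // Fallback: measure, grow with 2x headroom, encode again.
            ++mSizePasses;
            mBuf.resize(encodedSize(r) * 2);
            n = tryEncode(r, mBuf.data(), mBuf.size());
            assert(n.has_value());
        }
        mOut.insert(mOut.end(), mBuf.begin(),
                    mBuf.begin() + static_cast<std::ptrdiff_t>(*n));
    }
    size_t sizePasses() const { return mSizePasses; }
    std::vector<uint8_t> const& output() const { return mOut; }
};
```

Once the scratch buffer has grown to fit typical records, later writes take the optimistic path and skip the size traversal entirely.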
During bucket merges, entries that pass through unchanged (not merged
with a counterpart from the other input) are read, deserialized, then
re-serialized on output. This commit adds infrastructure to retain the
raw framed bytes from XDRInputFileStream::readOne() and pass them
directly to the output stream via writeRaw(), bypassing serialization.

Key changes:
- XDRInputFileStream captures raw bytes during readOne() via mRawBuf
- BucketInputIterator exposes moveRawBytes() to retrieve raw entry bytes
- BucketOutputIterator gains putWithRaw() that buffers raw bytes alongside
  the deserialized entry, using writeRaw() on flush instead of writeOne()
- BucketBase merge lambda uses address comparison to detect unmodified
  entries and routes them through the raw passthrough path
opaque_vec<N> is now xdr::shared_bytes<N> (a refcounted, copy-on-write
byte container) rather than std::vector<uint8_t>. A handful of call
sites assumed the std::vector representation and need adjustment:

- ByteSlice gains a templated ctor for shared_bytes<N>, matching the
  existing opaque_array ctor.
- KeyFunctions<{PublicKey,SignerKey}>::getKeyValue construct std::vector
  explicitly from the (now shared_bytes) xdr_to_opaque result.
- toCxxBuf builds the unique_ptr<vector<uint8_t>> from iterators since
  the Rust bridge's CxxBuf type still wraps a std::vector.
- SharedModuleCacheCompiler stops using xvector<uint8_t> directly; switch
  to opaque_vec<> so its API tracks the typedef.
- StreamingShaTest: same.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
readOne now reads each framed record into a per-record
std::shared_ptr<uint8_t[]> buffer and decodes through the shared-buffer
aware xdr_get archive. shared_bytes fields in the decoded value
(opaque_vec, contract code, SCBytes payloads, etc.) become refcounted
views into the read buffer instead of getting freshly allocated +
memcpied from it.

For records with a large opaque<> payload — Wasm uploads, contract data
entries, large LedgerCloseMeta segments — this cuts the decode cost
from "1 alloc + payload memcpy per opaque<> field" to "1 refcount bump
+ 16 bytes of bookkeeping per field."

Buffer management:
- If no outstanding references remain (mReadBuf.use_count() == 1) and
  the existing capacity is sufficient, the buffer is reused in place.
- If the existing capacity is too small and the buffer is uniquely
  owned, it's reallocated with amortized doubling.
- If outstanding shared_bytes views still reference the previous
  buffer (use_count > 1), a fresh buffer is allocated sized exactly
  to this record's payload so the old buffer can be released as soon
  as its dependents drop, avoiding capacity-pinning of large prior
  records by accumulated small ones (e.g. bucket index build).

XDR-framed records are required to be 4-byte aligned by RFC 4506, so
the xdr_get end-pointer alignment invariant is satisfied directly by
\c sz and we can call \c g.done() like the previous implementation.

Verified: full [bucket] suite (47 cases, 1.79M assertions) and full
[soroban] suite (113 cases, 3.54M assertions) pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
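
The three buffer-management cases above can be sketched as follows. `RecordReader`, `prepare`, and `view` are hypothetical names, not the XDRInputFileStream API; `view()` simulates a decoded value that still holds a reference into the buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

class RecordReader
{
    std::shared_ptr<uint8_t[]> mReadBuf;
    size_t mCapacity = 0;

  public:
    // Returns a buffer with room for sz bytes, following the three cases.
    uint8_t* prepare(size_t sz)
    {
        bool const unique = mReadBuf && mReadBuf.use_count() == 1;
        if (unique && mCapacity >= sz)
        {
            // Case 1: nobody else references it and it fits: reuse in place.
        }
        else if (unique)
        {
            // Case 2: uniquely owned but too small: amortized doubling.
            size_t const newCap = std::max(sz, mCapacity * 2);
            mReadBuf.reset(new uint8_t[newCap]);
            mCapacity = newCap;
        }
        else
        {
            // Case 3: views still pin the old buffer (or there is none yet):
            // allocate exactly sz so a large old buffer is not kept alive
            // longer than its dependents need it.
            mReadBuf.reset(new uint8_t[sz]);
            mCapacity = sz;
        }
        assert(mCapacity >= sz);
        return mReadBuf.get();
    }

    // Simulates a shared_bytes view that keeps the buffer alive.
    std::shared_ptr<uint8_t[]> view() const { return mReadBuf; }
    size_t capacity() const { return mCapacity; }
};
```

Note that `use_count() == 1` is only a reliable uniqueness test when no other thread can be copying the pointer at the same time; the sketch assumes a single reader thread.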
Two changes on the Soroban host ↔ C++ XDR boundary:

1. toCxxBuf encodes directly into the destination std::vector<uint8_t>
   via the new xdr_to_vector helper. Previously the typedef flip from
   xvector<uint8_t> to xdr::shared_bytes left toCxxBuf paying an extra
   shared_bytes allocation + element-by-element copy on every encode,
   doubling the alloc/memcpy cost on the hot host-input encode path
   (LedgerEntry buffers for footprint, auth entries, host function
   arg, source ID, resources). The new path is a single resize + xdr
   encode pass into the vector that ends up in the CxxBuf.

2. The four xdr_from_opaque sites in InvokeHostFunctionOpFrame's
   output decoding (modified ledger entries, contract events,
   diagnostic events, host function return value) now go through
   xdrFromHostBytes, which wraps the rust::Vec<uint8_t> payload in a
   shared_ptr<uint8_t[]> once and decodes through the shared-buffer
   xdr_get. opaque<> fields in the decoded value alias into that
   single wrap buffer rather than each allocating their own. For
   LedgerEntries that include large Wasm/contract code blobs this
   collapses N per-field allocations into 1 wrap allocation.

xdrFromHostBytes lives in TransactionUtils.h alongside toCxxBuf.

Verified: full [bucket] (47 cases, 1.79M assertions) and full
[soroban] (113 cases, 3.54M assertions) suites pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed_ptr alias)

xdrFromHostBytes used to allocate a fresh \c shared_ptr<uint8_t[]> and
memcpy the host-returned \c rust::Vec<uint8_t> payload into it before
handing the decoder a shared backing buffer. Since the host-side
buffer is dropped immediately after decode, the copy is wasted —
the helper now moves \c rust::Vec into a heap-allocated holder and
hands the decoder a \c shared_ptr aliased to the holder's bytes via
the aliasing constructor. The decoded value's \c shared_bytes fields
share the holder. Holder (and the Rust allocation it owns) is freed
when the last refcount drops.

For an N-byte host-returned payload this turns the wrap from
"1 alloc + 1 memcpy of N" into "1 small alloc for the holder + control
block (~48 B)" with no payload copy at all.

Call sites in InvokeHostFunctionOpFrame switch to non-const iteration
and move-in:
- maybePopulateOutputDiagnosticEvents
- recordStorageChanges
- collectEvents
- finalizeSuccess

recordStorageChanges captures the encoded \c entrySize before moving
\c buf.data into the decoder. The size was previously read twice from
buf.data.size(), once for the validation call and once for the metric
counter; without the capture, both reads would happen post-move.

The streaming SHA256 over the InvokeHostFunctionSuccessPreImage
encoding used to run in finalizeSuccess and read the original
host-returned \c out.result_value.data + \c out.contract_events[].data
bytes. Those bytes are now moved into the decoder by
collectEvents/finalizeSuccess, so the hash is hoisted: it now
runs in collectEvents before any moves and the result is stored in
\c mSuccessPreImageHash for finalizeSuccess to consume. Same byte
sequence, same hash, no extra buffer materialized.

Verified: [ledgerclosemeta] suite passes; full \c make check is
otherwise green (the day-seed-flaky binary-fuse-error-retry test was
confirmed pre-existing in an earlier bisect).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
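
The zero-copy wrap described in this commit relies on the aliasing constructor of std::shared_ptr. A minimal sketch, assuming a `std::vector<uint8_t>` in place of `rust::Vec<uint8_t>`; `Holder` and `wrapNoCopy` are hypothetical names, not the PR's xdrFromHostBytes.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Owns the payload; stand-in for a holder of the host-returned buffer.
struct Holder
{
    std::vector<uint8_t> bytes;
};

// Moves the payload into a heap-allocated holder and returns a pointer to
// the payload bytes that shares ownership of the holder.
std::shared_ptr<uint8_t const> wrapNoCopy(std::vector<uint8_t>&& payload)
{
    auto holder = std::make_shared<Holder>(Holder{std::move(payload)});
    uint8_t const* data = holder->bytes.data();
    // Aliasing constructor: points at `data`, keeps `holder` alive.
    return std::shared_ptr<uint8_t const>(holder, data);
}
```

Every copy of the returned pointer bumps the holder's refcount, and the payload is freed when the last copy goes away. The payload bytes themselves are never copied: moving a vector transfers its heap buffer.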


2 participants