perf(vanilla-io_uring): converge onto the epoll twin - zero-alloc rendering, /pipeline skip-decode, crud slab (impl vanilla#84, #85) #967
…ering, /pipeline skip-decode, crud slab)

Port every backend-agnostic optimization from the vanilla-epoll entry so the two share one audited set of response builders and diff cleanly. The io_uring backend supports only a stateless request_handler (no async_handler / TLS, see enghitalo/vanilla#83), so DB access stays on the blocking db.pg client; everything else now matches epoll byte-for-byte (verified: all 17 routes identical against a pristine seeded Postgres).

Implements enghitalo/vanilla#84 (zero-alloc int parse/format) and MDA2AV#85 (crud: 1-query list, byte-rendered GET, fast body parse):

- wi: negative-aware (fixes a latent wrong body for a negative /baseline11 sum)
- emit / emit_int (stack scratch) / emit_xcache: zero-alloc response framing; /baseline11 and /upload no longer allocate an int->string per request
- /pipeline: skip-decode fast path (blit the const before parsing) + decode_into
- render_item_pg: byte-level JSON straight from db.pg text rows, which removes the per-request json.encode reflection on /async-db, /crud list, /crud GET
- crud cache: id-indexed slab (replaces map[int]string) with in-place buffer reuse and cache-aside invalidation, shared across ring workers under RwMutex
- crud_list: single windowed query (count(*) OVER()) instead of page + separate count(*)
- parse_crud_body_fast + borrowed json field parsers (json.decode fallback kept)
- parse_i64_slice / dechunk_into / parse_hex_slice: allocation-free parsing

Static: unlike the epoll twin, this does NOT set static_assets.sendfile_min_bytes. The io_uring backend has no sendfile path (no core.enable_sendfile / queue_file drain), so a low threshold would make static_assets read every large .br/.gz sibling from disk per request (blocking the ring). Keeping the default (256 KiB) preloads all arena siblings (< 256 KiB) and serves them as a zero-copy core.queue_buf borrowed send.

DB profiles remain capped by the blocking db.pg on the single ring worker (enghitalo/vanilla#83). Per-worker reused render scratch is a follow-up now that io_uring supports make_state (enghitalo/vanilla#93 done; entry follow-up enghitalo/vanilla#97).

Verified: both images build; every route byte-identical to vanilla-epoll on a pristine seeded Postgres; X-Cache MISS->HIT->re-MISS-after-PUT holds; /static/vendor.js (67 KB .br) serves ~101k req/s / 6.38 GB/s under wrk (preloaded, no disk-read collapse).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
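For readers unfamiliar with the crud_list change, the single windowed query can be illustrated outside the entry. The sketch below is Python with the standard-library sqlite3 module, not the entry's V code or db.pg; the items table, its columns, and the crud_list_page helper are hypothetical names chosen for illustration. It assumes an SQLite build with window-function support (3.25 or later).

```python
import sqlite3


def crud_list_page(conn, limit, offset):
    """Return (rows, total) from one windowed query (hypothetical helper)."""
    cur = conn.execute(
        "SELECT id, name, count(*) OVER () AS total "
        "FROM items ORDER BY id LIMIT ? OFFSET ?",
        (limit, offset),
    )
    fetched = cur.fetchall()
    # The window spans the whole result set before LIMIT is applied, so
    # every returned row carries the same total. An offset past the end
    # returns no rows and therefore no total; 0 is reported in that case.
    total = fetched[0][2] if fetched else 0
    rows = [(r[0], r[1]) for r in fetched]
    return rows, total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items (id, name) VALUES (?, ?)",
    [(i, f"item-{i}") for i in range(1, 8)],
)
rows, total = crud_list_page(conn, limit=3, offset=3)
print(rows, total)
```

One round trip returns both the page and the count. The trade-off is the empty-page case noted in the comment: a caller that needs the true total for an out-of-range page would still need a fallback query.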
/benchmark -f vanilla-io_uring
…andler) - closes vanilla#97

Now that the io_uring backend supports make_state / stateful_handler (enghitalo/vanilla#94), adopt the same per-worker-state shape as the epoll twin so the two entries are maximally converged: only db.pg-blocking vs pg_async-async (enghitalo/vanilla#83) now separates their handler code.

- Split Shared into SharedRO (process-shared: dataset, prefixes, asv, the shared thread-safe db.pg pool, and the mutex-guarded crud slab + gz cache) and a per-worker WorkerCtx { ro, scratch }.
- Dispatch through stateful_handler + make_state: each ring worker builds ONE WorkerCtx (its own reused render scratch), dropping the per-request []u8 the DB render paths (write_async_db / write_crud_list / write_crud_get MISS / write_fortunes) allocated. This addresses the api-* memory growth seen in the CI run. High-RPS non-DB paths are unchanged (pipeline/baseline/json/static stay zero-alloc).
- Bump the pinned vanilla lib b189036 -> 6fb4244 (includes MDA2AV#94): the old pin REJECTS stateful_handler on io_uring at new_server() and panics on boot.

Verified: image builds; all 17 routes byte-identical to vanilla-epoll on a pristine seeded Postgres (X-Cache MISS->HIT->re-MISS-after-PUT holds); wrk healthy on every path (pipeline 241k, baseline 242k, json 194k, static/vendor.js 98k, crud 237k rps, zero socket errors); the stateful dispatch adds no measurable overhead and static stays fast.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
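The SharedRO / WorkerCtx split described in this commit can be sketched roughly as follows. This is a Python illustration of the shape only, not the entry's V code: the field names beyond ro and scratch, the 4096-byte scratch size, and the render helper are assumptions, and Python cannot show the allocation savings themselves.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SharedRO:
    """Process-shared, read-only data (hypothetical subset)."""
    prefix: bytes


@dataclass
class WorkerCtx:
    """Per-worker state: shared data plus a private, reused scratch."""
    ro: SharedRO
    scratch: bytearray = field(default_factory=lambda: bytearray(4096))


def make_state(ro):
    # Called once per worker, so each worker owns exactly one scratch.
    return WorkerCtx(ro=ro)


def render(ctx, item_id, name):
    """Render into the worker's scratch and return a view of the result."""
    buf = ctx.scratch
    n = 0
    parts = (ctx.ro.prefix, b'{"id":', str(item_id).encode(),
             b',"name":"', name.encode(), b'"}')
    for part in parts:
        if n + len(part) > len(buf):
            raise ValueError("scratch too small")
        buf[n:n + len(part)] = part  # same-length slice write, no resize
        n += len(part)
    # The view aliases the scratch; it is only valid until the next render.
    return memoryview(buf)[:n]


ctx = make_state(SharedRO(prefix=b""))
first = bytes(render(ctx, 1, "a"))
second = bytes(render(ctx, 22, "bb"))
print(first, second)
```

Because the returned view aliases the scratch, the caller has to finish sending it before the same worker renders again. That constraint is what makes a per-worker buffer safe without locking.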
Pushed a follow-up commit (44ca0b6) that fully converges the io_uring entry onto the epoll twin, now that the io_uring backend supports per-worker state (enghitalo/vanilla#94, merged):
Closes enghitalo/vanilla#97. The only remaining handler-code divergence from the epoll twin is blocking … Re-verified: all 17 routes byte-identical to …
/benchmark -f vanilla-io_uring
Benchmark Results
Framework:
Full log
/benchmark -f vanilla-io_uring --save
What & why
vanilla-io_uring was an under-optimized copy of its vanilla-epoll twin: same handlers, but allocating throwaway strings per request, using json.encode / json.decode reflection on the DB paths, and fully parsing the request even for the fixed /pipeline blit. This PR ports every backend-agnostic optimization from the epoll entry so the two share one audited set of response builders and diff cleanly.

The io_uring backend supports only a stateless request_handler (no async_handler / TLS; see enghitalo/vanilla#83), so DB access stays on the blocking db.pg client. Everything that does not require the async runtime now matches epoll byte-for-byte.

Implements enghitalo/vanilla#84 (zero-alloc int parse/format) and enghitalo/vanilla#85 (crud: 1-query list, byte-rendered GET, fast crud-body parse).
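As a rough illustration of the int parse/format idea, the sketch below parses a signed integer from a byte range without slicing out a substring and writes the decimal result right-aligned into a caller-supplied scratch. It is Python rather than the entry's V code; the function bodies are assumptions that only borrow the names parse_i64_slice and emit_int, and the 20-byte scratch is sized for a signed 64-bit value including its sign.

```python
def parse_i64_slice(buf, start, end):
    """Parse a signed decimal integer from buf[start:end] in place."""
    i = start
    neg = False
    if i < end and buf[i] == 0x2D:  # '-'
        neg = True
        i += 1
    if i == end:
        raise ValueError("no digits")
    n = 0
    while i < end:
        d = buf[i] - 0x30
        if not 0 <= d <= 9:
            raise ValueError("not a digit")
        n = n * 10 + d
        i += 1
    return -n if neg else n


def emit_int(scratch, value):
    """Write value's decimal form at the end of scratch; return a view."""
    i = len(scratch)
    # Python ints are unbounded; a fixed-width language has to handle the
    # minimum value separately because negating it overflows.
    n = -value if value < 0 else value
    while True:
        i -= 1
        scratch[i] = 0x30 + n % 10
        n //= 10
        if n == 0:
            break
    if value < 0:
        i -= 1
        scratch[i] = 0x2D  # the sign goes in front of the digits
    return memoryview(scratch)[i:]


query = b"a=-10&b=3"
a = parse_i64_slice(query, 2, 5)  # bytes "-10"
b = parse_i64_slice(query, 8, 9)  # byte "3"
body = bytes(emit_int(bytearray(20), a + b))
print(body)
```

Filling the scratch from the end means the digits come out in the right order without a reversal pass, and the sign is simply the last byte written.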
Changes (all entry-only, no lib change)
- wi is now negative-aware: fixes a latent wrong body for a negative /baseline11 sum (a=-10&b=3 now returns -7).
- emit / emit_int (stack scratch) / emit_xcache: zero-alloc response framing. /baseline11 and /upload no longer allocate an int -> string per request.
- /pipeline skip-decode fast path: blit the constant before any parsing; decode_into (no Result boxing) on the main parse path.
- render_item_pg: byte-level JSON straight from db.pg text rows, removing the per-request json.encode reflection on /async-db, /crud list and /crud GET.
- crud cache: id-indexed slab (replaces map[int]string) with in-place buffer reuse + cache-aside invalidation, shared across ring workers under RwMutex.
- crud_list uses a single windowed query (count(*) OVER()) instead of a page SELECT + a separate count(*).
- parse_crud_body_fast + borrowed JSON field parsers (json.decode fallback kept for escaped bodies).
- parse_i64_slice / dechunk_into / parse_hex_slice: allocation-free query/body parsing.

Static: unlike the epoll twin, this entry does not set static_assets.sendfile_min_bytes. The io_uring backend has no sendfile path (no core.enable_sendfile / queue_file drain), so a low threshold makes static_assets read every large .br/.gz sibling from disk per request (blocking the ring). Keeping the default (256 KiB) preloads all arena siblings (< 256 KiB) and serves them as a zero-copy core.queue_buf borrowed send.

What is intentionally NOT changed
DB profiles (fortunes, async-db, api-*, crud) stay capped by the blocking db.pg on the single ring worker: the io_uring backend has no async runtime to await DB readiness on the ring (enghitalo/vanilla#83). A per-worker reused render scratch (dropping the small per-request DB-render buffers) is a follow-up now that io_uring supports make_state (enghitalo/vanilla#93 landed; entry follow-up enghitalo/vanilla#97).

Validation
- … v -prod -d vanilla_tls).
- pipeline, baseline11 (positive and negative), upload, json, json-comp, async-db, fortunes, static (br negotiation), crud list, crud GET (MISS→HIT), crud create/update, 404, json-tls: all 17 byte-for-byte identical to vanilla-epoll.
- GET MISS → HIT, re-MISS after a PUT (slab invalidation); POST → 201; json-comp → Content-Encoding: gzip; json-tls → 200 over TLS 1.3.
- wrk at 64 conns on /static/vendor.js (→ 67 KB .br) serves ~101k req/s / 6.38 GB/s, zero socket errors (preloaded, bandwidth-bound).

🤖 Generated with Claude Code
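The MISS → HIT → re-MISS sequence checked in the validation can be reproduced with a small model of an id-indexed slab and cache-aside invalidation. This is a Python sketch under stated assumptions, not the entry's V code: CrudSlab, handle_get and handle_put are hypothetical names, a dict stands in for Postgres, a plain threading.Lock stands in for the RwMutex, and the in-place buffer reuse is omitted.

```python
import threading


class CrudSlab:
    """Id-indexed cache: slot i holds the rendered body for id i, or None."""

    def __init__(self, capacity):
        self._slots = [None] * capacity
        self._lock = threading.Lock()  # stand-in for a reader/writer lock

    def get(self, item_id):
        with self._lock:
            if 0 <= item_id < len(self._slots):
                return self._slots[item_id]
            return None

    def put(self, item_id, body):
        with self._lock:
            if 0 <= item_id < len(self._slots):
                self._slots[item_id] = body

    def invalidate(self, item_id):
        with self._lock:
            if 0 <= item_id < len(self._slots):
                self._slots[item_id] = None


def handle_get(slab, db, item_id):
    body = slab.get(item_id)
    if body is not None:
        return "HIT", body
    body = db[item_id]  # stand-in for the database read and render
    slab.put(item_id, body)
    return "MISS", body


def handle_put(slab, db, item_id, body):
    db[item_id] = body
    # Cache-aside: drop the entry and let the next GET repopulate it.
    slab.invalidate(item_id)


db = {1: b'{"id":1,"name":"old"}'}
slab = CrudSlab(capacity=8)
sequence = [handle_get(slab, db, 1)[0], handle_get(slab, db, 1)[0]]
handle_put(slab, db, 1, b'{"id":1,"name":"new"}')
sequence.append(handle_get(slab, db, 1)[0])
print(sequence)
```

Indexing by id replaces a hash lookup with a bounds check and an array read. The sketch simply skips ids outside the slab, which leaves them uncached rather than failing the request.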