perf(vanilla-io_uring): converge onto the epoll twin — zero-alloc rendering, /pipeline skip-decode, crud slab (impl vanilla#84, #85)#967

Open
enghitalo wants to merge 2 commits into MDA2AV:main from enghitalo:perf/vanilla-io_uring-converge

@enghitalo

Contributor

What & why

vanilla-io_uring was an under-optimized copy of its vanilla-epoll twin: same handlers, but allocating throwaway strings per request, using json.encode/json.decode reflection on the DB paths, and fully parsing the request even for the fixed /pipeline blit. This PR ports every backend-agnostic optimization from the epoll entry so the two share one audited set of response builders and diff cleanly.

The io_uring backend supports only a stateless request_handler (no async_handler / TLS — see enghitalo/vanilla#83), so DB access stays on the blocking db.pg client. Everything that does not require the async runtime now matches epoll byte-for-byte.

Implements enghitalo/vanilla#84 (zero-alloc int parse/format) and enghitalo/vanilla#85 (crud: 1-query list, byte-rendered GET, fast crud-body parse).

Supersedes #965 (same work, rebased clean with the static fix folded in — no regression-then-fix history).

Changes (all entry-only, no lib change)

  • wi is now negative-aware — fixes a latent wrong body for a negative /baseline11 sum (a=-10&b=3 now returns -7).
  • emit / emit_int (stack scratch) / emit_xcache — zero-alloc response framing. /baseline11 and /upload no longer allocate an int -> string per request.
  • /pipeline skip-decode fast path — blit the constant before any parsing; decode_into (no Result boxing) on the main parse path.
  • render_item_pg — byte-level JSON straight from db.pg text rows, removing the per-request json.encode reflection on /async-db, /crud list and /crud GET.
  • crud cache is an id-indexed slab (replaces map[int]string) with in-place buffer reuse + cache-aside invalidation, shared across ring workers under RwMutex.
  • crud_list uses a single windowed query (count(*) OVER()) instead of a page SELECT + a separate count(*).
  • parse_crud_body_fast + borrowed JSON field parsers (json.decode fallback kept for escaped bodies).
  • parse_i64_slice / dechunk_into / parse_hex_slice — allocation-free query/body parsing.
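The integer parse/format items in the list above (wi, emit_int, parse_i64_slice) can be sketched roughly as follows. This is a Python illustration of the shape only, not the V code from this PR: the function signatures are assumptions, and Python cannot demonstrate true zero-allocation behaviour or i64 overflow. It shows parsing straight from a byte range and writing a possibly negative sum into a caller-owned scratch buffer.

```python
MINUS = ord("-")
ZERO = ord("0")


def parse_i64_slice(buf, start, end):
    """Parse a decimal integer from buf[start:end] by index, without slicing or decoding."""
    i = start
    negative = i < end and buf[i] == MINUS
    if negative:
        i += 1
    if i >= end:
        raise ValueError("no digits")
    value = 0
    while i < end:
        d = buf[i] - ZERO
        if d < 0 or d > 9:
            raise ValueError("non-digit byte")
        value = value * 10 + d
        i += 1
    return -value if negative else value


def write_int(out, pos, n):
    """Write n as ASCII decimal into out at pos (sign included); return the new position."""
    if n < 0:
        out[pos] = MINUS
        pos += 1
        n = -n
    # Count digits first so they can be written back-to-front in place.
    digits = 1
    t = n
    while t >= 10:
        t //= 10
        digits += 1
    end = pos + digits
    i = end
    while True:
        i -= 1
        out[i] = ZERO + n % 10
        n //= 10
        if n == 0:
            break
    return end


# The negative /baseline11 case from the list above: a=-10&b=3.
query = b"a=-10&b=3"
total = parse_i64_slice(query, 2, 5) + parse_i64_slice(query, 8, 9)
scratch = bytearray(20)  # stands in for the stack scratch
end = write_int(scratch, 0, total)
body = bytes(scratch[:end])
```

The sign is handled before any digit is written, which is the part a non-negative-aware writer would get wrong for a sum such as -7.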

Static: unlike the epoll twin, this entry does not set static_assets.sendfile_min_bytes. The io_uring backend has no sendfile path (no core.enable_sendfile / queue_file drain), so a low threshold makes static_assets read every large .br/.gz sibling from disk per request (blocking the ring). Keeping the default (256 KiB) preloads all arena siblings (< 256 KiB) and serves them as a zero-copy core.queue_buf borrowed send.
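The threshold reasoning above can be sketched as a small decision function. This is a hypothetical model in Python, not the static_assets implementation: the mode names, the Asset type and the 16 KiB "low threshold" are assumptions for illustration. Only the 256 KiB default and the 67 KB vendor.js .br size come from this PR.

```python
from dataclasses import dataclass

DEFAULT_SENDFILE_MIN_BYTES = 256 * 1024  # the 256 KiB default quoted above


@dataclass
class Asset:
    name: str
    size: int  # bytes on disk


def serving_mode(asset, sendfile_min_bytes, has_sendfile):
    """Pick how an asset would be served under the model described above."""
    if asset.size < sendfile_min_bytes:
        # Small enough to preload once and send as a borrowed buffer.
        return "preloaded"
    # At or above the threshold the file is not preloaded.
    return "sendfile" if has_sendfile else "disk-read-per-request"


vendor_br = Asset("vendor.js.br", 67 * 1024)

# io_uring entry, default threshold: the sibling is preloaded.
io_uring_default = serving_mode(vendor_br, DEFAULT_SENDFILE_MIN_BYTES, has_sendfile=False)

# Hypothetical low threshold on a backend without sendfile: every request hits the disk.
io_uring_low = serving_mode(vendor_br, 16 * 1024, has_sendfile=False)
```

Under this model a low threshold only pays off when the backend can hand the file to sendfile, which is why the entry keeps the default.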

What is intentionally NOT changed

DB profiles (fortunes, async-db, api-*, crud) stay capped by the blocking db.pg on the single ring worker — the io_uring backend has no async runtime to await DB readiness on the ring (enghitalo/vanilla#83). A per-worker reused render scratch (dropping the small per-request DB-render buffers) is a follow-up now that io_uring supports make_state (enghitalo/vanilla#93 landed; entry follow-up enghitalo/vanilla#97).

Validation

  • Both images build (v -prod -d vanilla_tls).
  • Ran both containers against a pristine seeded Postgres (fresh DB per framework) and diffed every route: pipeline, baseline11 (positive and negative), upload, json, json-comp, async-db, fortunes, static (br negotiation), crud list, crud GET (MISS→HIT), crud create/update, 404, json-tls. All 17 are byte-for-byte identical to vanilla-epoll.
  • X-Cache verified: GET MISS → HIT, re-MISS after a PUT (slab invalidation); POST → 201; json-comp → Content-Encoding: gzip; json-tls → 200 over TLS 1.3.
  • Static (the path that regressed in #965 before the fix): wrk at 64 conns on /static/vendor.js (→ 67 KB .br) serves ~101k req/s / 6.38 GB/s, zero socket errors (preloaded, bandwidth-bound).
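The X-Cache sequence checked above (MISS, then HIT, then MISS again after a PUT) is what a cache-aside, id-indexed slab would produce. The following Python sketch is an assumption about the general shape, not the entry's code: the class and method names are hypothetical, and a plain threading.Lock stands in for the RwMutex because the Python standard library has no reader-writer lock.

```python
import threading


class CrudSlab:
    """Id-indexed slab: slot i caches the rendered body for item id i."""

    def __init__(self, capacity):
        self._bufs = [bytearray() for _ in range(capacity)]
        self._valid = [False] * capacity
        self._lock = threading.Lock()  # stand-in for an RwMutex

    def get(self, item_id, load):
        """Return (body, x_cache); load(item_id) is called only on a miss."""
        with self._lock:
            in_range = 0 <= item_id < len(self._bufs)
            if in_range and self._valid[item_id]:
                return bytes(self._bufs[item_id]), "HIT"
            body = load(item_id)
            if in_range:
                self._bufs[item_id][:] = body  # overwrite the slot's buffer in place
                self._valid[item_id] = True
            return body, "MISS"

    def invalidate(self, item_id):
        """Cache-aside invalidation: the next GET re-renders from the database."""
        with self._lock:
            if 0 <= item_id < len(self._valid):
                self._valid[item_id] = False


# A dict plays the role of the database for this illustration.
db = {7: b'{"id":7,"name":"old"}'}
slab = CrudSlab(capacity=16)

first = slab.get(7, db.__getitem__)
second = slab.get(7, db.__getitem__)
db[7] = b'{"id":7,"name":"new"}'  # the PUT writes the database...
slab.invalidate(7)                # ...and drops the cached slot
third = slab.get(7, db.__getitem__)
```

Holding the lock across the load keeps the sketch simple and free of stale writes; a real implementation would likely take a read lock on the hit path only.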

🤖 Generated with Claude Code

…ering, /pipeline skip-decode, crud slab)

Port every backend-agnostic optimization from the vanilla-epoll entry so the two share
one audited set of response builders and diff cleanly. The io_uring backend supports
only a stateless request_handler (no async_handler / TLS — enghitalo/vanilla#83), so DB
access stays on the blocking db.pg client; everything else now matches epoll byte-for-byte
(verified: all 17 routes identical against a pristine seeded Postgres).

Implements enghitalo/vanilla#84 (zero-alloc int parse/format) and enghitalo/vanilla#85 (crud: 1-query
list, byte-rendered GET, fast body parse):

- wi: negative-aware (fixes a latent wrong body for a negative /baseline11 sum)
- emit / emit_int (stack scratch) / emit_xcache: zero-alloc response framing; /baseline11
  and /upload no longer allocate an int->string per request
- /pipeline: skip-decode fast path (blit the const before parsing) + decode_into
- render_item_pg: byte-level JSON straight from db.pg text rows — removes the per-request
  json.encode reflection on /async-db, /crud list, /crud GET
- crud cache: id-indexed slab (replaces map[int]string) with in-place buffer reuse and
  cache-aside invalidation, shared across ring workers under RwMutex
- crud_list: single windowed query (count(*) OVER()) instead of page + separate count(*)
- parse_crud_body_fast + borrowed json field parsers (json.decode fallback kept)
- parse_i64_slice / dechunk_into / parse_hex_slice: allocation-free parsing
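The crud_list item above replaces two round trips with one windowed query. A minimal sketch of that pattern, using Python's sqlite3 as a stand-in for Postgres (SQLite 3.25 or newer is needed for window functions); the table name, columns and list_page helper are hypothetical and not taken from the entry:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items (id, name) VALUES (?, ?)",
    [(i, f"item-{i}") for i in range(1, 26)],  # 25 rows
)


def list_page(conn, limit, offset):
    """Fetch one page plus the total row count in a single query."""
    rows = conn.execute(
        # count(*) OVER () is evaluated over the full result set before
        # LIMIT/OFFSET trims it, so every returned row carries the total.
        "SELECT id, name, count(*) OVER () AS total "
        "FROM items ORDER BY id LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()
    total = rows[0][2] if rows else 0
    return [(r[0], r[1]) for r in rows], total


page, total = list_page(conn, limit=10, offset=20)
```

One caveat of this pattern: when the offset is past the last row, no rows come back and the total is not available from the same query.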

Static: unlike the epoll twin, this does NOT set static_assets.sendfile_min_bytes — the
io_uring backend has no sendfile path (no core.enable_sendfile / queue_file drain), so a
low threshold would make static_assets read every large .br/.gz sibling from disk per
request (blocking the ring). Keeping the default (256 KiB) preloads all arena siblings
(< 256 KiB) and serves them as a zero-copy core.queue_buf borrowed send.

DB profiles remain capped by the blocking db.pg on the single ring worker
(enghitalo/vanilla#83). Per-worker reused render scratch is a follow-up now that io_uring
supports make_state (enghitalo/vanilla#93 done; entry follow-up enghitalo/vanilla#97).

Verified: both images build; every route byte-identical to vanilla-epoll on a pristine
seeded Postgres; X-Cache MISS->HIT->re-MISS-after-PUT holds; /static/vendor.js (67 KB .br)
serves ~101k req/s / 6.38 GB/s under wrk (preloaded, no disk-read collapse).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@enghitalo

Contributor Author

/benchmark -f vanilla-io_uring

@github-actions

github-actions Bot commented Jul 4, 2026

Contributor

👋 /benchmark request received. A collaborator will review and approve the run.

…andler) — closes vanilla#97

Now that the io_uring backend supports make_state / stateful_handler (enghitalo/vanilla#94),
adopt the same per-worker-state shape as the epoll twin so the two entries are maximally
converged — only db.pg-blocking vs pg_async-async (enghitalo/vanilla#83) now separates
their handler code.

- Split Shared into SharedRO (process-shared: dataset, prefixes, asv, the shared thread-
  safe db.pg pool, and the mutex-guarded crud slab + gz cache) and a per-worker
  WorkerCtx { ro, scratch }.
- Dispatch through stateful_handler + make_state: each ring worker builds ONE WorkerCtx
  (its own reused render scratch), dropping the per-request []u8 the DB render paths
  (write_async_db / write_crud_list / write_crud_get MISS / write_fortunes) allocated —
  addresses the api-* memory growth seen in the CI run. High-RPS non-DB paths are
  unchanged (pipeline/baseline/json/static stay zero-alloc).
- Bump the pinned vanilla lib b189036 -> 6fb4244 (includes enghitalo/vanilla#94): the old pin REJECTS
  stateful_handler on io_uring at new_server() and panics on boot.
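The per-worker state split described above can be pictured with the following Python sketch. The names SharedRO, WorkerCtx, make_state and stateful_handler come from this commit message, but their fields and signatures here are assumptions, as is the prefix field. Python also does not guarantee that a cleared bytearray keeps its capacity, so only the ownership shape is illustrated: one scratch buffer per worker, reset and refilled on each request.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SharedRO:
    """Process-shared, read-only data (hypothetical single field)."""
    prefix: bytes


@dataclass
class WorkerCtx:
    """Per-worker context: a reference to the shared data plus a private scratch."""
    ro: SharedRO
    scratch: bytearray = field(default_factory=bytearray)


def make_state(ro):
    """Called once per worker; each worker gets its own scratch buffer."""
    return WorkerCtx(ro=ro)


def stateful_handler(ctx, rows):
    """Render rows into the worker's scratch instead of a fresh per-request buffer."""
    out = ctx.scratch
    del out[:]  # reset the length, keep the same buffer object
    out += ctx.ro.prefix
    out += b"["
    for i, (item_id, name) in enumerate(rows):
        if i:
            out += b","
        # Names are assumed to need no JSON escaping in this sketch.
        out += b'{"id":%d,"name":"%s"}' % (item_id, name)
    out += b"]"
    return out


ro = SharedRO(prefix=b"")
ctx = make_state(ro)
first = bytes(stateful_handler(ctx, [(1, b"a")]))
second = stateful_handler(ctx, [(2, b"b"), (3, b"c")])
```

Because the handler returns the worker's own buffer, the response must be sent or copied before the same worker handles its next request.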

Verified: image builds; all 17 routes byte-identical to vanilla-epoll on a pristine
seeded Postgres (X-Cache MISS->HIT->re-MISS-after-PUT holds); wrk healthy on every path
(pipeline 241k, baseline 242k, json 194k, static/vendor.js 98k, crud 237k rps, zero
socket errors) — the stateful dispatch adds no measurable overhead and static stays fast.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@enghitalo

Contributor Author

Pushed a follow-up commit (44ca0b6) that fully converges the io_uring entry onto the epoll twin, now that the io_uring backend supports per-worker state (enghitalo/vanilla#94, merged):

  • Split Shared → SharedRO (process-shared data + mutex-guarded crud slab / gz cache) + per-worker WorkerCtx { ro, scratch }, dispatched via stateful_handler + make_state, the same shape as vanilla-epoll.
  • DB render paths (write_async_db / write_crud_list / write_crud_get MISS / write_fortunes) now render into the worker's reused scratch instead of a per-request []u8 — removes the api-* memory growth from the earlier run. High-RPS non-DB paths unchanged.
  • Bumped the pinned vanilla lib b189036 → 6fb4244 (includes enghitalo/vanilla#94; the old pin rejects stateful_handler on io_uring at new_server() and panics on boot).

Closes enghitalo/vanilla#97. The only remaining handler-code divergence from the epoll twin is blocking db.pg vs async pg_async (enghitalo/vanilla#83).

Re-verified: all 17 routes byte-identical to vanilla-epoll on a pristine seeded Postgres; wrk healthy on every path (pipeline 241k, baseline 242k, json 194k, static/vendor.js 98k, crud 237k rps, zero socket errors).

@enghitalo

Contributor Author

/benchmark -f vanilla-io_uring

@github-actions

github-actions Bot commented Jul 4, 2026

Contributor

👋 /benchmark request received. A collaborator will review and approve the run.

@github-actions

github-actions Bot commented Jul 4, 2026

Contributor

Benchmark Results

Framework: vanilla-io_uring | Test: all tests

| Test | Conn | RPS | CPU | Mem | Δ RPS | Δ Mem |
|---|---|---|---|---|---|---|
| baseline | 512 | 2,689,087 | 5585.5% | 1.5GiB | -17.9% | ~0% |
| baseline | 4096 | 2,438,762 | 5540.9% | 1.6GiB | -33.3% | +6.7% |
| pipelined | 512 | 40,210,889 | 6675.6% | 1012MiB | +12.3% | +2.5% |
| pipelined | 4096 | 43,234,481 | 6490.4% | 1.2GiB | +14.0% | +9.1% |
| limited-conn | 512 | 1,992,794 | 5094.7% | 1.5GiB | -12.9% | ~0% |
| limited-conn | 4096 | 2,033,622 | 5112.3% | 1.6GiB | -6.9% | +6.7% |
| json | 4096 | 2,438,919 | 6332.2% | 1.4GiB | +1.0% | ~0% |
| json-comp | 512 | 2,145,833 | 6031.2% | 1.0GiB | +3.1% | ~0% |
| json-comp | 4096 | 2,881,558 | 6363.6% | 1.4GiB | +1.3% | ~0% |
| json-comp | 16384 | 2,705,063 | 5654.2% | 1.9GiB | +1634.4% | -5.0% |
| json-tls | 4096 | 1,505,969 | 6113.4% | 1.5GiB | +1.6% | ~0% |
| upload | 32 | 2,602 | 1872.8% | 1.4GiB | +0.7% | ~0% |
| upload | 256 | 2,958 | 3407.2% | 1.4GiB | -0.7% | +7.7% |
| api-4 | 256 | 33,497 | 358.4% | 1.8GiB | +17.8% | -10.0% |
| api-16 | 1024 | 14,928 | 1748.8% | 3.7GiB | -47.5% | +76.2% |
| static | 1024 | 1,448,771 | 5436.7% | 1.0GiB | ~0% | +0.4% |
| static | 4096 | 1,354,125 | 5439.4% | 1.2GiB | ~0% | ~0% |
| static | 6800 | 1,237,386 | 5677.7% | 1.2GiB | +0.3% | -14.3% |
| async-db | 1024 | 6,371 | 5784.6% | 2.0GiB | -42.2% | +11.1% |
| crud | 4096 | 196,439 | 891.3% | 1.7GiB | -15.0% | -10.5% |
| fortunes | 1024 | 277 | 5700.3% | 1.3GiB | +648.6% | -23.5% |
Full log
  Pipeline:  1
  Req/conn:  200
  Templates: 20
  Expected:  200
  Duration:  15s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency   113.78ms   24.90ms   52.80ms   165.70ms    5.00s

  541660 requests in 15.00s, 539804 responses
  Throughput: 35.98K req/s
  Bandwidth:  10.90MB/s
  Status codes: 2xx=539804, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 539804 / 539804 responses (100.0%)
  Latency overflow (>5s): 4096
  Reconnects: 477
  Per-template: 26320,27572,28017,28343,27825,27871,27335,27250,27332,26300,27732,27991,27621,26616,28241,28247,21652,26234,25286,26019
  Per-template-ok: 26320,27572,28017,28343,27825,27871,27335,27250,27332,26300,27732,27991,27621,26616,28241,28247,21652,26234,25286,26019
[info] CPU 276.3% | Mem 1.5GiB

[run 2/3]
gcannon v0.5.3
  Target:    localhost:8080/
  Threads:   64
  Conns:     4096 (64/thread)
  Pipeline:  1
  Req/conn:  200
  Templates: 20
  Expected:  200
  Duration:  15s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency   21.38ms   16.30ms   45.70ms   164.30ms   218.90ms

  2865152 requests in 15.00s, 2865152 responses
  Throughput: 190.97K req/s
  Bandwidth:  60.39MB/s
  Status codes: 2xx=2865152, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 2865150 / 2865152 responses (100.0%)
  Reconnects: 12474
  Per-template: 132582,139896,145404,151164,150361,152398,148940,143845,144051,148584,149552,145983,147539,149719,150290,148029,138472,127885,124721,125735
  Per-template-ok: 132582,139896,145404,151164,150361,152398,148940,143845,144051,148584,149552,145983,147539,149719,150290,148029,138472,127885,124721,125735
[info] CPU 840.4% | Mem 1.6GiB

[run 3/3]
gcannon v0.5.3
  Target:    localhost:8080/
  Threads:   64
  Conns:     4096 (64/thread)
  Pipeline:  1
  Req/conn:  200
  Templates: 20
  Expected:  200
  Duration:  15s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency   20.79ms   16.00ms   42.20ms   168.10ms   214.80ms

  2946583 requests in 15.00s, 2946585 responses
  Throughput: 196.40K req/s
  Bandwidth:  62.08MB/s
  Status codes: 2xx=2946585, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 2946581 / 2946585 responses (100.0%)
  Reconnects: 12850
  Per-template: 135353,142208,145699,150725,152702,153893,155148,155855,155849,155941,152748,147485,149423,148836,149585,150761,144031,135610,132313,132416
  Per-template-ok: 135353,142208,145699,150725,152702,153893,155148,155855,155849,155941,152748,147485,149423,148836,149585,150761,144031,135610,132313,132416
[info] CPU 891.3% | Mem 1.7GiB

=== Best: 196439 req/s (CPU: 891.3%, Mem: 1.7GiB) ===
[info] input BW: 16.86MB/s (avg template: 90 bytes)
[info] saved results/crud/4096/vanilla-io_uring.json
httparena-bench-vanilla-io_uring
httparena-bench-vanilla-io_uring

==============================================
=== vanilla-io_uring / fortunes / 1024c (tool=gcannon) ===
==============================================
[info] resetting postgres for a clean per-profile baseline
[info] starting postgres sidecar
httparena-postgres
[info] postgres ready (seeded)
[info] waiting for server...
[info] server ready

[run 1/3]
gcannon v0.5.3
  Target:    localhost:8080/fortunes
  Threads:   64
  Conns:     1024 (16/thread)
  Pipeline:  1
  Req/conn:  unlimited (keep-alive)
  Expected:  200
  Duration:  5s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency    3.29s    3.53s    4.69s    5.00s    5.00s

  1193 requests in 5.00s, 1193 responses
  Throughput: 238 req/s
  Bandwidth:  5.65MB/s
  Status codes: 2xx=1193, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 1193 / 1193 responses (100.0%)
  Latency overflow (>5s): 45
[info] CPU 4299.6% | Mem 1.1GiB

[run 2/3]
gcannon v0.5.3
  Target:    localhost:8080/fortunes
  Threads:   64
  Conns:     1024 (16/thread)
  Pipeline:  1
  Req/conn:  unlimited (keep-alive)
  Expected:  200
  Duration:  5s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency    2.56s    2.65s    4.13s    4.91s    5.00s

  1389 requests in 5.00s, 1389 responses
  Throughput: 277 req/s
  Bandwidth:  6.58MB/s
  Status codes: 2xx=1389, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 1389 / 1389 responses (100.0%)
  Latency overflow (>5s): 13
[info] CPU 5700.3% | Mem 1.3GiB

[run 3/3]
gcannon v0.5.3
  Target:    localhost:8080/fortunes
  Threads:   64
  Conns:     1024 (16/thread)
  Pipeline:  1
  Req/conn:  unlimited (keep-alive)
  Expected:  200
  Duration:  5s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency    2.60s    2.74s    4.32s    5.00s    5.00s

  1381 requests in 5.00s, 1381 responses
  Throughput: 276 req/s
  Bandwidth:  6.54MB/s
  Status codes: 2xx=1381, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 1381 / 1381 responses (100.0%)
  Latency overflow (>5s): 20
[info] CPU 5519.7% | Mem 1.5GiB

=== Best: 277 req/s (CPU: 5700.3%, Mem: 1.3GiB) ===
[info] saved results/fortunes/1024/vanilla-io_uring.json
httparena-bench-vanilla-io_uring
httparena-bench-vanilla-io_uring
[info] skip: vanilla-io_uring does not subscribe to baseline-h2
[info] skip: vanilla-io_uring does not subscribe to static-h2
[info] skip: vanilla-io_uring does not subscribe to baseline-h2c
[info] skip: vanilla-io_uring does not subscribe to json-h2c
[info] skip: vanilla-io_uring does not subscribe to baseline-h3
[info] skip: vanilla-io_uring does not subscribe to static-h3
[info] skip: vanilla-io_uring does not subscribe to gateway-64
[info] skip: vanilla-io_uring does not subscribe to gateway-h3
[info] skip: vanilla-io_uring does not subscribe to production-stack
[info] skip: vanilla-io_uring does not subscribe to unary-grpc
[info] skip: vanilla-io_uring does not subscribe to unary-grpc-tls
[info] skip: vanilla-io_uring does not subscribe to stream-grpc
[info] skip: vanilla-io_uring does not subscribe to stream-grpc-tls
[info] skip: vanilla-io_uring does not subscribe to echo-ws
[info] skip: vanilla-io_uring does not subscribe to echo-ws-pipeline
[info] rebuilding site/data/*.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/frameworks.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/api-16-1024.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/api-4-256.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/async-db-1024.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/baseline-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/baseline-512.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/crud-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/fortunes-1024.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/json-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/json-comp-16384.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/json-comp-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/json-comp-512.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/json-tls-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/limited-conn-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/limited-conn-512.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/pipelined-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/pipelined-512.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/static-1024.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/static-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/static-6800.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/upload-256.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/upload-32.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/current.json
[info] done
httparena-postgres
httparena-redis
[info] restoring loopback MTU to 65536

@enghitalo

Contributor Author

/benchmark -f vanilla-io_uring --save

@github-actions

github-actions Bot commented Jul 4, 2026

Contributor

👋 /benchmark request received. A collaborator will review and approve the run.
