Summary
Kyle Kingsbury reported a serializability violation against Postgres 18.4 in the Jepsen SQL append workload:
https://www.postgresql.org/message-id/165342c0-0c75-461e-b334-b997639ad48d@aphyr.com
I could reproduce the anomaly locally only by rebuilding Kyle's setup in the full Jepsen-managed shape, where Jepsen itself sets up Postgres on the node. The shortcut path using --existing-postgres against a local Postgres instance did not reproduce it.
Reproduced
Local artifacts are on Max's Linux box under:
/home/tars/clawd-max/repro-g2-item
Main repro artifacts:
/home/tars/clawd-max/repro-g2-item/NOTES.md
/home/tars/clawd-max/repro-g2-item/jepsen-managed-10.log
/home/tars/clawd-max/repro-g2-item/postgres-jepsen-6c2bcc3/postgres/store/postgres append S (S) /20260530T164654.367Z
Result from the reproducing Jepsen store:
:workload {:valid? false, :anomaly-types (:G2-item)}
Stats:
2725 transactions
2015 ok
710 failed
Environment that reproduced
- Jepsen repo: jepsen-io/postgres
- Jepsen commit: 6c2bcc3f43085d3b0f21a5d78ba2b0e0e559ea8f
- Node: Docker container jepsen-g2-n1, hostname n1
- Docker image: built from /home/tars/clawd-max/repro-g2-item/Dockerfile.n1
- Postgres package: 18.4-1.pgdg24.04+1
- Jepsen-managed Postgres config inside container:
  - listen_addresses = '*'
  - log_statement = all
  - pg_hba.conf trust auth, Jepsen throwaway-node style
- Postgres was not published on a host port.
Command that reproduced, with container IP from that run:
cd /home/tars/clawd-max/repro-g2-item/postgres-jepsen-6c2bcc3/postgres
PATH="$HOME/.local/bin:$PATH" lein run test-all \
-n 172.17.0.15 \
--ssh-private-key /home/tars/clawd-max/repro-g2-item/jepsen_n1_key \
-w append \
--time-limit 30 \
--concurrency 10 \
--log-sql \
--isolation serializable \
--no-savepoints \
--max-writes-per-key 4 \
--leave-db-running \
--key-types primary \
--upsert-types on-conflict \
--test-count 10 \
--nemesis none
The first run in that batch reproduced G2-item, so I stopped the remaining batch early.
Example anomaly
From:
/home/tars/clawd-max/repro-g2-item/postgres-jepsen-6c2bcc3/postgres/store/postgres append S (S) /20260530T164654.367Z/elle/G2-item.txt
G2-item #1
Let:
T1 = {:index 190, :time 2263380509, :type :ok, :process 7, :f :txn, :value [[:r 35 nil] [:append 33 3] [:r 33 [1 2 3]]]}
T2 = {:index 184, :time 2246090662, :type :ok, :process 5, :f :txn, :value [[:r 33 [1]] [:append 35 1] [:append 22 2] [:append 22 3]]}
T3 = {:index 176, :time 2170160437, :type :ok, :process 1, :f :txn, :value [[:append 33 2]]}
Then:
- T1 < T2, because T1 observed the initial (nil) state of 35, which T2 created by appending 1.
- T2 < T3, because T2 did not observe T3's append of 2 to 33.
- However, T3 < T1, because T1 appended 3 after T3 appended 2 to 33: a contradiction!
The reproduced store contains three G2-item cycles.
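The cycle above can be re-derived by hand from the three transactions. The following is a minimal sketch, not Elle's actual algorithm: it assumes the version order of a key is the longest list any read observed (or the single append, when only one append to the key is known), and it only looks at the three transactions transcribed from G2-item.txt. The function names are illustrative.

```python
from collections import defaultdict

# Micro-ops transcribed from elle/G2-item.txt: (function, key, value).
txns = {
    "T1": [("r", 35, None), ("append", 33, 3), ("r", 33, [1, 2, 3])],
    "T2": [("r", 33, [1]), ("append", 35, 1), ("append", 22, 2), ("append", 22, 3)],
    "T3": [("append", 33, 2)],
}

def version_orders(txns):
    """Assumed version order per key: longest observed read, else a lone append."""
    longest, appends = defaultdict(list), defaultdict(list)
    for ops in txns.values():
        for f, k, v in ops:
            if f == "r" and v is not None and len(v) > len(longest[k]):
                longest[k] = list(v)
            elif f == "append":
                appends[k].append(v)
    orders = {}
    for k, vals in appends.items():
        if longest[k]:
            orders[k] = longest[k]
        elif len(vals) == 1:
            orders[k] = vals
    return orders

def edges(txns):
    """Direct ww, wr, and rw dependencies between distinct transactions."""
    orders = version_orders(txns)
    writer = {(k, v): name
              for name, ops in txns.items()
              for f, k, v in ops if f == "append"}
    out = set()
    for k, order in orders.items():
        # ww: consecutive versions of a key installed by different transactions
        for a, b in zip(order, order[1:]):
            wa, wb = writer.get((k, a)), writer.get((k, b))
            if wa and wb and wa != wb:
                out.add((wa, wb, "ww"))
    for name, ops in txns.items():
        for f, k, v in ops:
            if f != "r" or k not in orders:
                continue
            seen, order = v or [], orders[k]
            if seen:
                # wr: the reader follows whoever wrote the last element it saw
                w = writer.get((k, seen[-1]))
                if w and w != name:
                    out.add((w, name, "wr"))
            if len(seen) < len(order):
                # rw: the reader precedes whoever installed the next version
                w = writer.get((k, order[len(seen)]))
                if w and w != name:
                    out.add((name, w, "rw"))
    return out

def find_cycle(edge_set):
    """Return one cycle as a list of transaction names, or None."""
    graph = defaultdict(list)
    for a, b, _ in edge_set:
        graph[a].append(b)

    def dfs(node, path):
        for nxt in graph.get(node, []):
            if nxt == path[0]:
                return path + [nxt]
            if nxt not in path:
                found = dfs(nxt, path + [nxt])
                if found:
                    return found
        return None

    for start in sorted(graph):
        cycle = dfs(start, [start])
        if cycle:
            return cycle
    return None

dependency_edges = edges(txns)
cycle = find_cycle(dependency_edges)
print(sorted(dependency_edges))
print(cycle)  # ['T1', 'T2', 'T3', 'T1']
```

Two of the three edges are rw anti-dependencies (T1 to T2 on key 35, T2 to T3 on key 33), and the cycle is closed by a ww edge from T3 to T1 on key 33, which matches the G2-item classification.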
What did not reproduce
The local shortcut path did not reproduce:
- extracted Postgres 18.4 Debian packages into a local temp tree
- started local Postgres on port 55419
- ran the same Jepsen commit and workload with --existing-postgres
- 10 x 30-second runs: no anomaly
Log:
/home/tars/clawd-max/repro-g2-item/jepsen-10-6c2bcc3.log
A hand-written two-transaction schedule based on the obvious cycle also did not reproduce. Postgres correctly aborted one of the two transactions with an SSI serialization failure, on both local 18.3 and extracted 18.4.
Script:
/home/tars/clawd-max/repro-g2-item/minimal_g2.py
Current state
This is a confirmed Jepsen-managed repro, but not yet a TAP-ready deterministic test. The next step is to reduce the reproducing Jepsen history into a smaller deterministic schedule, or identify the managed-node setup difference that makes --existing-postgres miss the bug.
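One possible first step toward a smaller schedule is to keep only the transactions that touch the keys involved in a cycle. The sketch below is an assumption about how that reduction could start, not an existing tool: it works on the three completed transactions transcribed from G2-item.txt, and the helper name reduce_history is hypothetical. A real reduction would load the full history from the Jepsen store, where the same filter would also pick up transactions missing from the excerpt, such as the one that appended 1 to key 33.

```python
# Completed transactions transcribed from elle/G2-item.txt.
history = [
    {"index": 190, "process": 7,
     "value": [("r", 35, None), ("append", 33, 3), ("r", 33, [1, 2, 3])]},
    {"index": 184, "process": 5,
     "value": [("r", 33, [1]), ("append", 35, 1), ("append", 22, 2), ("append", 22, 3)]},
    {"index": 176, "process": 1,
     "value": [("append", 33, 2)]},
]

def reduce_history(history, keys):
    """Keep ops that read or append any key in `keys`, sorted by :index."""
    wanted = set(keys)
    kept = [op for op in history
            if any(k in wanted for _, k, _ in op["value"])]
    return sorted(kept, key=lambda op: op["index"])

# Keys 33 and 35 carry the edges of the example cycle.
schedule = reduce_history(history, {33, 35})
print([(op["index"], op["process"]) for op in schedule])
# [(176, 1), (184, 5), (190, 7)]
```

The :index values here belong to the :ok completions, so the resulting order reflects when each transaction completed rather than when it started. A deterministic schedule would also need the invocation order from the full history.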
The Docker container jepsen-g2-n1 was intentionally left running for inspection after reproduction.
If no longer needed, remove it with:
docker rm -f jepsen-g2-n1
Hypothesis / area to inspect
The anomaly occurs under serializable transactions using the SQL append workload with insert ... on conflict ... do update. The failed two-transaction minimization suggests the bug may need surrounding traffic, table/key distribution, timing, or a Jepsen-managed configuration/setup difference to expose the missing SSI dependency.