Serializable G2-item anomaly in Jepsen append workload with INSERT ON CONFLICT #41

@NikolayS

Summary

Kyle Kingsbury reported a serializability violation against Postgres 18.4 in the Jepsen SQL append workload:

https://www.postgresql.org/message-id/165342c0-0c75-461e-b334-b997639ad48d@aphyr.com

I reproduced the anomaly locally, but only after rebuilding Kyle's setup in full, with Jepsen managing the Postgres node itself. The shortcut of pointing Jepsen at a local Postgres instance with --existing-postgres did not reproduce it.

Reproduced

Local artifacts are on Max's Linux box under:

/home/tars/clawd-max/repro-g2-item

Main repro artifacts:

/home/tars/clawd-max/repro-g2-item/NOTES.md
/home/tars/clawd-max/repro-g2-item/jepsen-managed-10.log
/home/tars/clawd-max/repro-g2-item/postgres-jepsen-6c2bcc3/postgres/store/postgres append S (S) /20260530T164654.367Z

Result from the reproducing Jepsen store:

:workload {:valid? false, :anomaly-types (:G2-item)}

Stats:

2725 transactions
2015 ok
710 failed

Environment that reproduced

  • Jepsen repo: jepsen-io/postgres
  • Jepsen commit: 6c2bcc3f43085d3b0f21a5d78ba2b0e0e559ea8f
  • Node: Docker container jepsen-g2-n1, hostname n1
  • Docker image: built from /home/tars/clawd-max/repro-g2-item/Dockerfile.n1
  • Postgres package: 18.4-1.pgdg24.04+1
  • Jepsen-managed Postgres config inside container:
    • listen_addresses = '*'
    • log_statement = all
    • pg_hba.conf trust auth, Jepsen throwaway-node style
  • Postgres was not published on a host port.

Command that reproduced, with container IP from that run:

cd /home/tars/clawd-max/repro-g2-item/postgres-jepsen-6c2bcc3/postgres
PATH="$HOME/.local/bin:$PATH" lein run test-all \
  -n 172.17.0.15 \
  --ssh-private-key /home/tars/clawd-max/repro-g2-item/jepsen_n1_key \
  -w append \
  --time-limit 30 \
  --concurrency 10 \
  --log-sql \
  --isolation serializable \
  --no-savepoints \
  --max-writes-per-key 4 \
  --leave-db-running \
  --key-types primary \
  --upsert-types on-conflict \
  --test-count 10 \
  --nemesis none

The first run in that batch reproduced G2-item, so I stopped the remaining batch early.

Example anomaly

From:

/home/tars/clawd-max/repro-g2-item/postgres-jepsen-6c2bcc3/postgres/store/postgres append S (S) /20260530T164654.367Z/elle/G2-item.txt

G2-item #1
Let:
  T1 = {:index 190, :time 2263380509, :type :ok, :process 7, :f :txn, :value [[:r 35 nil] [:append 33 3] [:r 33 [1 2 3]]]}
  T2 = {:index 184, :time 2246090662, :type :ok, :process 5, :f :txn, :value [[:r 33 [1]] [:append 35 1] [:append 22 2] [:append 22 3]]}
  T3 = {:index 176, :time 2170160437, :type :ok, :process 1, :f :txn, :value [[:append 33 2]]}

Then:
  - T1 < T2, because T1 observed the initial (nil) state of 35, which T2 created by appending 1.
  - T2 < T3, because T2 did not observe T3's append of 2 to 33.
  - However, T3 < T1, because T1 appended 3 after T3 appended 2 to 33: a contradiction!

The reproduced store contains three G2-item cycles.
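
To make the contradiction in G2-item #1 concrete, here is a minimal sketch that encodes the three orderings from the Elle explanation as directed edges and searches for a cycle. The rw/ww labels are my reading of the explanation, not something copied from the Elle output, and find_cycle is a hypothetical helper, not part of Jepsen or Elle.

```python
# Dependency edges taken from the Elle explanation of G2-item #1:
# (earlier txn, later txn, reason). The rw/ww labels are an interpretation.
EDGES = [
    ("T1", "T2", "rw: T1 read key 35 as nil; T2 appended 1 to 35"),
    ("T2", "T3", "rw: T2 read key 33 as [1]; T3 appended 2 to 33"),
    ("T3", "T1", "ww: T3 appended 2 to 33 before T1 appended 3"),
]


def find_cycle(edges):
    """Return one cycle as a list of nodes (start node repeated at the end), or None."""
    graph = {}
    for src, dst, _reason in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    visiting = []   # nodes on the current DFS path, in order
    done = set()    # nodes whose subtrees contain no cycle

    def dfs(node):
        if node in visiting:
            # Back edge: the cycle is the path from the first visit of `node`.
            return visiting[visiting.index(node):] + [node]
        if node in done:
            return None
        visiting.append(node)
        for nxt in graph[node]:
            cycle = dfs(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in graph:
        cycle = dfs(start)
        if cycle:
            return cycle
    return None


cycle = find_cycle(EDGES)
print(" < ".join(cycle) if cycle else "no cycle")
```

Dropping any one of the three edges leaves no cycle, which is why all three transactions are needed to show the violation.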

What did not reproduce

The local shortcut path did not reproduce:

  • extracted Postgres 18.4 Debian packages into a local temp tree
  • started local Postgres on port 55419
  • ran the same Jepsen commit and workload with --existing-postgres
  • 10 x 30-second runs: no anomaly

Log:

/home/tars/clawd-max/repro-g2-item/jepsen-10-6c2bcc3.log

A hand-written two-transaction schedule based on the obvious cycle did not reproduce either: Postgres correctly aborted one transaction with an SSI serialization failure on both the local 18.3 and the extracted 18.4.

Script:

/home/tars/clawd-max/repro-g2-item/minimal_g2.py

Current state

This is a confirmed Jepsen-managed repro, but not yet a TAP-ready deterministic test. The next step is to reduce the reproducing Jepsen history into a smaller deterministic schedule, or identify the managed-node setup difference that makes --existing-postgres miss the bug.
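
As a starting point for that reduction, the sketch below selects the transactions that touch the keys on the cycle. It only covers the three transactions quoted in G2-item #1; loading the full history from the store is not shown, and both the tuple representation and the txns_touching helper are assumptions made for illustration.

```python
# Micro-ops copied from the three transactions in G2-item #1, keyed by Elle's :index.
# Each micro-op is (function, key, value); nil is represented as None.
HISTORY = {
    190: [("r", 35, None), ("append", 33, 3), ("r", 33, [1, 2, 3])],                 # T1
    184: [("r", 33, [1]), ("append", 35, 1), ("append", 22, 2), ("append", 22, 3)],  # T2
    176: [("append", 33, 2)],                                                        # T3
}


def txns_touching(history, keys):
    """Return the sorted indices of transactions that read or append any key in `keys`."""
    keys = set(keys)
    return sorted(
        index
        for index, micro_ops in history.items()
        if any(key in keys for _f, key, _value in micro_ops)
    )


print(txns_touching(HISTORY, {33, 35}))  # keys that appear on the cycle
print(txns_touching(HISTORY, {22}))      # key touched only by T2
```

Keys 33 and 35 are the ones on the cycle; T2's appends to key 22 are a candidate to drop when shrinking the schedule, assuming they do not matter for the missing dependency.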

The Docker container was intentionally left running for inspection after reproduction:

jepsen-g2-n1

If no longer needed, remove it with:

docker rm -f jepsen-g2-n1

Hypothesis / area to inspect

The anomaly occurs under serializable transactions using the SQL append workload with insert ... on conflict ... do update. The failed two-transaction minimization suggests the bug may need surrounding traffic, table/key distribution, timing, or a Jepsen-managed configuration/setup difference to expose the missing SSI dependency.
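
For anyone hand-writing a schedule, the sketch below renders one transaction's micro-ops as the kind of statements described above. The table name txn0, the id and val columns, and the comma-separated text encoding are assumptions for illustration and are not taken from the Jepsen source; the SQL actually issued is in the --log-sql output and the Postgres statement log from the reproducing run.

```python
# ASSUMPTION: table and column names are hypothetical; check the logged SQL
# from the reproducing run for the statements the workload really issues.
TABLE = "txn0"


def micro_op_sql(op):
    """Render one (function, key, value) micro-op as a (sql, params) pair."""
    f, key, value = op
    if f == "r":
        return (f"SELECT val FROM {TABLE} WHERE id = %s", (key,))
    if f == "append":
        return (
            f"INSERT INTO {TABLE} (id, val) VALUES (%s, %s) "
            f"ON CONFLICT (id) DO UPDATE SET val = {TABLE}.val || ',' || EXCLUDED.val",
            (key, str(value)),
        )
    raise ValueError(f"unknown micro-op: {f!r}")


def txn_sql(micro_ops):
    """Wrap the rendered micro-ops in a serializable transaction."""
    stmts = [("BEGIN ISOLATION LEVEL SERIALIZABLE", ())]
    stmts += [micro_op_sql(op) for op in micro_ops]
    stmts.append(("COMMIT", ()))
    return stmts


# T2 from G2-item #1: read 33, then append to 35 and twice to 22.
T2_OPS = [("r", 33, [1]), ("append", 35, 1), ("append", 22, 2), ("append", 22, 3)]

for sql, params in txn_sql(T2_OPS):
    print(sql, params)
```

The (sql, params) pairs use %s placeholders so they can be passed to a DB-API driver such as psycopg; no connection is opened here.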
