Generate realistic, linked, schema-valid Gen3 metadata from a bundled Gen3
JSON schema. Point it at a Gen3 data dictionary and it produces one JSON file
per node (plus a DataImportOrder.txt), with every foreign key resolving to a
real parent record — then self-validates the result with
gen3-validator.
Standing up or testing a Gen3 commons needs example data that conforms to your dictionary and links together correctly. Hand-authoring it is tedious and error-prone. This tool reads the dictionary, works out the node dependency order, and fills every node with simulated records that pass validation.
Requires Python ≥ 3.12.10 (a constraint inherited from gen3schemadev).

    poetry install
    poetry run gen3-metadata-simulator generate \
      --schema examples/jsonschema/acdc_schema_v1.1.5.json \
      --output-dir ./output \
      --num-records 30 \
      --project-code AusDiab_Simulated \
      --seed 1

This writes `./output/<node>.json` for every node, plus `DataImportOrder.txt`, and prints `0 validation errors` on success. Re-running with the same `--seed` reproduces byte-identical output. If validation fails, nothing is written.
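One way to check the byte-identical claim yourself is to hash every file from two runs and compare the results. The sketch below is not part of the tool; `digest_tree` and `same_output` are hypothetical helper names, and it assumes you ran `generate` twice with the same `--seed` into two different `--output-dir` locations.

```python
import hashlib
from pathlib import Path


def digest_tree(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def same_output(run_a: Path, run_b: Path) -> bool:
    """True when both directories hold the same files with identical bytes."""
    return digest_tree(run_a) == digest_tree(run_b)
```

For example, `same_output(Path("./run1"), Path("./run2"))` should return `True` for two runs that shared a seed, where `./run1` and `./run2` are placeholder directory names.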
| Flag | Default | Description |
|---|---|---|
| `--schema`, `-s` | (required) | Path to the bundled Gen3 JSON schema. |
| `--output-dir`, `-o` | `./output` | Where to write the metadata files. |
| `--num-records`, `-n` | `30` | Records per node. |
| `--project-code`, `-p` | `simulated_project` | Project code children link to. |
| `--seed` | (none) | RNG seed for reproducible output. |
| `--array-size` | `0` | Elements per array property (`0` → `[]`). |
| `--skip-validation` | off | Write without self-validating first. |
Run `poetry run gen3-metadata-simulator generate --help` for the full list, or see docs/usage.md.

    poetry run gen3-metadata-simulator validate \
      --schema examples/jsonschema/acdc_schema_v1.1.5.json \
      --metadata-dir ./output

The output directory contains:

- `project.json`: a single JSON object identified by `code`.
- `<node>.json`: a JSON array of N records, each with `type`, a unique `submitter_id`, foreign-key objects (`{"submitter_id": ...}`, or `{"code": ...}` for links to the project), and schema-conforming property values.
- `DataImportOrder.txt`: node names in dependency order, one per line, ready to drive a sequential Gen3 submission.
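To make the record shape concrete, here is a minimal sketch of a check that every foreign key points at an existing record. It is an illustration, not the tool's own validation. The link field names (`projects`, `subjects`) and the helper `unresolved_links` are hypothetical, and it assumes each link is a single object holding either `submitter_id` or `code`. It only confirms that the target exists somewhere, not that it belongs to the expected parent node.

```python
def unresolved_links(records_by_node: dict[str, list[dict]]) -> list[str]:
    """Return a message for every foreign key that matches no known record."""
    id_keys = ("submitter_id", "code")

    # Collect every identifier a link could point at.
    known_ids = set()
    for records in records_by_node.values():
        for record in records:
            for key in id_keys:
                if key in record:
                    known_ids.add((key, record[key]))

    problems = []
    for node, records in records_by_node.items():
        for record in records:
            for field, value in record.items():
                # Assumption: a link is a dict carrying one of the id keys.
                if not isinstance(value, dict):
                    continue
                for key in id_keys:
                    if key in value and (key, value[key]) not in known_ids:
                        problems.append(f"{node}.{field} -> {value[key]!r} not found")
    return problems


# Hypothetical records in the shape described above.
records = {
    "project": [{"type": "project", "code": "AusDiab_Simulated"}],
    "subject": [
        {
            "type": "subject",
            "submitter_id": "subject_1",
            "projects": {"code": "AusDiab_Simulated"},
        }
    ],
    "sample": [
        {
            "type": "sample",
            "submitter_id": "sample_1",
            "subjects": {"submitter_id": "subject_1"},
        }
    ],
}

print(unresolved_links(records))
```

An empty list means every link resolved. A dangling reference shows up as a message naming the node, the link field, and the missing identifier.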
1. Resolve the schema (`gen3-validator` inlines every `$ref`).
2. Order nodes topologically so parents are generated before children.
3. Generate records per node, wiring links to real parents.
4. Validate the whole set with `gen3_validator.validate_list_dict` and refuse to write anything that fails.
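The ordering step can be sketched with Python's standard-library `graphlib`. This is an illustration of topological ordering in general, not necessarily how the tool implements it. The node names and the `parents` mapping are hypothetical, and `import_order` is a made-up helper name.

```python
from graphlib import TopologicalSorter

# Hypothetical dictionary fragment: each node maps to the parents it links to.
parents = {
    "project": set(),
    "subject": {"project"},
    "sample": {"subject"},
    "assay": {"sample"},
}


def import_order(parents_by_node: dict[str, set[str]]) -> list[str]:
    """Return node names so that every parent appears before its children."""
    # TopologicalSorter treats the mapped values as predecessors, so
    # static_order() yields parents first. It raises CycleError on a cycle.
    return list(TopologicalSorter(parents_by_node).static_order())


order = import_order(parents)
print("\n".join(order))  # one node per line, like DataImportOrder.txt
```

Generating nodes in this order means that by the time a child record needs a foreign key, its parent records already exist to link to.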
See docs/dev-notes.md for a full walkthrough of how it
works and docs/usage.md for every flag.
By default (--provider random) values are random within schema bounds. The
LLM provider instead asks a lightweight model for the semantic properties
of each field and samples from them, so output looks believable while still
validating:
- numeric: a distribution (mean ± stddev) and realistic limits, so `month_birth` stays in `[1, 12]` and `bmi_baseline` lands near 27 ± 5;
- dates: a real calendar date in a plausible window (no `3170-94-14`), rendered to the schema's pattern;
- free text: domain-appropriate strings (an assay `description` reads like a real one) drawn from an LLM-supplied pool.
Works with Anthropic or OpenAI models. Enums, booleans, and
pattern-constrained strings (UBERON / ORCID / md5sum) keep the random/regex
behavior. Specs are cached to .cache/distributions.json, so repeat runs make
no API calls and a fixed --seed is reproducible.
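As a rough sketch of what sampling from such a spec could look like, the snippet below draws a numeric value from a normal distribution and clamps it to limits, and picks a date from a window. The spec keys (`mean`, `stddev`, `min`, `max`, `start`, `end`, `format`) and both function names are assumptions made for illustration; the tool's cached spec format and sampling strategy may differ.

```python
import random
from datetime import date, timedelta


def sample_numeric(spec: dict, rng: random.Random) -> float:
    """Draw from a normal distribution, then clamp to the spec's limits."""
    value = rng.gauss(spec["mean"], spec["stddev"])
    return min(max(value, spec["min"]), spec["max"])


def sample_date(spec: dict, rng: random.Random) -> str:
    """Pick a real calendar day inside the window and render it as text."""
    start = date.fromisoformat(spec["start"])
    end = date.fromisoformat(spec["end"])
    offset = rng.randint(0, (end - start).days)
    return (start + timedelta(days=offset)).strftime(spec["format"])


# Hypothetical specs of the kind an LLM might return for two fields.
bmi_spec = {"mean": 27.0, "stddev": 5.0, "min": 12.0, "max": 60.0}
visit_spec = {"start": "2000-01-01", "end": "2012-12-31", "format": "%Y-%m-%d"}

rng = random.Random(1)  # a fixed seed makes the draws repeatable
bmi_values = [sample_numeric(bmi_spec, rng) for _ in range(5)]
visit_dates = [sample_date(visit_spec, rng) for _ in range(5)]
```

Clamping is the simplest way to honor limits, though it piles a little extra probability onto the boundary values; resampling out-of-range draws is an alternative. Because dates are built from a start day plus an offset, an impossible value such as month 94 cannot occur.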
Copy the example env file and fill in three values: the vendor, the model, and a path to a file holding your API key. The key itself never goes in `.env` or the repo.

    cp .env.example .env
    # edit .env:
    #   LLM_PROVIDER=anthropic        # or: openai
    #   LLM_MODEL=claude-haiku-4-5    # or e.g. gpt-4o-mini
    #   LLM_API_KEY_FILE=/path/to/your/api_key

`.env` is gitignored. Then just select the LLM strategy; provider and model come from `.env`:

    poetry run gen3-metadata-simulator generate \
      --schema examples/jsonschema/acdc_schema_v1.1.5.json \
      --provider llm --num-records 5 --seed 1

Override per run with `--llm-provider anthropic|openai` and `--llm-model <id>`.
See docs/usage.md for all flags and
docs/dev-notes.md for the design and the pluggable
ValueProvider / SpecSource interfaces.
- docs/dev-notes.md: start here. A ground-up, junior-dev-friendly walkthrough of how it all works: the pipeline, the value providers, a worked example, design decisions, a module map, and how to extend it.
- docs/usage.md: every CLI flag for `generate` and `validate`, with examples.
Run the test suite (fully offline):

    poetry run python3 -m pytest

The example dictionary in `examples/jsonschema/` is the test fixture. The key tests are the round-trips (`tests/test_roundtrip.py`, `tests/test_roundtrip_llm.py`): generate → validate → assert zero errors. New to the codebase? Read docs/dev-notes.md first.