Multi-state campaign finance data pipeline: download state ethics commission extracts, validate and normalize into a unified cross-state schema, load into PostgreSQL, and resolve duplicate entities for analytics.
States in progress: Texas (most complete), Oklahoma, Ohio. Stack: Python 3.12, uv, SQLModel, Pydantic v2, Polars, Splink, PostgreSQL (prod) / SQLite (dev).
- Acquire — Selenium scrapers download CSV bundles from state portals
(
app/states/,app/workflows/). - Prepare — The
cfCLI converts CSV to parquet, verifies required record types, and chains download → convert → verify. - Ingest — Schema-driven
GenericFileReaderplus per-state SQLModel validators (app/ingest/,app/abcs/). - Unify —
app/core/maps state fields to shared tables (transactions, contributions, entities, committees, persons, addresses). - Load —
scripts/loaders/production_loader.pystreams batches into the database with deduplication caches. - Resolve —
app/resolve/standardizes names/addresses, blocks candidate pairs, scores with Splink, clusters merges, and publishes canonical entities plus crosswalks for downstream analytics.
| Document | Purpose |
|---|---|
| AGENTS.md | Agent instructions, commands, guardrails |
| docs/ARCHITECTURE.md | Narrative system design |
| docs/ARCHITECTURE-DIAGRAM.md | Mermaid pipeline + unified ERD |
| docs/DATA_RELATIONSHIPS.md | Full ERD (unified → canonical → views) |
| docs/RUNBOOK.md | Operations and troubleshooting |
| docs/adr/ | Architecture decision records |
Install dependencies with uv:
uv syncuv sync installs the cf console script and project package. probablepeople
pulls in doublemetaphone; the lockfile pins 1.2 (prebuilt wheels for
Python 3.12) via [tool.uv] override-dependencies because 1.1 only ships an
sdist that fails to compile on 3.12.
Copy .env.example to .env for local database and 1Password SDK settings.
Never commit .env.
The cf command prepares Texas campaign finance data for the resolution pipeline.
| Command | Description |
|---|---|
cf download texas |
Download and extract TEC CSV data via Selenium |
cf convert texas |
Convert extracted CSV/txt files to parquet |
cf verify texas |
Verify required record types are present |
cf prepare texas |
Run download → convert → verify in order |
Common options:
cf download texas --headless --overwrite --out /path/to/datacf convert texas --overwrite --no-keep-csvcf prepare texas --skip-download— convert and verify only
uv run cf prepare texasThis downloads the latest Texas Ethics Commission CSV bundle, converts files
under tmp/texas/ to parquet, and prints a coverage table. The command exits
non-zero if any required record type (RCPT, EXPN, LOAN, FILER, CVR1)
is missing or empty.
Use --out PATH on download or prepare to write files to a custom directory
instead of the default tmp/texas/.
uv run cf --version
uv run python -m app.cli --helpBoth uv run cf and uv run python -m app.cli work after uv sync; you do not
need PYTHONPATH=.:app.
# Load a sample into the dev database
uv run python scripts/loaders/production_loader.py testing texas_sample
# Run resolve pipeline (see docs/superpowers/specs/ for design)
uv run python -m app.resolve --helpuv run pytest tests/ app/tests/ -x
uv run pytest tests/resolve -m "not integration"See docs/TESTING.md for the full test strategy.