A structured data repository of Philippine legislative documents. Every bill is a folder, every section is a file, and every file is plain Markdown with YAML frontmatter — readable by any tool, queryable by corpus-api, and writable only by corpus-explorer.
The repository exists because legislative data from the HREP API arrives in a form that is hard to query, impossible to search semantically, and not connected to the taxonomies that make it useful for product development. This corpus transforms that raw data into a stable, versioned, filterable store that multiple products can read from without duplicating ingestion logic.
corpus-explorer → ph-corpus → corpus-api
| Path | Description |
|---|---|
| `corpus.yaml` | Corpus identity (country, legislature, schema version) |
| `taxonomy.yaml` | Controlled vocabularies for all tag fields (versioned) |
| `tagging-prompt.md` | The exact LLM prompt used to tag provision sections |
| `members/{name_code}.json` | Cached legislator profiles (party, district, photo) |
| `{congress}/{chamber}/{bill_id}/_bill.md` | Bill-level metadata |
| `{congress}/{chamber}/{bill_id}/s{nn}.md` | Per-section provision files |
The naming conventions are deliberate. Congress, chamber, and bill ID are all in the folder path so you can navigate to any bill without a database. Section files are numbered to match the bill's own section numbering — s01.md is Section 1, s00.md is a preamble or unnumbered opening section.
Example paths:
20th-congress/house-bills/HB-00013/_bill.md
20th-congress/house-bills/HB-00013/s00.md ← explanatory note
20th-congress/house-bills/HB-00013/s01.md ← Section 1
20th-congress/house-bills/HB-00013/s02.md ← Section 2
members/De_Venecia.json
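Because the layout is fully determined by congress, chamber, bill ID, and section number, paths can be built without any lookup. The following is a minimal sketch of that idea; the function names are hypothetical and not part of corpus-explorer or corpus-api, and the two-digit zero padding is an assumption based on the `s{nn}.md` pattern and the example paths above.

```python
from pathlib import PurePosixPath


def bill_dir(congress: str, chamber: str, bill_id: str) -> PurePosixPath:
    """Folder for one bill: {congress}/{chamber}/{bill_id}."""
    return PurePosixPath(congress) / chamber / bill_id


def bill_metadata_path(congress: str, chamber: str, bill_id: str) -> PurePosixPath:
    """Path to the bill-level metadata file, _bill.md."""
    return bill_dir(congress, chamber, bill_id) / "_bill.md"


def section_path(congress: str, chamber: str, bill_id: str, section: int) -> PurePosixPath:
    """Path to one section file; 0 is the preamble or unnumbered opening section."""
    if section < 0:
        raise ValueError("section number must be 0 or greater")
    # Assumption: section numbers are zero-padded to two digits (s00, s01, ...).
    return bill_dir(congress, chamber, bill_id) / f"s{section:02d}.md"


print(section_path("20th-congress", "house-bills", "HB-00013", 1))
# 20th-congress/house-bills/HB-00013/s01.md
```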
Most of the time you won't need to touch this repository directly: corpus-explorer writes to it and corpus-api reads from it. You interact with it directly when:
- Updating the taxonomy: adding a new domain, provision type, or entity requires editing `taxonomy.yaml` in this repo and re-tagging affected provisions.
- Inspecting a specific bill: the files are plain text; you can read any bill or section directly without running the API.
- Debugging a tagging issue: section frontmatter and body text are both here; you can see exactly what was tagged and what text it came from.
- Updating `tagging-prompt.md`: the prompt is versioned here so any change is tracked in git history alongside the provisions it produced.
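For inspection or debugging, a section file can be separated into its frontmatter and body with a few lines of standard-library code. This is a rough sketch that assumes the usual convention of `---` delimiter lines around the YAML block; the sample text and the helper name are placeholders, and the real field set is defined in ph-corpus-schema.md. Parsing the frontmatter string into values would need a YAML library, which is left out here.

```python
def split_frontmatter(text: str) -> tuple[str, str]:
    """Return (frontmatter, body) for a Markdown document with YAML frontmatter.

    Returns ("", text) when the document does not open with a '---' line
    or when the closing '---' line is missing.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            frontmatter = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            return frontmatter, body
    return "", text


# Placeholder content, not copied from a real section file.
sample = """---
taxonomy_version: 3
---
SECTION 1. Short Title.
"""

frontmatter, body = split_frontmatter(sample)
print(frontmatter)  # taxonomy_version: 3
print(body)         # SECTION 1. Short Title.
```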
When adding new domains, provision types, or entities to taxonomy.yaml, the update must be atomic — commit the taxonomy change and the re-tagged provisions together so the repo is never in an inconsistent state.
- Add the new value to `taxonomy.yaml`; never rename or remove existing values
- Bump the `version` number by 1
- Run `split_sections.py --retag` in `corpus-explorer` to re-tag stale sections (those where `taxonomy_version < new_version`)
- Commit `taxonomy.yaml` and the re-tagged section files together
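The rules in these steps can be checked mechanically before committing. The sketch below is illustrative only: it assumes the taxonomy has already been loaded into a dict with a `version` key and one list of values per tag field, which may not match the actual layout of `taxonomy.yaml`. The function names, the `domain` field, and its values are hypothetical.

```python
def check_taxonomy_update(old: dict, new: dict) -> list[str]:
    """Return a list of problems; an empty list means the update looks additive."""
    problems = []
    # Step 2: the version number goes up by exactly 1.
    if new["version"] != old["version"] + 1:
        problems.append(
            f"version must go from {old['version']} to {old['version'] + 1}, "
            f"got {new['version']}"
        )
    # Step 1: existing values are never renamed or removed.
    for field, old_values in old.items():
        if field == "version":
            continue
        missing = set(old_values) - set(new.get(field, []))
        if missing:
            problems.append(f"{field}: removed or renamed {sorted(missing)}")
    return problems


def is_stale(taxonomy_version: int, new_version: int) -> bool:
    """A section is stale when it was tagged under an older taxonomy version."""
    return taxonomy_version < new_version


# Hypothetical taxonomy contents, for illustration only.
old = {"version": 3, "domain": ["health", "education"]}
new = {"version": 4, "domain": ["health", "education", "energy"]}

print(check_taxonomy_update(old, new))  # []
print(is_stale(3, new["version"]))      # True
```

Running a check like this before the commit makes it harder to land a taxonomy change that silently invalidates tags already written to section files.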
Section files (s{nn}.md) are stable once written — the bill text doesn't change. Bill metadata files (_bill.md) need periodic updates for any bill still moving through congress, because fields like second_reading, votes, and republic_act fill in over time.
Run house_measures.py --update in corpus-explorer to re-fetch _bill.md for all bills where status_order < 10.00.
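The selection rule is a simple threshold on `status_order`. As a hedged illustration (not the actual logic inside `house_measures.py`), the sketch below assumes the `status_order` values have already been read from each `_bill.md` into a mapping; the helper name, the bill IDs other than HB-00013, and all the values shown are made up.

```python
UPDATE_THRESHOLD = 10.00  # bills below this status_order are re-fetched


def bills_to_refresh(status_by_bill: dict[str, float]) -> list[str]:
    """Return the bill IDs whose status_order is below the threshold, sorted."""
    return sorted(
        bill_id
        for bill_id, status_order in status_by_bill.items()
        if status_order < UPDATE_THRESHOLD
    )


# Hypothetical status_order values, for illustration only.
statuses = {"HB-00013": 2.0, "HB-00014": 10.0, "HB-00015": 9.5}

print(bills_to_refresh(statuses))  # ['HB-00013', 'HB-00015']
```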
No code to run, no dependencies to install. This repository is a structured data store — populate it using corpus-explorer and query it using corpus-api.
- To populate it: use `corpus-explorer`
- To query it: use `corpus-api`
- To use it directly: read the files. They are plain Markdown with YAML frontmatter, readable by any tool that understands that format
Only .md and .json files are committed. PDF files and .md.cleaned marker files are excluded via .gitignore. PDFs are large and can always be re-downloaded from the HREP source; committing them would bloat the repo permanently. To regenerate Markdown from existing PDFs, re-run the Pass 2 scripts in corpus-explorer.
The full schema — including the exact format of every file, all frontmatter fields, and naming conventions — is in ph-corpus-schema.md. Always read it before modifying any file format or adding new fields.