policyobservatory/ph-corpus

ph-corpus

A structured data repository of Philippine legislative documents. Every bill is a folder, every section is a file, and every bill and section file is plain Markdown with YAML frontmatter: readable by any tool, queryable through corpus-api, and, apart from the manual taxonomy and prompt edits described below, written only by corpus-explorer.

The repository exists because legislative data from the HREP API arrives in a form that is hard to query, impossible to search semantically, and not connected to the taxonomies that make it useful for product development. This corpus transforms that raw data into a stable, versioned, filterable store that multiple products can read from without duplicating ingestion logic.

corpus-explorer  →  ph-corpus  →  corpus-api

How files are organized

| Path | Description |
| --- | --- |
| corpus.yaml | Corpus identity (country, legislature, schema version) |
| taxonomy.yaml | Controlled vocabularies for all tag fields (versioned) |
| tagging-prompt.md | The exact LLM prompt used to tag provision sections |
| members/{name_code}.json | Cached legislator profiles (party, district, photo) |
| {congress}/{chamber}/{bill_id}/_bill.md | Bill-level metadata |
| {congress}/{chamber}/{bill_id}/s{nn}.md | Per-section provision files |

The naming conventions are deliberate. Congress, chamber, and bill ID are all in the folder path, so you can navigate to any bill without a database. Section files are numbered to match the bill's own section numbering: s01.md is Section 1, and s00.md is a preamble or unnumbered opening section such as an explanatory note.

Example paths:

20th-congress/house-bills/HB-00013/_bill.md
20th-congress/house-bills/HB-00013/s00.md   ← explanatory note
20th-congress/house-bills/HB-00013/s01.md   ← Section 1
20th-congress/house-bills/HB-00013/s02.md   ← Section 2
members/De_Venecia.json
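
The path convention above can be sketched in a few lines of Python. This is an illustrative helper only, not code from corpus-explorer or corpus-api; the function names are hypothetical, and the two-digit zero padding is inferred from the s{nn}.md pattern (how sections numbered above 99 are named is not stated here).

```python
import re
from pathlib import PurePosixPath


def bill_path(congress: str, chamber: str, bill_id: str) -> PurePosixPath:
    """Repo-relative path of a bill's metadata file."""
    return PurePosixPath(congress, chamber, bill_id, "_bill.md")


def section_path(congress: str, chamber: str, bill_id: str, section: int) -> PurePosixPath:
    """Repo-relative path of a section file; 0 is the s00.md opening section."""
    if section < 0:
        raise ValueError("section number must be non-negative")
    # Assumption: numbers are zero-padded to at least two digits (s01.md, s12.md).
    return PurePosixPath(congress, chamber, bill_id, f"s{section:02d}.md")


def parse_section_path(path: str) -> dict:
    """Recover congress, chamber, bill ID, and section number from a section path."""
    parts = PurePosixPath(path).parts
    if len(parts) != 4:
        raise ValueError(f"expected congress/chamber/bill_id/file, got: {path}")
    congress, chamber, bill_id, filename = parts
    match = re.fullmatch(r"s(\d+)\.md", filename)
    if match is None:
        raise ValueError(f"not a section file: {filename}")
    return {
        "congress": congress,
        "chamber": chamber,
        "bill_id": bill_id,
        "section": int(match.group(1)),
    }
```

For example, section_path("20th-congress", "house-bills", "HB-00013", 1) yields 20th-congress/house-bills/HB-00013/s01.md, and parse_section_path reverses that mapping.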

When you'd interact with this repo directly

Most of the time you won't. corpus-explorer writes to it and corpus-api reads from it. You interact directly when:

  • Updating the taxonomy — adding a new domain, provision type, or entity requires editing taxonomy.yaml in this repo and re-tagging affected provisions.
  • Inspecting a specific bill — the files are plain text; you can read any bill or section directly without running the API.
  • Debugging a tagging issue — section frontmatter and body text are both here; you can see exactly what was tagged and what text it came from.
  • Updating tagging-prompt.md — the prompt is versioned here so any change is tracked in git history alongside the provisions it produced.

Quick Start: Common tasks

Updating the taxonomy

When adding new domains, provision types, or entities to taxonomy.yaml, the update must be atomic — commit the taxonomy change and the re-tagged provisions together so the repo is never in an inconsistent state.

  1. Add the new value to taxonomy.yaml — never rename or remove existing values
  2. Bump the version number by 1
  3. Run split_sections.py --retag in corpus-explorer to re-tag stale sections (those where taxonomy_version < new_version)
  4. Commit taxonomy.yaml and the re-tagged section files together
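
Before running the re-tag in step 3, it can help to see which sections would count as stale. The sketch below is not the logic of split_sections.py, which lives in corpus-explorer and may differ. It assumes that taxonomy_version is a top-level integer key in each section's frontmatter, that the frontmatter is delimited by `---` lines, and that a section with no recorded version should be treated as stale. The function names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: the version is stored as a plain integer on its own line.
_VERSION_RE = re.compile(r"^taxonomy_version:\s*(\d+)\s*$", re.MULTILINE)


def frontmatter_block(text: str) -> str:
    """Return the text between the opening and closing '---' lines, or '' if absent."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return ""
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i])
    return ""


def is_stale(text: str, new_version: int) -> bool:
    """True when the section was tagged under an older taxonomy version."""
    match = _VERSION_RE.search(frontmatter_block(text))
    # Assumption: a section without a recorded version needs tagging too.
    return match is None or int(match.group(1)) < new_version


def stale_sections(corpus_root: str, new_version: int) -> list:
    """List section files under {congress}/{chamber}/{bill_id}/ that are stale."""
    root = Path(corpus_root)
    return sorted(
        path
        for path in root.glob("*/*/*/s[0-9]*.md")
        if is_stale(path.read_text(encoding="utf-8"), new_version)
    )
```

Calling stale_sections(".", 4) from the repository root would list every section file still tagged under version 3 or earlier, which is the set that should appear in the commit from step 4.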

Re-scraping bill metadata

Section files (s{nn}.md) are stable once written, because the bill text doesn't change. Bill metadata files (_bill.md) need periodic updates for any bill still moving through Congress, because fields like second_reading, votes, and republic_act fill in over time.

Run house_measures.py --update in corpus-explorer to re-fetch _bill.md for all bills where status_order < 10.00.

Setup

No code to run, no dependencies to install. This repository is a structured data store, not an application:

  • To populate it: use corpus-explorer
  • To query it: use corpus-api
  • To use it directly: read the files — they are plain Markdown with YAML frontmatter, readable by any tool that understands that format
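
As an illustration of reading a file directly, here is a minimal standard-library sketch that splits a file into its frontmatter fields and body text. It assumes the frontmatter is delimited by `---` lines and only picks up flat, top-level `key: value` pairs as strings; nested mappings and lists need a full YAML parser. The function name is hypothetical; the authoritative field list is in ph-corpus-schema.md.

```python
from typing import Dict, Tuple


def split_frontmatter(text: str) -> Tuple[Dict[str, str], str]:
    """Split a Markdown file into (top-level frontmatter fields, body text)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter: the whole file is body
    end = None
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            end = i
            break
    if end is None:
        return {}, text  # unterminated frontmatter: treat as plain body

    fields: Dict[str, str] = {}
    for line in lines[1:end]:
        # Keep only top-level "key: value" lines; skip indented (nested)
        # lines, list items, comments, and blank lines.
        if not line or line[0] in " \t#-" or ":" not in line:
            continue
        key, _, value = line.partition(":")
        # A key that opens a nested block ends up with an empty string value.
        fields[key.strip()] = value.strip().strip("'\"")

    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return fields, body
```

Given a section file that begins with a `taxonomy_version: 3` frontmatter line, the call returns that field as the string "3" alongside the section's body text, which is often enough when debugging a tagging issue.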

PDFs are not tracked in git

Only .md and .json files are committed. PDF files and .md.cleaned marker files are excluded via .gitignore. PDFs are large and can always be re-downloaded from the HREP source; committing them would bloat the repo permanently. To regenerate Markdown from existing PDFs, re-run the Pass 2 scripts in corpus-explorer.

Schema reference

The full schema — including the exact format of every file, all frontmatter fields, and naming conventions — is in ph-corpus-schema.md. Always read it before modifying any file format or adding new fields.

About

A collection of Philippine legislative documents from the House of Representatives and Senate of the Philippines.
