Source Cooperative Data Catalog

Each JSON file here is one STAC Collection for one Source Cooperative dataset.

Layout

catalog/
  {publisher}/{product}.json          # 1:1 product = dataset (common case)
  {publisher}/{product}-{slug}.json   # sub-dataset within a multi-dataset product
  _keywords.jsonl                     # canonical keyword vocabulary
  _synonyms.json                      # observed-keyword → canonical map
  _log.jsonl                          # per-action pipeline log

Most Source Coop products map 1:1 to a catalog entry. Some products contain multiple distinct datasets (e.g. harvard-lil/gov-data is an archive of ~300k data.gov datasets). These get decomposed into separate entries as we encounter them.

Identifiers

Every entry carries account_id and product_id as the first two project-specific fields, matching the {publisher}/{product} directory layout. The STAC-required id field is set to {account_id}/{product_id}, which is globally unique across Source Cooperative.
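A minimal sketch of how these identifiers relate to the layout, assuming that {publisher} in the path corresponds to account_id and the file stem to product_id. The helper names entry_path and check_ids are hypothetical and not part of the pipeline, and sub-dataset files with a -{slug} suffix are not handled:

```python
from pathlib import Path


def entry_path(entry: dict, root: Path = Path("catalog")) -> Path:
    """Return the expected file path for a 1:1 product entry (assumed mapping)."""
    return root / entry["account_id"] / f"{entry['product_id']}.json"


def check_ids(entry: dict) -> bool:
    """True when the STAC id equals {account_id}/{product_id}."""
    return entry["id"] == f"{entry['account_id']}/{entry['product_id']}"


example = {
    "account_id": "harvard-lil",
    "product_id": "gov-data",
    "id": "harvard-lil/gov-data",
}
print(entry_path(example).as_posix())
print(check_ids(example))
```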

State funnel

Every entry has a _state field tracking pipeline progress:

| State | Written by | What's in the file |
| --- | --- | --- |
| seed | `21-gen-seed` | Inventory aggregates (`file_count`, `total_bytes`, `exts`), API metadata (title, description, tags, mirrors, dates, visibility) |
| orphan | `21-gen-seed` | Inventory aggregates only; the bucket has this repo but the source.coop API does not. Dead-end; dashboard counts these. |
| drafted | `41-compose-collection` | Full STAC Collection assembled from cached README + probe + upstream STAC + AI synthesis |
| reviewed | (manual) | Human-confirmed; never overwritten by automation |

Locked states: orphan, drafted, reviewed. Gather and compose scripts refuse to clobber them. --redraft on compose bypasses drafted but not reviewed or orphan.

Entries only move forward. If an early-stage script is re-run against a later-state entry, it logs a locked line and exits 0.
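The locking rules above could be sketched as a single check. This is an illustration under stated assumptions, not the pipeline's actual code: the function name may_overwrite is hypothetical, and the redraft parameter stands in for the --redraft flag on compose.

```python
# States that automation must not clobber, as listed above.
LOCKED = {"orphan", "drafted", "reviewed"}


def may_overwrite(current_state: str, redraft: bool = False) -> bool:
    """Decide whether an automated script may rewrite an entry.

    Assumption: `redraft` models --redraft on compose, which unlocks
    only the drafted state and never reviewed or orphan.
    """
    if current_state not in LOCKED:
        return True  # e.g. a seed entry is still open to automation
    if redraft and current_state == "drafted":
        return True
    return False


for state in ("seed", "orphan", "drafted", "reviewed"):
    print(state, may_overwrite(state), may_overwrite(state, redraft=True))
```

A script following this pattern would log a locked line and exit 0 whenever the check returns False, rather than treating the skip as an error.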

Gather caches (not in the entry)

Raw inputs collected by the gather stage live under cache/, not in the catalog entry. compose_collection reads these when assembling the final STAC:

cache/readme/{pub}/{prod}/<filename>     # README bytes (directory per repo)
cache/probe/{pub}/{prod}.jsonl           # probe summary (first line) + per-file
cache/stac/{pub}/{prod}/…                # upstream STAC doc(s) mirrored from S3
cache/ai/{pub}/{prod}.jsonl              # AI response cache (30-day TTL)

This keeps the catalog entries clean: catalog.jsonl is just jq -c '.' catalog/*/*.json with no strip step, because working data never landed in the entries in the first place. The top-level readme field (concatenated README text with path headers) is intentional — it supports full-text search over the compiled catalog.
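For readers without jq, a rough Python equivalent of that compile step might look like the sketch below. The function name compile_catalog is hypothetical, and the output is not guaranteed to match jq byte for byte; it only shows that compiling is a plain concatenation of entries with no stripping.

```python
import json
from pathlib import Path


def compile_catalog(root: Path, out: Path) -> int:
    """Write one compact JSON line per entry file; return the entry count."""
    count = 0
    with out.open("w", encoding="utf-8") as fh:
        # The */*.json pattern skips top-level files such as _synonyms.json,
        # just as the shell glob catalog/*/*.json does.
        for path in sorted(root.glob("*/*.json")):
            entry = json.loads(path.read_text(encoding="utf-8"))
            line = json.dumps(entry, separators=(",", ":"), ensure_ascii=False)
            fh.write(line + "\n")
            count += 1
    return count
```

Usage would be along the lines of compile_catalog(Path("catalog"), Path("catalog.jsonl")).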

Entries describe, they don't enumerate

A catalog entry describes the dataset; it never lists every file or item in it. That enumeration scales to millions and belongs in the source's own collection.json. Compose keeps only semantic upstream links and drops item/child enumeration, pointing at the source through self/via links instead. For a partitioned table, it stores table:columns once at the collection level (41:dedup_asset_columns) and keeps per-asset only the schemas that genuinely differ.
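The column deduplication idea can be illustrated as follows. This is a sketch under assumptions, not the code behind 41:dedup_asset_columns: the function name hoist_table_columns is hypothetical, and treating the most common per-asset schema as the collection-level one is an assumed heuristic.

```python
import copy
import json
from collections import Counter


def hoist_table_columns(collection: dict) -> dict:
    """Return a copy with the most common asset schema stored once."""
    result = copy.deepcopy(collection)
    assets = result.get("assets", {})
    # Serialize each schema so identical column lists compare equal.
    keyed = {
        name: json.dumps(asset["table:columns"], sort_keys=True)
        for name, asset in assets.items()
        if "table:columns" in asset
    }
    if not keyed:
        return result
    common_key, _ = Counter(keyed.values()).most_common(1)[0]
    result["table:columns"] = json.loads(common_key)
    for name, key in keyed.items():
        if key == common_key:
            # The shared schema now lives at the collection level only.
            del assets[name]["table:columns"]
    return result
```

Assets whose schema differs from the hoisted one keep their own table:columns, so nothing is lost for the partitions that genuinely diverge.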

What counts as a separate dataset?

A dataset warrants its own catalog entry when:

Different provenance: different original source, producer, or collection method
Independent extent: could meaningfully have its own spatial and/or temporal bounds
Discoverable unit: someone searching for data would expect it as a distinct result

This rubric is intentionally incomplete. We develop it case-by-case as we encounter ambiguous products. When in doubt, start with one entry per product and flag it for review.

Known multi-dataset products

| Product | Situation | Status |
| --- | --- | --- |
| `harvard-lil/gov-data` | ~300k data.gov archives in BagIt format | Not yet decomposed |

Publishing model

These JSON files are the catalog's resting artifacts. Work has gone into collecting, creating, cleaning, and verifying them, but they are a best-effort record, not blessed by the data owner. We maintain catalog/{publisher}/{product}.json here; we do not push generated collection.json files back to the data buckets. At most, we offer a smaller version or backport to the repo owner, who uploads it themselves.

Scripts

See /pipeline.dot at the repo root for the dependency graph and /README.md for usage. Libraries live in scripts/lib/ (not runnable); numbered scripts (NN-name.py) are the pipeline stages.

