OpenNeuroOrg · effigies · Mar 20, 2026 · Mar 19, 2026
diff --git a/.github/workflows/update-data.yml b/.github/workflows/update-data.yml
@@ -2,7 +2,7 @@ name: Update Dashboard Data
 
 on:
   schedule:
-    - cron: '0 6 * * *'  # Daily at 6am UTC
+    - cron: "0 6 * * *" # Daily at 6am UTC
   workflow_dispatch:
 
 permissions:
@@ -22,7 +22,7 @@ jobs:
         uses: astral-sh/setup-uv@37802adc94f370d6bfd71619e3f0bf239e1f3b78 # v7.6.0
 
       - name: Set up Python
-        run: uv python install 3.13
+        run: uv python install 3.14
 
       - name: Restore bare repo cache
         uses: actions/cache@cdf6c1fa76f9f475f3d7449005a359c84ca0f306 # v5.0.3
@@ -31,20 +31,8 @@ jobs:
           key: git-repos-${{ github.run_id }}
           restore-keys: git-repos-
 
-      - name: Fetch GraphQL data
-        run: uv run scripts/fetch_graphql.py --output-dir data
-
-      - name: Check GitHub mirrors
-        run: uv run scripts/check_github.py --output-dir data
-
-      - name: Check S3 versions
-        run: uv run scripts/check_s3_version.py --output-dir data
-
-      - name: Check S3 files
-        run: uv run scripts/check_s3_files.py --output-dir data --cache-dir ~/.cache/openneuro-dashboard/repos
-
-      - name: Generate summary
-        run: uv run scripts/summarize.py --output-dir data
+      - name: Run pipeline
+        run: uv run openneuro-dashboard run-all
 
       - name: Commit and push if changed
         run: |

diff --git a/CLAUDE.md b/CLAUDE.md
@@ -8,10 +8,10 @@ OpenNeuro Dataset Monitor — a static HTML/JS dashboard with a Python data pipe
 
 ## Architecture
 
-**Data Pipeline** (5-stage ETL):
+**Data Pipeline** (5-stage ETL, implemented in `code/src/openneuro_dashboard/`):
 
 ```
-fetch_graphql.py → check_github.py → check_s3_version.py → check_s3_files.py → summarize.py
+fetch-graphql → check-github → check-s3-version → check-s3-files → summarize
 ```
 
 Each stage reads outputs from previous stages and can be run independently.
@@ -20,28 +20,39 @@ Each stage reads outputs from previous stages and can be run independently.
 
 **Data**: Pipeline outputs go to `data/` as JSON. The dashboard loads these via fetch. Schema defined in `schema/openneuro-dashboard.yaml` (LinkML, version 1.0.0).
 
-## Running Scripts
+## Running the Pipeline
 
-All Python scripts use **`uv`** with inline PEP 723 dependency declarations (no requirements.txt or pyproject.toml). Run with:
+The pipeline is an installable Python package with an `openneuro-dashboard` CLI:
 
 ```bash
-uv run scripts/fetch_graphql.py
-uv run scripts/check_github.py
-uv run scripts/check_s3_version.py
-uv run scripts/check_s3_files.py --cache-dir ~/.cache/openneuro-dashboard/repos
+cd code
+uv sync
+uv run openneuro-dashboard run-all --output-dir ../data
 ```
 
-Test data generators:
+Individual stages:
+
+```bash
+cd code
+uv run openneuro-dashboard fetch-graphql --output-dir ../data
+uv run openneuro-dashboard check-github --output-dir ../data
+uv run openneuro-dashboard check-s3-version --output-dir ../data
+uv run openneuro-dashboard check-s3-files --output-dir ../data --cache-dir ~/.cache/openneuro-dashboard/repos
+uv run openneuro-dashboard summarize --output-dir ../data
+```
+
+Test data generation:
+
 ```bash
-uv run scripts/gen_data/graphql.py
-uv run scripts/gen_data/github.py
-uv run scripts/gen_data/s3_version.py
+cd code
+uv run openneuro-dashboard gen-data --output-dir ../data --seed 42
 ```
 
-After running either the full or test scripts, aggregate summary data with:
+## Running Tests
 
 ```bash
-uv run scripts/summarize.py
+cd code
+uv run --group test pytest -v
 ```
 
 ## Serving the Dashboard
@@ -54,7 +65,7 @@ python -m http.server 8000
 ## Key Conventions
 
 - Dataset IDs match pattern `^ds[0-9]{6}$`
-- All output JSON files include `schemaVersion: "1.1.0"` (from `scripts/utils.py:SCHEMA_VERSION`)
+- All output JSON files include `schemaVersion: "1.1.0"` (from `code/src/openneuro_dashboard/utils.py:SCHEMA_VERSION`)
 - Snapshot metadata and file listings are immutable; registry, check results, and summary are mutable
-- Scripts use async I/O (asyncio) and include `--validate` flags for data consistency checking
-- Python requires >=3.13
+- Pipeline modules use async I/O (asyncio) and include `--validate` flags for data consistency checking
+- Python requires >=3.14
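
Reviewer note: the Key Conventions in the CLAUDE.md hunk above (the dataset ID pattern and the `schemaVersion` field) could be checked with something like the following. This is a minimal sketch, not the package's code: `check_output` and the `"id"` key are hypothetical, and `SCHEMA_VERSION` is only assumed to mirror the constant in `utils.py`, using the value quoted in the diff.

```python
import re

# Pattern quoted in the Key Conventions list.
DATASET_ID_RE = re.compile(r"^ds[0-9]{6}$")

# Assumed to mirror utils.SCHEMA_VERSION; the value is the one quoted in the diff.
SCHEMA_VERSION = "1.1.0"


def is_valid_dataset_id(dataset_id: str) -> bool:
    """Return True for 'ds' followed by exactly six digits."""
    # fullmatch rejects a trailing newline, which match() with '$' would accept.
    return DATASET_ID_RE.fullmatch(dataset_id) is not None


def check_output(doc: dict) -> list:
    """Collect convention violations for one output document (hypothetical helper)."""
    problems = []
    if doc.get("schemaVersion") != SCHEMA_VERSION:
        problems.append(f"unexpected schemaVersion: {doc.get('schemaVersion')!r}")
    dataset_id = doc.get("id")  # hypothetical key name, not confirmed by the diff
    if dataset_id is not None and not is_valid_dataset_id(dataset_id):
        problems.append(f"bad dataset id: {dataset_id!r}")
    return problems
```

An empty list from `check_output` means both conventions hold for that document. Whether the real `--validate` flags perform these particular checks is not shown in the diff.
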
diff --git a/README.md b/README.md
@@ -9,11 +9,13 @@ A dashboard for tracking synchronization status of OpenNeuro datasets across Gra
 The monitoring system uses a multi-stage pipeline that generates static JSON files consumed by a client-side dashboard:
 
 ```
-GraphQL → GitHub Check → S3 Version → Git Trees → S3 Diff → Summarize → Dashboard
+fetch-graphql → check-github → check-s3-version → check-s3-files → summarize
 ```
 
 Each stage reads from previous stages and writes new check files, allowing incremental updates and independent execution.
 
+The pipeline is implemented as an installable Python package under `code/`, exposing an `openneuro-dashboard` CLI.
+
 ### Data Model
 
 All data files (aspirationally) follow a versioned schema defined in `schema/openneuro-dashboard.yaml` (LinkML format).
@@ -45,13 +47,15 @@ data/datasets/{id}/
 ### Check Logic
 
 #### GitHub Check
+
 - Uses `git ls-remote --symref` to fetch all refs
 - Validates:
   - All snapshot tags exist on GitHub
   - HEAD points to latest snapshot
   - Commit SHAs match GraphQL data
 
 #### S3 Version Check
+
 - Fetches `dataset_description.json` from S3
 - Extracts version from `DatasetDOI` field
 - **Edge cases**:
@@ -64,6 +68,7 @@ data/datasets/{id}/
 - All other cases allow file comparison with assumed version
 
 #### S3 File Diff
+
 - Compares S3 file listing against git tree
 - Uses version from `s3-version.json` (either from DOI or assumed latest)
 - Skipped if S3 is blocked (403)
@@ -72,83 +77,68 @@ data/datasets/{id}/
 ### Status Values
 
 **Per-check statuses**:
+
 - `ok`: Check passed
 - `warning`: Minor issues (e.g., assumed version, HEAD mismatch)
 - `error`: Check failed or blocked
 - `version-mismatch`: S3 DOI version ≠ latest snapshot
 - `pending`: Check not yet run
 
 **Special flags**:
-- `s3Blocked: true` in summary indicates 403 error (shows lock icon 🔒)
-
-## Running the Pipeline
 
-Some scripts declare dependencies in their headers.
-The simplest way to run these scripts is `uv run`.
+- `s3Blocked: true` in summary indicates 403 error (shows lock icon)
 
-### Stage 1: Fetch GraphQL Data
+## Setup
 
 ```bash
-uv run scripts/fetch_graphql.py --output-dir data
+uv sync
 ```
 
-Queries OpenNeuro GraphQL API for all public datasets and their snapshots. Creates:
-- `datasets-registry.json`
-- Per-dataset `snapshots.json` and `snapshots/{tag}/metadata.json`
+Requires Python 3.14+.
 
-**Options**:
-- `--page-size N`: Datasets per GraphQL page (default: 100)
-- `--prefetch N`: Pages to buffer (default: 2)
-- `--verbose`: Detailed logging
+## Running the Pipeline
 
-### Stage 2: Check GitHub Mirrors
+### Full Pipeline
 
 ```bash
-uv run scripts/check_github.py --output-dir data
+uv run openneuro-dashboard run-all
 ```
 
-Validates GitHub mirror status for all datasets.
-
-**Options**:
-- `--concurrency N`: Parallel git operations (default: 10)
-- `--validate`: Run post-check validation
-- `--verbose`: Detailed logging
-
-### Stage 3: Check S3 Versions
+### Individual Stages
 
 ```bash
-uv run scripts/check_s3_version.py --output-dir data
-```
+# Stage 1: Fetch GraphQL data
+uv run openneuro-dashboard fetch-graphql
 
-Fetches `dataset_description.json` from S3 and extracts versions.
+# Stage 2: Check GitHub mirrors
+uv run openneuro-dashboard check-github
 
-**Options**:
-- `--concurrency N`: Parallel HTTP requests (default: 20)
-- `--validate`: Run post-check validation
+# Stage 3: Check S3 versions
+uv run openneuro-dashboard check-s3-version
 
-### Stage 4: Fetch Git File Trees
+# Stage 4: Check S3 files
+uv run openneuro-dashboard check-s3-files --cache-dir ~/.cache/openneuro-dashboard/repos
 
-(Not yet implemented - currently using generated test data)
-
-Should fetch file listings from git for each snapshot tag:
-```bash
-git clone --bare --depth=1 --filter=blob:none --branch {tag} {repo}
-git ls-files --with-tree {tag}
+# Stage 5: Summarize
+uv run openneuro-dashboard summarize
 ```
 
-### Stage 5: Generate S3 File Diffs
+Common options:
 
-(Not yet implemented - currently using generated test data)
+- `--verbose` / `-v`: Enable verbose output
+- `--max-datasets N`: Limit number of datasets (for `fetch-graphql` and `run-all`)
 
-Should compare S3 file listings against git trees and create `s3-diff.json`.
-
-### Stage 6: Summarize
+### Generating Test Data
 
 ```bash
-uv run scripts/summarize.py --output-dir data
+uv run openneuro-dashboard gen-data --num-datasets 50 --seed 42
 ```
 
-Reads all check files and generates `all-datasets.json` with aggregated results.
+## Running Tests
+
+```bash
+uv run --group test pytest -v
+```
 
 ## Dashboard
 
@@ -165,7 +155,6 @@ Static HTML/CSS/JS dashboard served from the repository root.
 ### Serving
 
 ```bash
-# Python
 python -m http.server 8000
 ```
 
@@ -174,43 +163,29 @@ Navigate to `http://localhost:8000`
 ### Features
 
 **Main view**:
+
 - Sortable/filterable dataset table
 - Summary statistics by status
 - Search by dataset ID
 - Color-coded status badges
-- Lock icons (🔒) for blocked S3 datasets
+- Lock icons for blocked S3 datasets
 
 **Detail view**:
+
 - Snapshot history
 - Detailed check results with expandable sections
 - File diff viewer (when mismatches exist)
 - Lazy-loaded file listings
 
-## Test Data Generation
-
-Located in `scripts/gen_data/`, these scripts simulate pipeline stages for development:
-
-```bash
-python scripts/gen_data/graphql.py
-python scripts/gen_data/github.py
-python scripts/gen_data/s3_version.py
-python scripts/gen_data/s3_version.py
-```
-
-## Development Workflow
-
-1. **Add real pipeline stage**: Implement stage script (e.g., `fetch_git_trees.py`)
-2. **Update test generator**: Modify corresponding `gen_data/*.py` to match
-3. **Test incrementally**: Run new stage, then existing summarize + dashboard
-4. **Validate**: Use `--validate` flags to check data consistency
-
 ## Data Immutability
 
 **Immutable** (never changes once created):
+
 - `snapshots/{tag}/metadata.json`
 - `snapshots/{tag}/files.json`
 
 **Mutable** (updated on each check run):
+
 - `datasets-registry.json`
 - `github.json`
 - `s3-version.json`
@@ -219,15 +194,6 @@ python scripts/gen_data/s3_version.py
 
 This allows caching of snapshot data while keeping check results fresh.
 
-## Future Enhancements
-
-- [ ] Implement git tree fetching (stage 4)
-- [ ] Implement S3 file diff (stage 5)
-- [ ] Scripts to auto-fix issues based on outputs
-- [ ] Schedule data updates in CI
-- [ ] Track historical trends
-- [ ] Integration with GitHub issues to track known problems
-
 ## Schema Evolution
 
 The LinkML schema (`schema/openneuro-dashboard.yaml`) includes a `schemaVersion` field in all data files. When making breaking changes:

diff --git a/code/src/openneuro_dashboard/__init__.py b/code/src/openneuro_dashboard/__init__.py
@@ -0,0 +1,20 @@
+"""OpenNeuro Dashboard data-population tools.
+
+ondiagnostics modules used
+--------------------------
+- ``ondiagnostics.graphql``: GraphQLResponse, PageInfo, create_client, get_page
+  (pagination over the OpenNeuro GraphQL API in fetch_graphql.py)
+- ``ondiagnostics.subprocs``: git
+  (async subprocess wrapper for bare-clone / fetch in check_s3_files.py)
+- ``ondiagnostics.tasks.git``: list_refs
+  (remote ref listing for GitHub mirror checks in check_github.py)
+
+Dashboard-specific logic
+------------------------
+- fetch_graphql: writes datasets-registry.json and per-snapshot metadata
+- check_github:  compares registry against GitHub mirror refs
+- check_s3_version: extracts S3 export version from dataset_description.json
+- check_s3_files: diffs git tree against S3 object listing
+- summarize: aggregates per-dataset check results into all-datasets.json
+- utils: shared JSON I/O helpers and schema version constant
+"""
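
Reviewer note: the docstring says `check_s3_version` extracts the S3 export version from `dataset_description.json`, and the README names the `DatasetDOI` field as the source. The sketch below shows one way that extraction could look. It assumes DOIs shaped like `10.18112/openneuro.ds000001.v1.0.0`; the function name and the regular expression are illustrative and are not taken from the package.

```python
import re
from typing import Optional

# Assumed DOI shape: "<prefix>/openneuro.<dataset id>.v<version>".
# search() is used so that a leading "doi:" or URL prefix does not matter.
DOI_VERSION_RE = re.compile(r"openneuro\.(ds[0-9]{6})\.v([0-9A-Za-z.\-]+)$")


def version_from_description(description: dict) -> Optional[str]:
    """Return the version embedded in DatasetDOI, or None if it cannot be found."""
    doi = description.get("DatasetDOI")
    if not isinstance(doi, str):
        return None  # field missing or not a string
    match = DOI_VERSION_RE.search(doi.strip())
    return match.group(2) if match else None
```

Returning `None` leaves the caller free to fall back to an assumed version, which is the behavior the README describes for datasets whose DOI carries no usable version.
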