Merged
21 commits
04b011c
feat(code): add Python package scaffold with Typer CLI
effigies Mar 19, 2026
4e7baa4
chore(code): add uv.lock
effigies Mar 19, 2026
5fdb448
feat(code): add shared pipeline utilities
effigies Mar 19, 2026
49e0948
feat(code): migrate fetch_graphql stage
effigies Mar 19, 2026
ad3ab53
feat(code): migrate check_github stage
effigies Mar 19, 2026
c7e35c6
feat(code): migrate check_s3_version stage
effigies Mar 19, 2026
210feb4
feat(code): migrate check_s3_files stage
effigies Mar 19, 2026
8cd31bd
feat(code): migrate summarize stage
effigies Mar 19, 2026
0b5e00a
feat(code): wire all pipeline subcommands and implement run-all
effigies Mar 19, 2026
0736b76
test(code): add unit tests for dashboard-specific logic
effigies Mar 19, 2026
2da3d84
rf(code): add version constraint and document dependency boundary
effigies Mar 19, 2026
86af9a2
feat(code): migrate gen_data test data generators
effigies Mar 19, 2026
0ff9b4d
feat(code): wire gen-data subcommand
effigies Mar 19, 2026
b40bcbd
test(code): add integration test with hand-crafted fixture baseline
effigies Mar 19, 2026
d2e5e46
chore(code): remove old scripts/ directory
effigies Mar 19, 2026
3c14e59
build(ci): update workflow to use new CLI
effigies Mar 19, 2026
7de1599
doc: update README and CLAUDE.md for new package structure
effigies Mar 19, 2026
4e617c4
fix(code): use gql.Client directly instead of session context
effigies Mar 19, 2026
ba352f0
chore: Update ondiagnostics source
effigies Mar 19, 2026
f74d11d
chore: uv lock
effigies Mar 19, 2026
41e7a2a
chore: Move pyproject.toml into root to make more natural invocations
effigies Mar 20, 2026
20 changes: 4 additions & 16 deletions .github/workflows/update-data.yml
@@ -2,7 +2,7 @@ name: Update Dashboard Data

on:
schedule:
- cron: '0 6 * * *' # Daily at 6am UTC
- cron: "0 6 * * *" # Daily at 6am UTC
workflow_dispatch:

permissions:
@@ -22,7 +22,7 @@ jobs:
uses: astral-sh/setup-uv@37802adc94f370d6bfd71619e3f0bf239e1f3b78 # v7.6.0

- name: Set up Python
run: uv python install 3.13
run: uv python install 3.14

- name: Restore bare repo cache
uses: actions/cache@cdf6c1fa76f9f475f3d7449005a359c84ca0f306 # v5.0.3
@@ -31,20 +31,8 @@ jobs:
key: git-repos-${{ github.run_id }}
restore-keys: git-repos-

- name: Fetch GraphQL data
run: uv run scripts/fetch_graphql.py --output-dir data

- name: Check GitHub mirrors
run: uv run scripts/check_github.py --output-dir data

- name: Check S3 versions
run: uv run scripts/check_s3_version.py --output-dir data

- name: Check S3 files
run: uv run scripts/check_s3_files.py --output-dir data --cache-dir ~/.cache/openneuro-dashboard/repos

- name: Generate summary
run: uv run scripts/summarize.py --output-dir data
- name: Run pipeline
run: uv run openneuro-dashboard run-all

- name: Commit and push if changed
run: |
45 changes: 28 additions & 17 deletions CLAUDE.md
@@ -8,10 +8,10 @@ OpenNeuro Dataset Monitor — a static HTML/JS dashboard with a Python data pipe

## Architecture

**Data Pipeline** (5-stage ETL):
**Data Pipeline** (5-stage ETL, implemented in `code/src/openneuro_dashboard/`):

```
fetch_graphql.py → check_github.py → check_s3_version.py → check_s3_files.py → summarize.py
fetch-graphql → check-github → check-s3-version → check-s3-files → summarize
```

Each stage reads outputs from previous stages and can be run independently.
@@ -20,28 +20,39 @@ Each stage reads outputs from previous stages and can be run independently.

**Data**: Pipeline outputs go to `data/` as JSON. The dashboard loads these via fetch. Schema defined in `schema/openneuro-dashboard.yaml` (LinkML, version 1.0.0).

## Running Scripts
## Running the Pipeline

All Python scripts use **`uv`** with inline PEP 723 dependency declarations (no requirements.txt or pyproject.toml). Run with:
The pipeline is an installable Python package with an `openneuro-dashboard` CLI:

```bash
uv run scripts/fetch_graphql.py
uv run scripts/check_github.py
uv run scripts/check_s3_version.py
uv run scripts/check_s3_files.py --cache-dir ~/.cache/openneuro-dashboard/repos
cd code
uv sync
uv run openneuro-dashboard run-all --output-dir ../data
```

Test data generators:
Individual stages:

```bash
cd code
uv run openneuro-dashboard fetch-graphql --output-dir ../data
uv run openneuro-dashboard check-github --output-dir ../data
uv run openneuro-dashboard check-s3-version --output-dir ../data
uv run openneuro-dashboard check-s3-files --output-dir ../data --cache-dir ~/.cache/openneuro-dashboard/repos
uv run openneuro-dashboard summarize --output-dir ../data
```

Test data generation:

```bash
uv run scripts/gen_data/graphql.py
uv run scripts/gen_data/github.py
uv run scripts/gen_data/s3_version.py
cd code
uv run openneuro-dashboard gen-data --output-dir ../data --seed 42
```

After running either the full or test scripts, aggregate summary data with:
## Running Tests

```bash
uv run scripts/summarize.py
cd code
uv run --group test pytest -v
```

## Serving the Dashboard
@@ -54,7 +65,7 @@ python -m http.server 8000
## Key Conventions

- Dataset IDs match pattern `^ds[0-9]{6}$`
- All output JSON files include `schemaVersion: "1.1.0"` (from `scripts/utils.py:SCHEMA_VERSION`)
- All output JSON files include `schemaVersion: "1.1.0"` (from `code/src/openneuro_dashboard/utils.py:SCHEMA_VERSION`)
- Snapshot metadata and file listings are immutable; registry, check results, and summary are mutable
- Scripts use async I/O (asyncio) and include `--validate` flags for data consistency checking
- Python requires >=3.13
- Pipeline modules use async I/O (asyncio) and include `--validate` flags for data consistency checking
- Python requires >=3.14
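
The ID and schema-version conventions above could be checked with a small helper such as the one below. This is a minimal sketch: `is_valid_dataset_id` and `has_expected_schema_version` are hypothetical names, not functions from the package, and the `SCHEMA_VERSION` value here only mirrors the `"1.1.0"` string quoted above.

```python
import re

# Pattern quoted in the conventions above.
DATASET_ID_RE = re.compile(r"^ds[0-9]{6}$")

# Mirrors the documented value; the real constant lives in the package's utils.py.
SCHEMA_VERSION = "1.1.0"


def is_valid_dataset_id(dataset_id: str) -> bool:
    """Return True if the ID is 'ds' followed by exactly six digits."""
    # fullmatch also rejects a trailing newline, which a bare `$` would allow.
    return DATASET_ID_RE.fullmatch(dataset_id) is not None


def has_expected_schema_version(payload: dict) -> bool:
    """Return True if a loaded JSON document carries the expected schemaVersion."""
    return payload.get("schemaVersion") == SCHEMA_VERSION
```

For example, `is_valid_dataset_id("ds000001")` holds, while `"ds00001"` (five digits) is rejected.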
114 changes: 40 additions & 74 deletions README.md
@@ -9,11 +9,13 @@ A dashboard for tracking synchronization status of OpenNeuro datasets across Gra
The monitoring system uses a multi-stage pipeline that generates static JSON files consumed by a client-side dashboard:

```
GraphQL → GitHub Check → S3 Version → Git Trees → S3 Diff → Summarize → Dashboard
fetch-graphql → check-github → check-s3-version → check-s3-files → summarize
```

Each stage reads from previous stages and writes new check files, allowing incremental updates and independent execution.

The pipeline is implemented as an installable Python package under `code/`, exposing an `openneuro-dashboard` CLI.
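
Because each stage is a separate subcommand, a subset of stages can be chained from a driver script. The sketch below is illustrative only and not part of the package: `stage_command` and `run_stages` are hypothetical helpers, and the `--output-dir` flag is taken from the invocations shown in `CLAUDE.md`. The `run-all` subcommand already covers the full sequence.

```python
import subprocess

# Stage names in pipeline order, as listed above.
STAGES = [
    "fetch-graphql",
    "check-github",
    "check-s3-version",
    "check-s3-files",
    "summarize",
]


def stage_command(stage: str, output_dir: str = "data") -> list:
    """Build the argv list for one pipeline stage (hypothetical helper)."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    return ["uv", "run", "openneuro-dashboard", stage, "--output-dir", output_dir]


def run_stages(stages=STAGES, output_dir: str = "data") -> None:
    """Run the given stages in order, stopping at the first failure."""
    for stage in stages:
        # check=True raises CalledProcessError if a stage exits non-zero.
        subprocess.run(stage_command(stage, output_dir), check=True)
```

Calling `run_stages(["check-github", "summarize"])` would refresh only the GitHub check and the summary, reusing earlier stage outputs already present in `data/`.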

### Data Model

All data files (aspirationally) follow a versioned schema defined in `schema/openneuro-dashboard.yaml` (LinkML format).
@@ -45,13 +47,15 @@ data/datasets/{id}/
### Check Logic

#### GitHub Check

- Uses `git ls-remote --symref` to fetch all refs
- Validates:
- All snapshot tags exist on GitHub
- HEAD points to latest snapshot
- Commit SHAs match GraphQL data
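
The three validations above can be sketched against raw `git ls-remote --symref` output. This is not the package's implementation (which uses `ondiagnostics.tasks.git.list_refs`, whose signature is not shown here); `parse_ls_remote`, `check_mirror` and the issue strings are hypothetical, and the snapshot list is assumed to be ordered oldest to newest.

```python
def parse_ls_remote(output: str):
    """Split `git ls-remote --symref` output into (symrefs, refs) dicts."""
    symrefs, refs = {}, {}
    for line in output.splitlines():
        if not line.strip():
            continue
        left, _, name = line.partition("\t")
        if left.startswith("ref: "):
            # e.g. "ref: refs/heads/main\tHEAD"
            symrefs[name] = left[len("ref: "):]
        else:
            # e.g. "<sha>\trefs/tags/1.0.0"
            refs[name] = left
    return symrefs, refs


def check_mirror(output: str, snapshots: list) -> list:
    """Compare remote refs with (tag, sha) pairs; return a list of issues."""
    _, refs = parse_ls_remote(output)
    issues = []
    for tag, sha in snapshots:
        # Prefer the peeled entry, which annotated tags expose as "<ref>^{}".
        remote = refs.get(f"refs/tags/{tag}^{{}}") or refs.get(f"refs/tags/{tag}")
        if remote is None:
            issues.append(f"missing tag {tag}")
        elif remote != sha:
            issues.append(f"sha mismatch for {tag}")
    if snapshots and refs.get("HEAD") != snapshots[-1][1]:
        issues.append("HEAD is not at latest snapshot")
    return issues
```

An empty list means all three conditions held; how issues map onto the `ok`/`warning`/`error` statuses is left to the real check.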

#### S3 Version Check

- Fetches `dataset_description.json` from S3
- Extracts version from `DatasetDOI` field
- **Edge cases**:
@@ -64,6 +68,7 @@ data/datasets/{id}/
- All other cases allow file comparison with assumed version
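
Extracting the version from `DatasetDOI` might look like the sketch below. It assumes DOIs of the form `10.18112/openneuro.ds000001.v1.0.0`; that shape, the regular expression and the function name `version_from_description` are assumptions for illustration, and the edge-case handling used by `check-s3-version` is not reproduced here.

```python
import re
from typing import Optional

# Assumed DOI shape: "<prefix>/openneuro.<dataset id>.v<version>".
DOI_RE = re.compile(r"openneuro\.(?P<dataset>ds[0-9]{6})\.v(?P<version>[^\s/]+)$")


def version_from_description(description: dict) -> Optional[str]:
    """Return the version embedded in DatasetDOI, or None if it is absent or malformed."""
    doi = description.get("DatasetDOI")
    if not isinstance(doi, str):
        return None
    match = DOI_RE.search(doi.strip())
    return match.group("version") if match else None
```

A `None` result corresponds to the situations where the real check has to fall back to an assumed version.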

#### S3 File Diff

- Compares S3 file listing against git tree
- Uses version from `s3-version.json` (either from DOI or assumed latest)
- Skipped if S3 is blocked (403)
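
At its core, the comparison is a set difference between two path listings. The following is a minimal sketch; `diff_file_listings` and its result keys are hypothetical and do not reflect the actual layout of the diff file written by `check-s3-files`.

```python
def diff_file_listings(git_files, s3_files) -> dict:
    """Return paths present on only one side, sorted for stable output."""
    git_set, s3_set = set(git_files), set(s3_files)
    return {
        # Tracked in the git tree for the snapshot but absent from S3.
        "missing_from_s3": sorted(git_set - s3_set),
        # Present in S3 but not in the git tree.
        "extra_in_s3": sorted(s3_set - git_set),
    }
```

Both lists being empty would indicate that the S3 export matches the git tree for the version under comparison.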
@@ -72,83 +77,68 @@ data/datasets/{id}/
### Status Values

**Per-check statuses**:

- `ok`: Check passed
- `warning`: Minor issues (e.g., assumed version, HEAD mismatch)
- `error`: Check failed or blocked
- `version-mismatch`: S3 DOI version ≠ latest snapshot
- `pending`: Check not yet run

**Special flags**:
- `s3Blocked: true` in summary indicates 403 error (shows lock icon 🔒)

## Running the Pipeline

Some scripts declare dependencies in their headers.
The simplest way to run these scripts is `uv run`.
- `s3Blocked: true` in summary indicates 403 error (shows lock icon)

### Stage 1: Fetch GraphQL Data
## Setup

```bash
uv run scripts/fetch_graphql.py --output-dir data
uv sync
```

Queries OpenNeuro GraphQL API for all public datasets and their snapshots. Creates:
- `datasets-registry.json`
- Per-dataset `snapshots.json` and `snapshots/{tag}/metadata.json`
Requires Python 3.14+.

**Options**:
- `--page-size N`: Datasets per GraphQL page (default: 100)
- `--prefetch N`: Pages to buffer (default: 2)
- `--verbose`: Detailed logging
## Running the Pipeline

### Stage 2: Check GitHub Mirrors
### Full Pipeline

```bash
uv run scripts/check_github.py --output-dir data
uv run openneuro-dashboard run-all
```

Validates GitHub mirror status for all datasets.

**Options**:
- `--concurrency N`: Parallel git operations (default: 10)
- `--validate`: Run post-check validation
- `--verbose`: Detailed logging

### Stage 3: Check S3 Versions
### Individual Stages

```bash
uv run scripts/check_s3_version.py --output-dir data
```
# Stage 1: Fetch GraphQL data
uv run openneuro-dashboard fetch-graphql

Fetches `dataset_description.json` from S3 and extracts versions.
# Stage 2: Check GitHub mirrors
uv run openneuro-dashboard check-github

**Options**:
- `--concurrency N`: Parallel HTTP requests (default: 20)
- `--validate`: Run post-check validation
# Stage 3: Check S3 versions
uv run openneuro-dashboard check-s3-version

### Stage 4: Fetch Git File Trees
# Stage 4: Check S3 files
uv run openneuro-dashboard check-s3-files --cache-dir ~/.cache/openneuro-dashboard/repos

(Not yet implemented - currently using generated test data)

Should fetch file listings from git for each snapshot tag:
```bash
git clone --bare --depth=1 --filter=blob:none --branch {tag} {repo}
git ls-files --with-tree {tag}
# Stage 5: Summarize
uv run openneuro-dashboard summarize
```

### Stage 5: Generate S3 File Diffs
Common options:

(Not yet implemented - currently using generated test data)
- `--verbose` / `-v`: Enable verbose output
- `--max-datasets N`: Limit number of datasets (for `fetch-graphql` and `run-all`)

Should compare S3 file listings against git trees and create `s3-diff.json`.

### Stage 6: Summarize
### Generating Test Data

```bash
uv run scripts/summarize.py --output-dir data
uv run openneuro-dashboard gen-data --num-datasets 50 --seed 42
```

Reads all check files and generates `all-datasets.json` with aggregated results.
## Running Tests

```bash
uv run --group test pytest -v
```

## Dashboard

@@ -165,7 +155,6 @@ Static HTML/CSS/JS dashboard served from the repository root.
### Serving

```bash
# Python
python -m http.server 8000
```

@@ -174,43 +163,29 @@ Navigate to `http://localhost:8000`
### Features

**Main view**:

- Sortable/filterable dataset table
- Summary statistics by status
- Search by dataset ID
- Color-coded status badges
- Lock icons (🔒) for blocked S3 datasets
- Lock icons for blocked S3 datasets

**Detail view**:

- Snapshot history
- Detailed check results with expandable sections
- File diff viewer (when mismatches exist)
- Lazy-loaded file listings

## Test Data Generation

Located in `scripts/gen_data/`, these scripts simulate pipeline stages for development:

```bash
python scripts/gen_data/graphql.py
python scripts/gen_data/github.py
python scripts/gen_data/s3_version.py
python scripts/gen_data/s3_version.py
```

## Development Workflow

1. **Add real pipeline stage**: Implement stage script (e.g., `fetch_git_trees.py`)
2. **Update test generator**: Modify corresponding `gen_data/*.py` to match
3. **Test incrementally**: Run new stage, then existing summarize + dashboard
4. **Validate**: Use `--validate` flags to check data consistency

## Data Immutability

**Immutable** (never changes once created):

- `snapshots/{tag}/metadata.json`
- `snapshots/{tag}/files.json`

**Mutable** (updated on each check run):

- `datasets-registry.json`
- `github.json`
- `s3-version.json`
@@ -219,15 +194,6 @@ python scripts/gen_data/s3_version.py

This allows caching of snapshot data while keeping check results fresh.
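
One way to honor this split is a writer that refuses to overwrite immutable files. The sketch below is an assumption about how such a helper could look; `write_json` and its `immutable` flag are hypothetical and are not taken from the package's `utils` module.

```python
import json
import tempfile
from pathlib import Path


def write_json(path: Path, payload: dict, *, immutable: bool = False) -> bool:
    """Write payload as JSON; skip existing files marked immutable.

    Returns True if the file was written, False if it was left untouched.
    """
    if immutable and path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2) + "\n")
    return True


# Usage: the second write to an immutable snapshot file is skipped.
with tempfile.TemporaryDirectory() as tmp:
    meta = Path(tmp) / "snapshots" / "1.0.0" / "metadata.json"
    first = write_json(meta, {"tag": "1.0.0"}, immutable=True)
    second = write_json(meta, {"tag": "changed"}, immutable=True)
    kept = json.loads(meta.read_text())
```

Mutable files such as `github.json` would be written with the default `immutable=False`, so every check run replaces them.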

## Future Enhancements

- [ ] Implement git tree fetching (stage 4)
- [ ] Implement S3 file diff (stage 5)
- [ ] Scripts to auto-fix issues based on outputs
- [ ] Schedule data updates in CI
- [ ] Track historical trends
- [ ] Integration with GitHub issues to track known problems

## Schema Evolution

The LinkML schema (`schema/openneuro-dashboard.yaml`) includes a `schemaVersion` field in all data files. When making breaking changes:
20 changes: 20 additions & 0 deletions code/src/openneuro_dashboard/__init__.py
@@ -0,0 +1,20 @@
"""OpenNeuro Dashboard data-population tools.

ondiagnostics modules used
--------------------------
- ``ondiagnostics.graphql``: GraphQLResponse, PageInfo, create_client, get_page
(pagination over the OpenNeuro GraphQL API in fetch_graphql.py)
- ``ondiagnostics.subprocs``: git
(async subprocess wrapper for bare-clone / fetch in check_s3_files.py)
- ``ondiagnostics.tasks.git``: list_refs
(remote ref listing for GitHub mirror checks in check_github.py)

Dashboard-specific logic
------------------------
- fetch_graphql: writes datasets-registry.json and per-snapshot metadata
- check_github: compares registry against GitHub mirror refs
- check_s3_version: extracts S3 export version from dataset_description.json
- check_s3_files: diffs git tree against S3 object listing
- summarize: aggregates per-dataset check results into all-datasets.json
- utils: shared JSON I/O helpers and schema version constant
"""