Add pipeline-spec: v5 cross-match dataset methodology + script spec by Keeeeeeeks · Pull Request #1 · ksd3/layerwise-analysis

Keeeeeeeks · 2026-06-25T15:49:06Z

Summary

Adds the pipeline-spec/ folder: a specification for building a cluster-scale, cross-matched, multimodal dataset (stars + galaxies + AGN; images + spectra + light curves + tabular) to extend the Platonic Universe program. The approach is to extend and harden the existing v4/OmniSky pipeline.

01-methodology.md: global HEALPix-order-29 object identity; a multi-submission consistency model for shared-filesystem clusters (disjoint partitions, atomic writes, optional claim ledger, global per-service rate limits, single HF uploader); lsdb/HATS access over HuggingFace (HF) for MultimodalUniverse sources; false-match logic that accounts for real (proper-motion) and apparent (parallax) displacement; a CPU/GPU/network resource model; DONE-marker validation logic; and a per-source Data Access Matrix.
02-script-spec.md: concrete module/CLI layout, partition-aware SLURM arrays on CPU, and acceptance criteria.
03-v4-walkthrough.md: a walkthrough of the v4 OmniSky logic and of how v5 builds on it.
04-downstream-analysis-interface.md: the interface to the PRH (mutual-kNN/CKA) analysis. This doc is lower fidelity than the others, and the analysis itself will be handled by other team members, but it seemed worth including so we can compare notes.
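
For reviewers less familiar with HEALPix, here is a minimal sketch of how an order-29 identity can double as a partition key. It assumes the NESTED numbering scheme, where the parent pixel at a coarser order is obtained by dropping two bits per level. The function names (`coarsen`, `task_for_pixel`) and the modulo assignment of pixels to array tasks are hypothetical illustrations, not the layout defined in 02-script-spec.md.

```python
ORDER_MAX = 29                    # finest order; 12 * 4**29 pixels fit in a signed 64-bit int
N_PIXELS_29 = 12 * 4**ORDER_MAX   # total number of order-29 pixels on the sphere


def coarsen(healpix_29: int, order: int) -> int:
    """Return the NESTED pixel at `order` that contains the given order-29 pixel."""
    if not 0 <= order <= ORDER_MAX:
        raise ValueError(f"order must be in [0, {ORDER_MAX}], got {order}")
    if not 0 <= healpix_29 < N_PIXELS_29:
        raise ValueError("healpix_29 is outside the valid order-29 range")
    # Each coarser level merges 4 child pixels into 1, i.e. drops 2 low bits.
    return healpix_29 >> (2 * (ORDER_MAX - order))


def task_for_pixel(pixel: int, n_tasks: int) -> int:
    """Hypothetical assignment of a partition pixel to one array task.

    Every pixel maps to exactly one task, so the tasks' work sets are disjoint.
    """
    if n_tasks <= 0:
        raise ValueError("n_tasks must be positive")
    return pixel % n_tasks


if __name__ == "__main__":
    obj_id = 4**29 + 12345            # an arbitrary order-29 index in base pixel 1
    part = coarsen(obj_id, order=5)   # coarse pixel used as the partition key
    print(part, task_for_pixel(part, n_tasks=64))
```

Because the partition key is derived purely from the object identity, two submissions that agree on the order and task count would agree on which task owns each object without any coordination.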

Review focus

These are drafts for review. Feedback is most useful on the false-match protocol (§3.4.1), the consistency model (§3.2), and the resource/throughput model (§3.8).
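
To make the false-match discussion concrete, the following is a rough sketch of why a fixed match radius can be too tight for nearby or fast-moving stars. It is not the §3.4.1 protocol itself, which lives in 01-methodology.md. The function name, its parameters, and the additive way the terms are combined are all assumptions for illustration. Proper-motion components are taken in mas/yr (with the right-ascension component already including the cos(dec) factor), parallax in mas, and the parallax term uses 2 × parallax as a conservative bound on apparent displacement between two arbitrary epochs.

```python
import math


def allowed_separation_arcsec(
    base_radius_arcsec: float,
    pmra_mas_yr: float,
    pmdec_mas_yr: float,
    parallax_mas: float,
    dt_years: float,
) -> float:
    """Hypothetical match radius widened for real and apparent motion.

    base_radius_arcsec : radius covering astrometric error alone (assumed input)
    pmra_mas_yr        : proper motion in RA, including the cos(dec) factor
    pmdec_mas_yr       : proper motion in Dec
    parallax_mas       : parallax; non-positive values contribute nothing
    dt_years           : epoch difference between the two catalogs
    """
    # Real displacement: total proper motion times the epoch baseline.
    pm_shift_mas = math.hypot(pmra_mas_yr, pmdec_mas_yr) * abs(dt_years)
    # Apparent displacement: the parallactic ellipse has semi-major axis equal
    # to the parallax, so two epochs differ by at most twice that.
    plx_shift_mas = 2.0 * max(parallax_mas, 0.0)
    return base_radius_arcsec + (pm_shift_mas + plx_shift_mas) / 1000.0


if __name__ == "__main__":
    # A star moving 5 mas/yr over a 10-year baseline with a 100 mas parallax.
    print(allowed_separation_arcsec(1.0, 3.0, 4.0, 100.0, 10.0))
```

A real protocol would likely propagate positions to a common epoch before matching rather than only widening the radius; the sketch just shows the size of the effect being accounted for.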

Specification for building a cluster-scale, cross-matched multimodal dataset (stars + galaxies + AGN; images + spectra + light curves + tabular) to extend the Platonic Universe program, by extending and hardening the existing v4/OmniSky pipeline. Covers: global HEALPix-order-29 object identity; a multi-submission consistency model for shared-filesystem clusters (disjoint partitions, atomic writes, optional claim ledger, global per-service rate limits, single HF uploader); lsdb/HATS access over HuggingFace for MultimodalUniverse sources; a motion-aware false-match protocol that accounts for real (proper-motion) and apparent (parallax) displacement; a CPU/GPU/network resource model; integrity-checked DONE-marker resume with a release audit; and a per-source data access matrix. Drafts for review.
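
As a rough illustration of the atomic-write and integrity-checked DONE-marker ideas listed above, here is a minimal standard-library sketch. The marker layout (a `.DONE` JSON file holding a SHA-256 digest and byte size) and all function names are assumptions made for this example, not the format the spec defines. It relies on `os.replace` being atomic for a rename within one filesystem on POSIX; behaviour on a particular cluster filesystem should be confirmed separately.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(path: Path, data: bytes) -> None:
    """Write to a temp file in the same directory, then rename into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())   # push bytes to storage before the rename
        os.replace(tmp, path)       # readers see the old file or the new one, never a partial one
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def marker_path(output: Path) -> Path:
    return output.with_name(output.name + ".DONE")


def write_done_marker(output: Path) -> Path:
    """Record size and SHA-256 of a finished output (hypothetical marker format)."""
    data = output.read_bytes()
    meta = {"size": len(data), "sha256": hashlib.sha256(data).hexdigest()}
    marker = marker_path(output)
    atomic_write_bytes(marker, json.dumps(meta).encode("utf-8"))
    return marker


def is_done(output: Path) -> bool:
    """Treat a partition as complete only if the marker matches the bytes on disk."""
    marker = marker_path(output)
    if not (output.exists() and marker.exists()):
        return False
    try:
        meta = json.loads(marker.read_text(encoding="utf-8"))
    except ValueError:
        return False                # unreadable marker: redo the work
    data = output.read_bytes()
    return meta.get("size") == len(data) and meta.get("sha256") == hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        out = Path(d) / "partition_0001.bin"
        atomic_write_bytes(out, b"rows")
        write_done_marker(out)
        print(is_done(out))
```

On resume, a job would skip any partition for which `is_done` returns true and redo the rest; a release audit could run the same check over every partition before upload.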

Keeeeeeeks mentioned this pull request Jun 26, 2026

Add implementation plan: review + phased build plan #2

Open

3 tasks

Keeeeeeeks wants to merge 1 commit into main from add-pipeline-spec
