
Add pipeline-spec: v5 cross-match dataset methodology + script spec #1

Open
Keeeeeeeks wants to merge 1 commit into main from add-pipeline-spec



Keeeeeeeks (Collaborator) commented Jun 25, 2026

Summary

Adds the pipeline-spec/ folder, a specification for building a cluster-scale, cross-matched, multimodal dataset (stars + galaxies + AGN; images + spectra + light curves + tabular) that extends the Platonic Universe program. It builds on and hardens the existing v4/OmniSky pipeline.

  • 01-methodology.md: global HEALPix-order-29 object identity; a multi-submission consistency model for shared-filesystem clusters (disjoint partitions, atomic writes, optional claim ledger, global per-service rate limits, single HF uploader); lsdb/HATS access over HF for MultimodalUniverse sources; a motion-aware false-match protocol that accounts for real (proper-motion) and apparent (parallax) displacement; a CPU/GPU/network resource model; DONE-marker validation logic; and a per-source Data Access Matrix.
  • 02-script-spec.md: concrete module/CLI layout, partition-aware SLURM arrays on CPU, and acceptance criteria.
  • 03-v4-walkthrough.md: a walkthrough of the v4 OmniSky logic and of how v5 builds on it.
  • 04-downstream-analysis-interface.md: the interface to the PRH (mutual-kNN/CKA) analysis. This doc is lower fidelity than the others because other team members will handle that work, but it seemed worth including so we can compare notes.
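
For reviewers who want a concrete picture of the atomic-write and DONE-marker items above, here is a minimal sketch using only the Python standard library. It is not the code in the spec: the function names, the JSON manifest layout, and the marker file name `DONE` are assumptions made for illustration. It also assumes that rename is atomic on the target filesystem, which should be verified for the cluster's shared filesystem.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(path: Path, data: bytes) -> None:
    """Write data so that readers never observe a partially written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives next to the target so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_name, path)  # atomic rename (assumed for this filesystem)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def write_done_marker(partition_dir: Path) -> None:
    """Record size and SHA-256 of every output file, then publish the marker."""
    entries = {}
    for p in sorted(partition_dir.iterdir()):
        if p.is_file() and p.name != "DONE" and not p.name.endswith(".tmp"):
            entries[p.name] = {
                "size": p.stat().st_size,
                "sha256": hashlib.sha256(p.read_bytes()).hexdigest(),
            }
    payload = json.dumps(entries, sort_keys=True).encode()
    atomic_write_bytes(partition_dir / "DONE", payload)


def partition_is_done(partition_dir: Path) -> bool:
    """Trust a DONE marker only if every listed file still matches it."""
    marker = partition_dir / "DONE"
    if not marker.is_file():
        return False
    try:
        entries = json.loads(marker.read_text())
    except json.JSONDecodeError:
        return False
    for name, meta in entries.items():
        p = partition_dir / name
        if not p.is_file() or p.stat().st_size != meta["size"]:
            return False
        if hashlib.sha256(p.read_bytes()).hexdigest() != meta["sha256"]:
            return False
    return True
```

On resume, a job in this sketch would skip a partition only when `partition_is_done` returns `True`. A marker that exists but no longer matches its files is treated the same as a missing marker.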

Review focus

These are drafts for review. Feedback is most useful on the false-match protocol (§3.4.1), the consistency model (§3.2), and the resource/throughput model (§3.8).

Specification for building a cluster-scale, cross-matched multimodal dataset
(stars + galaxies + AGN; images + spectra + light curves + tabular) to extend the
Platonic Universe program, by extending and hardening the existing v4/OmniSky pipeline.

Covers: global HEALPix-order-29 object identity; a multi-submission consistency model
for shared-filesystem clusters (disjoint partitions, atomic writes, optional claim
ledger, global per-service rate limits, single HF uploader); lsdb/HATS access over
HuggingFace for MultimodalUniverse sources; a motion-aware false-match protocol that
accounts for real (proper-motion) and apparent (parallax) displacement; a CPU/GPU/
network resource model; integrity-checked DONE-marker resume with a release audit; and
a per-source data access matrix. Drafts for review.
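
As a rough illustration of what "motion-aware" can mean in a false-match check, the sketch below propagates a catalogue position to the observation epoch with a linear proper-motion model, then widens the match radius by the parallax. It is a simplified stand-in, not the protocol in §3.4.1: the function names, the small-angle propagation, and the choice to add one parallax to the tolerance are all assumptions.

```python
import math

MAS_PER_DEG = 3.6e6  # milliarcseconds per degree


def propagate(ra_deg, dec_deg, pmra_cosdec_mas_yr, pmdec_mas_yr, dt_yr):
    """Linear, small-angle proper-motion propagation over dt_yr years.

    pmra_cosdec_mas_yr is mu_alpha * cos(dec). Not valid near the poles.
    """
    dec_new = dec_deg + pmdec_mas_yr * dt_yr / MAS_PER_DEG
    cos_dec = math.cos(math.radians(dec_deg))
    ra_new = ra_deg + pmra_cosdec_mas_yr * dt_yr / MAS_PER_DEG / cos_dec
    return ra_new % 360.0, dec_new


def separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds (haversine formula)."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (
        math.sin((d2 - d1) / 2) ** 2
        + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2
    )
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0


def is_plausible_match(cat, obs_ra, obs_dec, dt_yr, radius_arcsec):
    """Compare an observed position with the motion-corrected catalogue position.

    cat is a dict with keys ra, dec, pmra, pmdec, parallax (deg, deg, mas/yr,
    mas/yr, mas). The tolerance grows by one parallax, an assumed bound on the
    apparent shift relative to the barycentric position.
    """
    ra_t, dec_t = propagate(cat["ra"], cat["dec"], cat["pmra"], cat["pmdec"], dt_yr)
    tolerance = radius_arcsec + abs(cat["parallax"]) / 1000.0
    return separation_arcsec(ra_t, dec_t, obs_ra, obs_dec) <= tolerance


# A star moving 1000 mas/yr in Dec drifts 10 arcsec in 10 years.
star = {"ra": 0.0, "dec": 0.0, "pmra": 0.0, "pmdec": 1000.0, "parallax": 0.0}
moved = is_plausible_match(star, 0.0, 10.0 / 3600.0, dt_yr=10.0, radius_arcsec=1.0)
static = is_plausible_match(star, 0.0, 10.0 / 3600.0, dt_yr=0.0, radius_arcsec=1.0)
```

In this example `moved` is `True` and `static` is `False`: a fixed 1-arcsecond radius rejects the true counterpart unless the epoch difference is taken into account.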