Add pipeline-spec: v5 cross-match dataset methodology + script spec #1
Open
Keeeeeeeks wants to merge 1 commit into
Specification for building a cluster-scale, cross-matched multimodal dataset (stars + galaxies + AGN; images + spectra + light curves + tabular) to extend the Platonic Universe program, by extending and hardening the existing v4/OmniSky pipeline. Covers: global HEALPix-order-29 object identity; a multi-submission consistency model for shared-filesystem clusters (disjoint partitions, atomic writes, optional claim ledger, global per-service rate limits, single HF uploader); lsdb/HATS access over HuggingFace for MultimodalUniverse sources; a motion-aware false-match protocol that accounts for real (proper-motion) and apparent (parallax) displacement; a CPU/GPU/network resource model; integrity-checked DONE-marker resume with a release audit; and a per-source data access matrix. Drafts for review.
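For reviewers, here is a minimal sketch of what "atomic writes" plus an "integrity-checked DONE marker" could look like on a shared filesystem. It is an illustration only: the function names (`atomic_write_bytes`, `mark_done`, `is_complete`), the `.DONE` suffix, and the JSON marker layout are assumptions, not taken from the spec. It also assumes rename within one directory is atomic on the cluster filesystem, which should be verified for the actual mount.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(path: Path, data: bytes) -> None:
    """Write to a temp file in the target directory, fsync, then rename into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())
        # Readers see either the old file or the complete new one, never a partial write
        # (assumes source and destination are on the same filesystem).
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def mark_done(payload: Path) -> Path:
    """Record size and SHA-256 of a finished payload in a sibling marker file."""
    digest = hashlib.sha256(payload.read_bytes()).hexdigest()
    marker = payload.with_name(payload.name + ".DONE")  # hypothetical naming scheme
    record = {"file": payload.name, "sha256": digest, "size": payload.stat().st_size}
    atomic_write_bytes(marker, json.dumps(record).encode())
    return marker


def is_complete(payload: Path) -> bool:
    """On resume, trust a partition only if its marker exists and still matches."""
    marker = payload.with_name(payload.name + ".DONE")
    if not (payload.exists() and marker.exists()):
        return False
    try:
        record = json.loads(marker.read_text())
    except json.JSONDecodeError:
        return False
    if record.get("size") != payload.stat().st_size:
        return False
    return record.get("sha256") == hashlib.sha256(payload.read_bytes()).hexdigest()
```

A resumed job would call `is_complete(path)` for each partition and redo only those that fail the check; a release audit could run the same check over every partition.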
Summary
Adds the `pipeline-spec/` folder: a specification for building a cluster-scale, cross-matched, multimodal dataset (stars + galaxies + AGN; images + spectra + light curves + tabular) to extend the Platonic Universe program, by extending and hardening the existing `v4/OmniSky` pipeline.
- `01-methodology.md`: global HEALPix-order-29 object identity; a multi-submission consistency model for shared-filesystem clusters (disjoint partitions, atomic writes, optional claim ledger, global per-service rate limits, single HF uploader); lsdb/HATS access over HF for MultimodalUniverse sources; false-match logic that accounts for real (proper-motion) or apparent (parallax) displacement; a CPU/GPU/network resource model; DONE-marker validation logic; and a per-source Data Access Matrix.
- `02-script-spec.md`: concrete module/CLI layout, partition-aware SLURM arrays on CPU, and acceptance criteria.
- `03-v4-walkthrough.md`: explains the v4 OmniSky logic and how v5 builds on top of v4.
- `04-downstream-analysis-interface.md`: the PRH (mutual-kNN/CKA) analysis. This doc is lower fidelity than the ones above and will be handled by other team members, but it seemed worth including to compare notes.
Review focus
Drafts for review. Feedback most useful on the false-match protocol (§3.4.1), the consistency model (§3.2), and the resource/throughput model (§3.8).
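To make the false-match discussion concrete, here is a rough sketch of a motion-aware positional check. It is not the protocol in §3.4.1. It assumes linear proper-motion propagation, a proper motion in right ascension that already includes the cos(dec) factor (the Gaia convention), and a match radius widened by the parallax as an upper bound on apparent displacement. All names (`propagate`, `separation_arcsec`, `is_match`, the `ref` dictionary keys) and the 1 arcsec base radius are hypothetical.

```python
import math

MAS_PER_DEG = 3_600_000.0


def propagate(ra_deg, dec_deg, pmra_masyr, pmdec_masyr, dt_yr):
    """Linear proper-motion propagation; pmra is assumed to include cos(dec)."""
    dec_new = dec_deg + pmdec_masyr * dt_yr / MAS_PER_DEG
    cos_dec = math.cos(math.radians(dec_deg))
    ra_new = ra_deg + pmra_masyr * dt_yr / (MAS_PER_DEG * cos_dec)
    return ra_new % 360.0, dec_new


def separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation (haversine formula) in arcseconds."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0


def is_match(ref, target_ra, target_dec, target_epoch, base_radius_arcsec=1.0):
    """Move the reference source to the target epoch, then compare positions."""
    dt = target_epoch - ref["epoch"]
    ra_p, dec_p = propagate(ref["ra"], ref["dec"], ref["pmra"], ref["pmdec"], dt)
    # Assumption: widen the radius by the parallax (mas -> arcsec) to cover the
    # apparent displacement of a nearby source from its barycentric position.
    radius = base_radius_arcsec + abs(ref["parallax"]) / 1000.0
    return separation_arcsec(ra_p, dec_p, target_ra, target_dec) <= radius
```

For a high-proper-motion source, comparing catalog positions without the propagation step can miss the true counterpart by many arcseconds over a baseline of a decade or more, which is the failure mode this kind of check is meant to catch.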