Parent
#45
What to build
Extend the data-loading path so users can supply per-allele read counts instead of collapsing to presence/absence at load time.
Long-form loading accepts an aggregate argument ("binary" default | "count") and an optional reads column (default 1 per row when absent). Count mode sums reads per (sample, locus, allele) into non-negative integer barcodes. Delimited wide loading passes aggregate through after pivoting to long form. Long-form export / round-trip reflects counts when data were loaded in count mode.
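A minimal sketch of the count-mode aggregation described above, in base R. The helper name `sum_reads` and the column names `sample`, `locus`, `allele`, and `reads` are assumptions for illustration, not the package's actual API.

```r
# Hypothetical helper: sum reads per (sample, locus, allele).
# Column names are assumed for illustration.
sum_reads <- function(df) {
  # Treat each row as a single read when no reads column is supplied.
  if (!"reads" %in% names(df)) df$reads <- 1L
  # Count barcodes must be non-negative integers.
  stopifnot(all(df$reads >= 0), all(df$reads == round(df$reads)))
  out <- stats::aggregate(reads ~ sample + locus + allele, data = df, FUN = sum)
  out$reads <- as.integer(out$reads)
  # Sort explicitly so the row order does not depend on aggregate() internals.
  out <- out[order(out$sample, out$locus, out$allele), ]
  rownames(out) <- NULL
  out
}

long <- data.frame(
  sample = c("s1", "s1", "s1", "s2"),
  locus  = c("L1", "L1", "L1", "L1"),
  allele = c("A", "A", "B", "A"),
  reads  = c(3, 2, 1, 4)
)

counts <- sum_reads(long)
# The two (s1, L1, A) rows collapse into a single row with reads = 5.

# Without a reads column, every row counts once.
implicit <- sum_reads(long[, c("sample", "locus", "allele")])
```

Binary barcodes can then be derived from the summed table as `counts$reads > 0`, which keeps presence semantics identical in both modes.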
The returned data object records the aggregation mode (e.g. as a metadata field) so downstream code can distinguish count barcodes from binary ones. Validation: raise a clear error when aggregate = "binary" and duplicate (sample, locus, allele) rows exist; document that counts greater than 1 require count mode.
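One way the duplicate check and the mode record could look, as a sketch only. The function name `check_binary_duplicates`, the `"aggregate"` attribute, and the error wording are all hypothetical; the real data object may store the mode differently.

```r
# Hypothetical validator; names and attribute are illustrative only.
check_binary_duplicates <- function(df, aggregate = c("binary", "count")) {
  aggregate <- match.arg(aggregate)
  keys <- df[, c("sample", "locus", "allele")]
  dup <- duplicated(keys)
  if (aggregate == "binary" && any(dup)) {
    first <- keys[which(dup)[1], ]
    stop(sprintf(
      'Found %d duplicate (sample, locus, allele) row(s), e.g. (%s, %s, %s); use aggregate = "count" to sum reads.',
      sum(dup),
      as.character(first$sample),
      as.character(first$locus),
      as.character(first$allele)
    ), call. = FALSE)
  }
  # Record the mode so downstream code can branch on it (attribute name assumed).
  attr(df, "aggregate") <- aggregate
  df
}

dups <- data.frame(
  sample = c("s1", "s1"),
  locus  = c("L1", "L1"),
  allele = c("A", "A")
)

# Binary mode rejects the duplicated key; capture the message instead of failing.
msg <- tryCatch(
  check_binary_duplicates(dups, "binary"),
  error = function(e) conditionMessage(e)
)

# Count mode accepts the same rows and tags the result.
ok <- check_binary_duplicates(dups, "count")
```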
Update genotyping initialization so allele presence means count > 0 (not sum of read depths): observed COI hint, Jaccard similarity, and clustering-based initialization all use presence semantics consistent with count data. Uninformative loci (single allele) continue to be removed at load time.
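The presence semantics can be sketched as follows. This assumes a samples x alleles count matrix plus a vector giving each allele column's locus, and it assumes the observed COI hint is the largest number of distinct alleles a sample shows at any single locus; the actual initialization code may define these differently.

```r
# counts: samples x alleles matrix of non-negative read counts (layout assumed).
counts <- matrix(
  c(12,  0, 3, 0,
     1, 40, 0, 7),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("s1", "s2"), c("L1.A", "L1.B", "L2.A", "L2.B"))
)
locus <- c("L1", "L1", "L2", "L2")  # locus of each allele column

# Presence means count > 0, regardless of read depth.
presence <- counts > 0

# Observed COI hint (assumed definition): max alleles present at any one locus.
observed_coi <- function(presence, locus) {
  # rowsum() groups rows, so transpose to alleles x samples first.
  per_locus <- t(rowsum(t(presence) * 1L, group = locus))  # samples x loci
  apply(per_locus, 1, max)
}

# Jaccard similarity on presence vectors: |intersection| / |union|.
jaccard <- function(a, b) {
  n_union <- sum(a | b)
  if (n_union == 0) return(NA_real_)  # undefined when neither sample has alleles
  sum(a & b) / n_union
}

coi_hint <- observed_coi(presence, locus)
sim <- jaccard(presence["s1", ], presence["s2", ])
```

Here s2 shows two alleles at L1, so its hint is 2 even though one of them has only a single read; summing read depths instead would inflate both the hint and the similarity.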
Acceptance criteria
- load_long_form_data(..., aggregate = "binary" | "count") with optional reads column; default behavior unchanged (aggregate = "binary")
- load_delimited_data forwards aggregate to long-form loader
- Tests in test-utils.R (or new test file) for aggregate, reads, validation, round-trip
Blocked by
None - can start immediately