Count data loading and genotyping initialization #47

@m-murphy

Description

Parent

#45

What to build

Extend the data-loading path so users can supply per-allele read counts instead of collapsing to presence/absence at load time.

Long-form loading accepts an aggregate argument ("binary", the default, or "count") and an optional reads column; when the column is absent, each row counts as 1 read. Count mode sums reads per (sample, locus, allele) into non-negative integer barcodes. Delimited wide loading pivots to long form and then passes aggregate through to the long-form loader. Long-form export and round-trip reflect counts when the data were loaded in count mode.
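A minimal sketch of the count-mode aggregation described above, written in Python purely as an illustration (the package itself is R). The function name `sum_reads`, the dict-per-row representation, and the rejection of negative or non-integer reads are assumptions made for this example, not the package's API.

```python
from collections import defaultdict


def sum_reads(rows):
    """Sum reads per (sample, locus, allele); a row without reads counts as 1."""
    totals = defaultdict(int)
    for row in rows:
        reads = row.get("reads", 1)
        # Assumed validation rule: reads must be a non-negative integer.
        if isinstance(reads, bool) or not isinstance(reads, int) or reads < 0:
            raise ValueError(f"reads must be a non-negative integer, got {reads!r}")
        totals[(row["sample"], row["locus"], row["allele"])] += reads
    return dict(totals)


rows = [
    {"sample": "S1", "locus": "L1", "allele": "A", "reads": 3},
    {"sample": "S1", "locus": "L1", "allele": "A", "reads": 2},
    {"sample": "S1", "locus": "L1", "allele": "B"},  # no reads value: counts as 1
]
counts = sum_reads(rows)
# counts == {("S1", "L1", "A"): 5, ("S1", "L1", "B"): 1}
```

The two rows for allele A collapse into a single entry of 5 reads, while allele B falls back to the default of 1.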

The returned data object records the aggregation mode (e.g. in a metadata field) so downstream code can distinguish count barcodes from binary ones. Validation: raise a clear error when aggregate = "binary" and duplicate (sample, locus, allele) rows exist, and document that counts greater than 1 require count mode.
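The validation and metadata requirements could look roughly like the following Python sketch. The function name `load_binary`, the shape of the returned dict, and the wording of the error message are all hypothetical; the only points taken from the text are that duplicates in binary mode raise an error and that the result records the aggregation mode.

```python
from collections import Counter


def load_binary(rows):
    """Build presence/absence barcodes, refusing duplicate (sample, locus, allele) rows."""
    keys = [(row["sample"], row["locus"], row["allele"]) for row in rows]
    duplicates = sorted(key for key, n in Counter(keys).items() if n > 1)
    if duplicates:
        # Actionable message: say what is wrong, show an example, suggest the fix.
        raise ValueError(
            f"{len(duplicates)} duplicated (sample, locus, allele) combination(s), "
            f"e.g. {duplicates[0]}; use aggregate = 'count' to sum reads instead"
        )
    return {
        "barcodes": {key: 1 for key in keys},
        "metadata": {"aggregate": "binary"},  # hypothetical metadata field
    }


data = load_binary([
    {"sample": "S1", "locus": "L1", "allele": "A"},
    {"sample": "S1", "locus": "L1", "allele": "B"},
])
```

Downstream code can then branch on `data["metadata"]["aggregate"]` rather than inspecting barcode values to guess the mode.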

Update genotyping initialization so that allele presence means count > 0 (not the sum of read depths): the observed COI hint, Jaccard similarity, and clustering-based initialization all use presence semantics consistent with count data. Uninformative loci (single allele) continue to be removed at load time.
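To make the presence semantics concrete, here is a small Python sketch. It assumes a per-sample barcode stored as a nested dict (locus to allele to count); the function names are hypothetical, and returning 0.0 for two empty samples is a convention chosen for this example only.

```python
def observed_coi(barcode):
    """Max over loci of the number of alleles with count > 0."""
    return max(
        (sum(1 for count in alleles.values() if count > 0) for alleles in barcode.values()),
        default=0,
    )


def present_alleles(barcode):
    """Set of (locus, allele) pairs whose count is > 0."""
    return {
        (locus, allele)
        for locus, alleles in barcode.items()
        for allele, count in alleles.items()
        if count > 0
    }


def jaccard(barcode_a, barcode_b):
    """Jaccard similarity on present alleles; read depth is ignored."""
    a, b = present_alleles(barcode_a), present_alleles(barcode_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


s1 = {"L1": {"A": 120, "B": 3}, "L2": {"A": 40, "B": 0}}
s2 = {"L1": {"A": 1}, "L2": {"B": 7}}

coi_hint = observed_coi(s1)   # 2 alleles present at L1, 1 at L2
similarity = jaccard(s1, s2)  # 1 shared allele out of 4 present overall
```

In this example the hint for `s1` is 2, whereas summing read depths at L1 would give 123; that gap is the error the presence rule avoids.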

Acceptance criteria

  • load_long_form_data(..., aggregate = "binary" | "count") with optional reads column; default behavior unchanged (aggregate = "binary")
  • load_delimited_data forwards aggregate to long-form loader
  • Count aggregation produces correct summed barcodes; binary aggregation unchanged
  • Returned data object includes aggregation mode metadata
  • Duplicate-row error in binary mode with actionable message
  • Long-form export preserves counts in count mode
  • Genotyping observed-COI initialization uses the number of alleles with count > 0 per locus (max over loci), not the sum of depths
  • Jaccard / clustering init treat allele as present when count > 0
  • R testthat coverage in test-utils.R (or new test file) for aggregate, reads, validation, round-trip
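The round-trip criterion can be expressed as a simple invariant: exporting count-mode data to long form and loading it again should reproduce the same counts. The Python sketch below illustrates that check with hypothetical helper names (`to_long_form`, `from_long_form`); the real coverage would live in the R testthat suite.

```python
from collections import Counter


def to_long_form(counts):
    """Export counts as long-form rows, one row per (sample, locus, allele)."""
    return [
        {"sample": sample, "locus": locus, "allele": allele, "reads": reads}
        for (sample, locus, allele), reads in sorted(counts.items())
    ]


def from_long_form(rows):
    """Re-load long-form rows in count mode, summing reads (default 1)."""
    counts = Counter()
    for row in rows:
        counts[(row["sample"], row["locus"], row["allele"])] += row.get("reads", 1)
    return dict(counts)


original = {("S1", "L1", "A"): 5, ("S1", "L1", "B"): 1, ("S2", "L1", "A"): 2}
round_tripped = from_long_form(to_long_form(original))
```

If export dropped the reads column or collapsed counts to 1, `round_tripped` would differ from `original`.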

Blocked by

None - can start immediately

Metadata

    Labels

    enhancement (New feature or request)
    ready-for-agent (Fully specified, ready for an AFK agent)
