PRD: Count-based observation model (alternative to binary error process)

## Problem Statement

moire's current observation process treats genotyping data as binary presence/absence of alleles at each locus, typically derived from read counts via an upstream thresholding step. This discards information in the number of reads attributable to each allele. The model compensates with a simple per-allele error process (false positives and false negatives, scaled by locus width), but that process cannot distinguish a high-confidence allele call from a low-depth artifact, nor model dropout (amplification failure on truly present alleles) separately from contamination and sequencing-artifact false positives. Analysts who have raw or aggregated read counts must collapse them to presence/absence before running moire, losing signal that could improve estimates of complexity of infection (COI), population allele frequencies, and within-host relatedness.

## Solution

Add an alternative, opt-in count-based observation model that uses per-allele read counts as the primary input. False positives are modeled as noise reads on truly absent alleles (contamination / sequencing artifact). False negatives are modeled as dropout on truly present alleles (present in the latent genotype support set but emitting noise-level reads). The latent genotype remains a set of distinct alleles whose cardinality is bounded by COI—the same formulation that keeps transmission likelihood tractable with respect to COI. Transmission, relatedness, population structure, and COI priors are unchanged. Users select the observation model at MCMC runtime; data loading supports aggregating long-form input to per-allele counts instead of binary barcodes. The existing binary observation model remains the default for backward compatibility.

## User Stories

1. As a moire user with amplicon read-count data, I want to load per-allele counts directly, so that I do not have to apply a hard presence/absence threshold before inference.

2. As a moire user, I want to choose between binary and count observation models when running MCMC, so that I can compare approaches without changing my downstream analysis workflow.

3. As a moire user on the default binary path, I want existing behavior unchanged, so that published analyses and scripts continue to work without modification.

4. As a moire user loading long-form data, I want an optional `reads` column (defaulting to one read per row), so that I can supply counts either as repeated rows or explicit counts.

5. As a moire user loading long-form data, I want an `aggregate` option to produce count barcodes instead of binary barcodes, so that the internal representation matches the count observation model.

6. As a moire user loading delimited wide data, I want count aggregation to apply after pivoting to long form, so that wide-format workflows stay supported.

7. As a moire user running the count model, I want false positive reads modeled as Poisson noise on absent alleles, so that low-depth artifact alleles are treated as uncertain rather than definitive calls.

8. As a moire user running the count model, I want dropout modeled as a mixture on present alleles (signal vs noise), so that truly present alleles with zero or few reads remain plausible.

9. As a moire user, I want count-model FP and dropout rates to reuse the existing per-sample `eps_pos` and `eps_neg` MCMC parameters and Beta priors, so that tuning and interpretation stay familiar.

10. As a moire user, I want `eps_pos` mapped to a noise rate scaled by total locus depth and allele count, so that the FP parameter retains a per-locus expected-error interpretation similar to the binary model.

11. As a moire user, I want `eps_neg` mapped to a dropout probability on present alleles, so that the FN parameter retains a dropout interpretation similar to the binary model.

12. As a moire developer, I want observation likelihood isolated behind a small interface, so that binary and count models can be tested and extended without scattering conditionals through the MCMC chain.

13. As a moire developer, I want the binary observation logic moved behind that interface without behavior change, so that the refactor is safe before adding count logic.

14. As a moire developer, I want count observation likelihood computed per (sample, locus) from the latent support set and count vector, so that parallel observation likelihood evaluation stays embarrassingly parallel.

15. As a moire developer, I want the latent genotype sampler updated to use allele-specific FP/FN weights derived from count emissions, so that MH proposals for COI and sample-level parameters remain valid with correct adjustment terms.

16. As a moire developer, I want the combinatorial structure of latent genotype sampling preserved (FP/FN count constraints, uniform choice over combinations, log-prob correction), so that COI constraints and at-least-one-allele constraints still hold.

17. As a moire developer, I want transmission likelihood and P(any missing) logic untouched, so that COI marginalization properties of the current model are preserved.

18. As a moire developer, I want `observed_coi` initialization to use alleles with positive read counts (not sum of depths), so that naive COI hints remain sensible in count mode.

19. As a moire user validating models, I want a count-aware simulator that generates read counts from true strain allele allocations, so that I can run simulation–recovery tests.

20. As a moire user, I want clear errors when count barcodes are paired with the binary model or binary barcodes with the count model, so that misconfiguration is caught early.

21. As a moire user converting data back to long form, I want count-aware round-trip behavior, so that export and visualization reflect read counts where applicable.

22. As a moire developer, I want C++ unit tests for count log-likelihood on hand-computed cases, so that numerical correctness is verified without full MCMC runs.

23. As a moire developer, I want R integration tests that run short MCMC with count data, so that the Rcpp path and MultiVector layout remain compatible.

24. As a moire user reading documentation, I want `run_mcmc` and data-loading docs to describe `observation_model` and count aggregation, so that I can adopt the feature without reading C++.

25. As a moire user, I want a NEWS entry describing the new observation model, so that I know what changed in the release.

26. As a moire developer, I want mixture log-likelihoods for present alleles evaluated with log-sum-exp, so that underflow does not corrupt inference.

27. As a moire developer, I want Poisson means floored at a small epsilon, so that zero-depth loci do not produce numerical errors.

28. As a moire user with missing loci, I want missing loci to contribute zero observation likelihood as today, so that the missing-data mechanism is unchanged.

29. As a moire user with uninformative loci (single allele), I want them removed at load time as today, so that data hygiene is consistent across observation models.

30. As a moire developer, I want Jaccard similarity and clustering-based initialization to treat an allele as present when its count is greater than zero, so that initialization heuristics remain coherent in count mode.

## Implementation Decisions

### Observation model module (deep module)

Introduce an observation-model abstraction with a narrow interface:

- **Log-likelihood**: given latent genotype support set S, per-allele count vector, per-sample FP/dropout parameters, and locus context, return log P(observation | S, parameters).
- **Latent genotype proposal**: given observed counts, COI bound m, and FP/dropout parameters, return a proposed support set S and log probability for MH adjustment (same contract as today's `sample_latent_genotype`).
- **Simulation** (R layer may call R or C++): given true per-allele strain allocation or support, generate a count vector for validation.

Two implementations:

1. **Binary observation model** — encapsulates current TP/TN/FP/FN independent Bernoulli error logic with per-allele rates scaled by locus allele count. No change to inferred results after refactor.

2. **Count Poisson observation model (v1)** — v1 emission spec:

```
Absent (a ∉ S):     n_a ~ Poisson(λ_s / A)
Present (a ∈ S):    n_a ~ (1 - π_s) * Poisson(μ_a) + π_s * Poisson(λ_s / A)
                    μ_a = N / |S|
```

Where N = sum of counts at the locus, A = number of alleles at the locus.

Parameter mapping from existing MCMC state:

- π_s = eps_neg_s × max_eps_neg (dropout probability)
- λ_s = eps_pos_s × max_eps_pos × (N / A) (noise rate for FP mechanism)

Present-allele emissions use a two-component Poisson mixture marginalized in log-space (log-sum-exp). Absent-allele emissions use a single Poisson. v1 conditions on observed total depth N rather than modeling N separately.

The MCMC chain holds a unique pointer (or equivalent) to the selected observation model, constructed from a runtime `observation_model` parameter (`"binary"` | `"counts"`), default `"binary"`.

### Data loading module

Extend long-form loading with:

- `aggregate`: `"binary"` (default) | `"count"`.
- Optional `reads` column on long-form input; default 1 per row when absent.

Binary aggregation: allele present → 1, absent → 0 (current behavior).

Count aggregation: sum `reads` per (sample_id, locus, allele).

Wide delimited loading passes aggregation through to long-form loader.

Validation:

- Reject or error when `aggregate = "binary"` and duplicate (sample, locus, allele) rows exist without aggregation (or warn and sum—decision: error with clear message preferred for binary mode).
- Reject count barcodes (any value > 1) when `observation_model = "binary"`.
- Reject binary semantics when `observation_model = "counts"` and data are strictly 0/1 without documented threshold—at minimum, allow 0/1 counts but document that counts > 1 require count mode.

Internal data structure remains list with `data` (locus-indexed list of per-sample barcode vectors), `sample_ids`, `loci`, `is_missing`; barcode integers become non-negative counts in count mode.

### Genotyping data / observed COI

Genotyping data container continues to store per-(sample, locus, allele) integers. Update naive observed COI computation: for count mode, count alleles with n_a > 0 per locus, take max over loci (not sum of read depths).

Clustering initialization and Jaccard similarity: allele present if count > 0.

### Chain and sampler integration

`calculate_observation_likelihood` delegates to observation model log-likelihood.

All call sites of `sample_latent_genotype` pass through observation model (or observation model provides latent proposal).

`calc_transmission_process`, transmission updates, population frequency updates, COI updates structure unchanged; only observation likelihood and latent resampling paths differ by model.

Latent genotype storage format unchanged (sorted allele indices, -1 padding).

### Parameters and R API

Add `observation_model` to MCMC arguments and C++ Parameters, default `"binary"`.

Add `aggregate` to data loading functions, default `"binary"`.

Existing `eps_pos`, `eps_neg`, Beta priors, `max_eps_pos`, `max_eps_neg` retained; no new hyperparameters in v1.

`run_mcmc` passes `observation_model` through to Rcpp unchanged.

### Phased delivery (recommended PR sequence)

1. Observation model interface + binary refactor (no user-visible change).
2. Count data loading + validation.
3. Count Poisson log-likelihood.
4. Count-aware latent genotype sampler.
5. Simulator, docs, integration tests.

Each phase independently mergeable; phases 3–4 depend on 2.

### Numerical details

- Poisson log-PMF via stable lgamma formulation consistent with existing sampler math.
- Mixture for present alleles: log((1-π)·Pois(n|μ) + π·Pois(n|λ/A)).
- Floor μ and λ/A at small positive epsilon to avoid log(0).

## Testing Decisions

**What makes a good test:** Exercise externally observable behavior—loaded data shapes, likelihood values on fixed inputs, MCMC smoke runs, simulation–recovery trends—not internal class names or private helpers. Prefer hand-computed likelihood cases where closed forms are easy. Keep MCMC tests minimal (very low burnin/samples) to avoid CI timeouts; skip on CRAN consistent with existing `test-mcmc.R`.

**Modules to test:**

| Module | Test type | Prior art |
|--------|-----------|-----------|
| Observation model interface + binary wrapper | C++ unit tests | `cpp/tests/multivector_tests.cpp`, `cpp/tests/zeta_mobius_tests.cpp` style |
| Count Poisson log-likelihood | C++ unit tests with hand-computed log probs | New `observation_model` test file |
| Data loading (`aggregate`, `reads`) | R testthat | `tests/testthat/test-utils.R` |
| Latent genotype sampler (count weights) | C++ or R tests on small alleles + fixed counts | Sampler logic tests (new) |
| End-to-end MCMC count mode | R testthat smoke | `tests/testthat/test-mcmc.R` |
| Simulator counts | R testthat | `R/simulator.R` / existing simulator tests if any |

**Regression:** After binary refactor, verify binary observation likelihood matches pre-refactor values on fixed fixtures.

**Optional:** Simulation–recovery test with short MCMC on simulated count data; qualitative check that posteriors are sensible—not strict tolerance on stochastic output.

## Out of Scope

- Copy-number multiset latent state (per-strain copy counts per allele).
- HMC or NUTS for continuous parameters.
- Explicit marginalization over all support sets S (enumeration alternative to auxiliary latent sampling).
- NegBin overdispersion for signal or noise (Poisson-only in v1).
- Per-locus efficiency factors, per-allele artifact rates, or hierarchical calibration from control samples.
- Modeling total depth N as a separate stochastic node (v1 conditions on observed N).
- Changes to transmission likelihood, P(any missing), relatedness formulation, or population mixture structure.
- Automatic allele calling or threshold selection from counts (counts are input; model replaces post-hoc thresholding, not upstream variant calling).
- Breaking changes to default binary workflow.

## Further Notes

The count observation model is designed around preserving the latent genotype as a **support set of distinct alleles** bounded by COI, because transmission likelihood already marginalizes strain assignments and relatedness given that set. Read counts inform membership in S through the observation layer only; they do not require abandoning the COI-friendly transmission formulation.

Future v2 extensions (not this PRD): NegBin emissions, per-locus noise/dropout hierarchies, empirical noise calibration from mono-infection controls, optional `min_reads` for observed-COI initialization.

Default remains `observation_model = "binary"` and `aggregate = "binary"` so manuscript-era analyses and existing scripts are unaffected.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PRD: Count-based observation model (alternative to binary error process) #45

Problem Statement

Solution

User Stories

Implementation Decisions

Observation model module (deep module)

Data loading module

Genotyping data / observed COI

Chain and sampler integration

Parameters and R API

Phased delivery (recommended PR sequence)

Numerical details

Testing Decisions

Out of Scope

Further Notes

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Module	Test type	Prior art
Observation model interface + binary wrapper	C++ unit tests	`cpp/tests/multivector_tests.cpp`, `cpp/tests/zeta_mobius_tests.cpp` style
Count Poisson log-likelihood	C++ unit tests with hand-computed log probs	New `observation_model` test file
Data loading (`aggregate`, `reads`)	R testthat	`tests/testthat/test-utils.R`
Latent genotype sampler (count weights)	C++ or R tests on small alleles + fixed counts	Sampler logic tests (new)
End-to-end MCMC count mode	R testthat smoke	`tests/testthat/test-mcmc.R`
Simulator counts	R testthat	`R/simulator.R` / existing simulator tests if any

PRD: Count-based observation model (alternative to binary error process) #45

Description

Problem Statement

Solution

User Stories

Implementation Decisions

Observation model module (deep module)

Data loading module

Genotyping data / observed COI

Chain and sampler integration

Parameters and R API

Phased delivery (recommended PR sequence)

Numerical details

Testing Decisions

Out of Scope

Further Notes

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions