Problem Statement
moire's current observation process treats genotyping data as binary presence/absence of alleles at each locus, typically derived from read counts via an upstream thresholding step. This discards information in the number of reads attributable to each allele. The model compensates with a simple per-allele error process (false positives and false negatives, scaled by locus width), but that process cannot distinguish a high-confidence allele call from a low-depth artifact, nor model dropout (amplification failure on truly present alleles) separately from contamination and sequencing-artifact false positives. Analysts who have raw or aggregated read counts must collapse them to presence/absence before running moire, losing signal that could improve estimates of complexity of infection (COI), population allele frequencies, and within-host relatedness.
Solution
Add an alternative, opt-in count-based observation model that uses per-allele read counts as the primary input. False positives are modeled as noise reads on truly absent alleles (contamination / sequencing artifact). False negatives are modeled as dropout on truly present alleles (present in the latent genotype support set but emitting noise-level reads). The latent genotype remains a set of distinct alleles whose cardinality is bounded by COI—the same formulation that keeps transmission likelihood tractable with respect to COI. Transmission, relatedness, population structure, and COI priors are unchanged. Users select the observation model at MCMC runtime; data loading supports aggregating long-form input to per-allele counts instead of binary barcodes. The existing binary observation model remains the default for backward compatibility.
User Stories
-
As a moire user with amplicon read-count data, I want to load per-allele counts directly, so that I do not have to apply a hard presence/absence threshold before inference.
-
As a moire user, I want to choose between binary and count observation models when running MCMC, so that I can compare approaches without changing my downstream analysis workflow.
-
As a moire user on the default binary path, I want existing behavior unchanged, so that published analyses and scripts continue to work without modification.
-
As a moire user loading long-form data, I want an optional reads column (defaulting to one read per row), so that I can supply counts either as repeated rows or explicit counts.
-
As a moire user loading long-form data, I want an aggregate option to produce count barcodes instead of binary barcodes, so that the internal representation matches the count observation model.
-
As a moire user loading delimited wide data, I want count aggregation to apply after pivoting to long form, so that wide-format workflows stay supported.
-
As a moire user running the count model, I want false positive reads modeled as Poisson noise on absent alleles, so that low-depth artifact alleles are treated as uncertain rather than definitive calls.
-
As a moire user running the count model, I want dropout modeled as a mixture on present alleles (signal vs noise), so that truly present alleles with zero or few reads remain plausible.
-
As a moire user, I want count-model FP and dropout rates to reuse the existing per-sample eps_pos and eps_neg MCMC parameters and Beta priors, so that tuning and interpretation stay familiar.
-
As a moire user, I want eps_pos mapped to a noise rate scaled by total locus depth and allele count, so that the FP parameter retains a per-locus expected-error interpretation similar to the binary model.
-
As a moire user, I want eps_neg mapped to a dropout probability on present alleles, so that the FN parameter retains a dropout interpretation similar to the binary model.
-
As a moire developer, I want observation likelihood isolated behind a small interface, so that binary and count models can be tested and extended without scattering conditionals through the MCMC chain.
-
As a moire developer, I want the binary observation logic moved behind that interface without behavior change, so that the refactor is safe before adding count logic.
-
As a moire developer, I want count observation likelihood computed per (sample, locus) from the latent support set and count vector, so that parallel observation likelihood evaluation stays embarrassingly parallel.
-
As a moire developer, I want the latent genotype sampler updated to use allele-specific FP/FN weights derived from count emissions, so that MH proposals for COI and sample-level parameters remain valid with correct adjustment terms.
-
As a moire developer, I want the combinatorial structure of latent genotype sampling preserved (FP/FN count constraints, uniform choice over combinations, log-prob correction), so that COI constraints and at-least-one-allele constraints still hold.
-
As a moire developer, I want transmission likelihood and P(any missing) logic untouched, so that COI marginalization properties of the current model are preserved.
-
As a moire developer, I want observed_coi initialization to use alleles with positive read counts (not sum of depths), so that naive COI hints remain sensible in count mode.
-
As a moire user validating models, I want a count-aware simulator that generates read counts from true strain allele allocations, so that I can run simulation–recovery tests.
-
As a moire user, I want clear errors when count barcodes are paired with the binary model or binary barcodes with the count model, so that misconfiguration is caught early.
-
As a moire user converting data back to long form, I want count-aware round-trip behavior, so that export and visualization reflect read counts where applicable.
-
As a moire developer, I want C++ unit tests for count log-likelihood on hand-computed cases, so that numerical correctness is verified without full MCMC runs.
-
As a moire developer, I want R integration tests that run short MCMC with count data, so that the Rcpp path and MultiVector layout remain compatible.
-
As a moire user reading documentation, I want run_mcmc and data-loading docs to describe observation_model and count aggregation, so that I can adopt the feature without reading C++.
-
As a moire user, I want a NEWS entry describing the new observation model, so that I know what changed in the release.
-
As a moire developer, I want mixture log-likelihoods for present alleles evaluated with log-sum-exp, so that underflow does not corrupt inference.
-
As a moire developer, I want Poisson means floored at a small epsilon, so that zero-depth loci do not produce numerical errors.
-
As a moire user with missing loci, I want missing loci to contribute zero observation likelihood as today, so that the missing-data mechanism is unchanged.
-
As a moire user with uninformative loci (single allele), I want them removed at load time as today, so that data hygiene is consistent across observation models.
-
As a moire developer, I want Jaccard similarity and clustering-based initialization to treat an allele as present when its count is greater than zero, so that initialization heuristics remain coherent in count mode.
Implementation Decisions
Observation model module (deep module)
Introduce an observation-model abstraction with a narrow interface:
- Log-likelihood: given latent genotype support set S, per-allele count vector, per-sample FP/dropout parameters, and locus context, return log P(observation | S, parameters).
- Latent genotype proposal: given observed counts, COI bound m, and FP/dropout parameters, return a proposed support set S and log probability for MH adjustment (same contract as today's
sample_latent_genotype).
- Simulation (R layer may call R or C++): given true per-allele strain allocation or support, generate a count vector for validation.
Two implementations:
-
Binary observation model — encapsulates current TP/TN/FP/FN independent Bernoulli error logic with per-allele rates scaled by locus allele count. No change to inferred results after refactor.
-
Count Poisson observation model (v1) — v1 emission spec:
Absent (a ∉ S): n_a ~ Poisson(λ_s / A)
Present (a ∈ S): n_a ~ (1 - π_s) * Poisson(μ_a) + π_s * Poisson(λ_s / A)
μ_a = N / |S|
Where N = sum of counts at the locus, A = number of alleles at the locus.
Parameter mapping from existing MCMC state:
- π_s = eps_neg_s × max_eps_neg (dropout probability)
- λ_s = eps_pos_s × max_eps_pos × (N / A) (noise rate for FP mechanism)
Present-allele emissions use a two-component Poisson mixture marginalized in log-space (log-sum-exp). Absent-allele emissions use a single Poisson. v1 conditions on observed total depth N rather than modeling N separately.
The MCMC chain holds a unique pointer (or equivalent) to the selected observation model, constructed from a runtime observation_model parameter ("binary" | "counts"), default "binary".
Data loading module
Extend long-form loading with:
aggregate: "binary" (default) | "count".
- Optional
reads column on long-form input; default 1 per row when absent.
Binary aggregation: allele present → 1, absent → 0 (current behavior).
Count aggregation: sum reads per (sample_id, locus, allele).
Wide delimited loading passes aggregation through to long-form loader.
Validation:
- Reject or error when
aggregate = "binary" and duplicate (sample, locus, allele) rows exist without aggregation (or warn and sum—decision: error with clear message preferred for binary mode).
- Reject count barcodes (any value > 1) when
observation_model = "binary".
- Reject binary semantics when
observation_model = "counts" and data are strictly 0/1 without documented threshold—at minimum, allow 0/1 counts but document that counts > 1 require count mode.
Internal data structure remains list with data (locus-indexed list of per-sample barcode vectors), sample_ids, loci, is_missing; barcode integers become non-negative counts in count mode.
Genotyping data / observed COI
Genotyping data container continues to store per-(sample, locus, allele) integers. Update naive observed COI computation: for count mode, count alleles with n_a > 0 per locus, take max over loci (not sum of read depths).
Clustering initialization and Jaccard similarity: allele present if count > 0.
Chain and sampler integration
calculate_observation_likelihood delegates to observation model log-likelihood.
All call sites of sample_latent_genotype pass through observation model (or observation model provides latent proposal).
calc_transmission_process, transmission updates, population frequency updates, COI updates structure unchanged; only observation likelihood and latent resampling paths differ by model.
Latent genotype storage format unchanged (sorted allele indices, -1 padding).
Parameters and R API
Add observation_model to MCMC arguments and C++ Parameters, default "binary".
Add aggregate to data loading functions, default "binary".
Existing eps_pos, eps_neg, Beta priors, max_eps_pos, max_eps_neg retained; no new hyperparameters in v1.
run_mcmc passes observation_model through to Rcpp unchanged.
Phased delivery (recommended PR sequence)
- Observation model interface + binary refactor (no user-visible change).
- Count data loading + validation.
- Count Poisson log-likelihood.
- Count-aware latent genotype sampler.
- Simulator, docs, integration tests.
Each phase independently mergeable; phases 3–4 depend on 2.
Numerical details
- Poisson log-PMF via stable lgamma formulation consistent with existing sampler math.
- Mixture for present alleles: log((1-π)·Pois(n|μ) + π·Pois(n|λ/A)).
- Floor μ and λ/A at small positive epsilon to avoid log(0).
Testing Decisions
What makes a good test: Exercise externally observable behavior—loaded data shapes, likelihood values on fixed inputs, MCMC smoke runs, simulation–recovery trends—not internal class names or private helpers. Prefer hand-computed likelihood cases where closed forms are easy. Keep MCMC tests minimal (very low burnin/samples) to avoid CI timeouts; skip on CRAN consistent with existing test-mcmc.R.
Modules to test:
| Module |
Test type |
Prior art |
| Observation model interface + binary wrapper |
C++ unit tests |
cpp/tests/multivector_tests.cpp, cpp/tests/zeta_mobius_tests.cpp style |
| Count Poisson log-likelihood |
C++ unit tests with hand-computed log probs |
New observation_model test file |
Data loading (aggregate, reads) |
R testthat |
tests/testthat/test-utils.R |
| Latent genotype sampler (count weights) |
C++ or R tests on small alleles + fixed counts |
Sampler logic tests (new) |
| End-to-end MCMC count mode |
R testthat smoke |
tests/testthat/test-mcmc.R |
| Simulator counts |
R testthat |
R/simulator.R / existing simulator tests if any |
Regression: After binary refactor, verify binary observation likelihood matches pre-refactor values on fixed fixtures.
Optional: Simulation–recovery test with short MCMC on simulated count data; qualitative check that posteriors are sensible—not strict tolerance on stochastic output.
Out of Scope
- Copy-number multiset latent state (per-strain copy counts per allele).
- HMC or NUTS for continuous parameters.
- Explicit marginalization over all support sets S (enumeration alternative to auxiliary latent sampling).
- NegBin overdispersion for signal or noise (Poisson-only in v1).
- Per-locus efficiency factors, per-allele artifact rates, or hierarchical calibration from control samples.
- Modeling total depth N as a separate stochastic node (v1 conditions on observed N).
- Changes to transmission likelihood, P(any missing), relatedness formulation, or population mixture structure.
- Automatic allele calling or threshold selection from counts (counts are input; model replaces post-hoc thresholding, not upstream variant calling).
- Breaking changes to default binary workflow.
Further Notes
The count observation model is designed around preserving the latent genotype as a support set of distinct alleles bounded by COI, because transmission likelihood already marginalizes strain assignments and relatedness given that set. Read counts inform membership in S through the observation layer only; they do not require abandoning the COI-friendly transmission formulation.
Future v2 extensions (not this PRD): NegBin emissions, per-locus noise/dropout hierarchies, empirical noise calibration from mono-infection controls, optional min_reads for observed-COI initialization.
Default remains observation_model = "binary" and aggregate = "binary" so manuscript-era analyses and existing scripts are unaffected.
Problem Statement
moire's current observation process treats genotyping data as binary presence/absence of alleles at each locus, typically derived from read counts via an upstream thresholding step. This discards information in the number of reads attributable to each allele. The model compensates with a simple per-allele error process (false positives and false negatives, scaled by locus width), but that process cannot distinguish a high-confidence allele call from a low-depth artifact, nor model dropout (amplification failure on truly present alleles) separately from contamination and sequencing-artifact false positives. Analysts who have raw or aggregated read counts must collapse them to presence/absence before running moire, losing signal that could improve estimates of complexity of infection (COI), population allele frequencies, and within-host relatedness.
Solution
Add an alternative, opt-in count-based observation model that uses per-allele read counts as the primary input. False positives are modeled as noise reads on truly absent alleles (contamination / sequencing artifact). False negatives are modeled as dropout on truly present alleles (present in the latent genotype support set but emitting noise-level reads). The latent genotype remains a set of distinct alleles whose cardinality is bounded by COI—the same formulation that keeps transmission likelihood tractable with respect to COI. Transmission, relatedness, population structure, and COI priors are unchanged. Users select the observation model at MCMC runtime; data loading supports aggregating long-form input to per-allele counts instead of binary barcodes. The existing binary observation model remains the default for backward compatibility.
User Stories
As a moire user with amplicon read-count data, I want to load per-allele counts directly, so that I do not have to apply a hard presence/absence threshold before inference.
As a moire user, I want to choose between binary and count observation models when running MCMC, so that I can compare approaches without changing my downstream analysis workflow.
As a moire user on the default binary path, I want existing behavior unchanged, so that published analyses and scripts continue to work without modification.
As a moire user loading long-form data, I want an optional
readscolumn (defaulting to one read per row), so that I can supply counts either as repeated rows or explicit counts.As a moire user loading long-form data, I want an
aggregateoption to produce count barcodes instead of binary barcodes, so that the internal representation matches the count observation model.As a moire user loading delimited wide data, I want count aggregation to apply after pivoting to long form, so that wide-format workflows stay supported.
As a moire user running the count model, I want false positive reads modeled as Poisson noise on absent alleles, so that low-depth artifact alleles are treated as uncertain rather than definitive calls.
As a moire user running the count model, I want dropout modeled as a mixture on present alleles (signal vs noise), so that truly present alleles with zero or few reads remain plausible.
As a moire user, I want count-model FP and dropout rates to reuse the existing per-sample
eps_posandeps_negMCMC parameters and Beta priors, so that tuning and interpretation stay familiar.As a moire user, I want
eps_posmapped to a noise rate scaled by total locus depth and allele count, so that the FP parameter retains a per-locus expected-error interpretation similar to the binary model.As a moire user, I want
eps_negmapped to a dropout probability on present alleles, so that the FN parameter retains a dropout interpretation similar to the binary model.As a moire developer, I want observation likelihood isolated behind a small interface, so that binary and count models can be tested and extended without scattering conditionals through the MCMC chain.
As a moire developer, I want the binary observation logic moved behind that interface without behavior change, so that the refactor is safe before adding count logic.
As a moire developer, I want count observation likelihood computed per (sample, locus) from the latent support set and count vector, so that parallel observation likelihood evaluation stays embarrassingly parallel.
As a moire developer, I want the latent genotype sampler updated to use allele-specific FP/FN weights derived from count emissions, so that MH proposals for COI and sample-level parameters remain valid with correct adjustment terms.
As a moire developer, I want the combinatorial structure of latent genotype sampling preserved (FP/FN count constraints, uniform choice over combinations, log-prob correction), so that COI constraints and at-least-one-allele constraints still hold.
As a moire developer, I want transmission likelihood and P(any missing) logic untouched, so that COI marginalization properties of the current model are preserved.
As a moire developer, I want
observed_coiinitialization to use alleles with positive read counts (not sum of depths), so that naive COI hints remain sensible in count mode.As a moire user validating models, I want a count-aware simulator that generates read counts from true strain allele allocations, so that I can run simulation–recovery tests.
As a moire user, I want clear errors when count barcodes are paired with the binary model or binary barcodes with the count model, so that misconfiguration is caught early.
As a moire user converting data back to long form, I want count-aware round-trip behavior, so that export and visualization reflect read counts where applicable.
As a moire developer, I want C++ unit tests for count log-likelihood on hand-computed cases, so that numerical correctness is verified without full MCMC runs.
As a moire developer, I want R integration tests that run short MCMC with count data, so that the Rcpp path and MultiVector layout remain compatible.
As a moire user reading documentation, I want
run_mcmcand data-loading docs to describeobservation_modeland count aggregation, so that I can adopt the feature without reading C++.As a moire user, I want a NEWS entry describing the new observation model, so that I know what changed in the release.
As a moire developer, I want mixture log-likelihoods for present alleles evaluated with log-sum-exp, so that underflow does not corrupt inference.
As a moire developer, I want Poisson means floored at a small epsilon, so that zero-depth loci do not produce numerical errors.
As a moire user with missing loci, I want missing loci to contribute zero observation likelihood as today, so that the missing-data mechanism is unchanged.
As a moire user with uninformative loci (single allele), I want them removed at load time as today, so that data hygiene is consistent across observation models.
As a moire developer, I want Jaccard similarity and clustering-based initialization to treat an allele as present when its count is greater than zero, so that initialization heuristics remain coherent in count mode.
Implementation Decisions
Observation model module (deep module)
Introduce an observation-model abstraction with a narrow interface:
sample_latent_genotype).Two implementations:
Binary observation model — encapsulates current TP/TN/FP/FN independent Bernoulli error logic with per-allele rates scaled by locus allele count. No change to inferred results after refactor.
Count Poisson observation model (v1) — v1 emission spec:
Where N = sum of counts at the locus, A = number of alleles at the locus.
Parameter mapping from existing MCMC state:
Present-allele emissions use a two-component Poisson mixture marginalized in log-space (log-sum-exp). Absent-allele emissions use a single Poisson. v1 conditions on observed total depth N rather than modeling N separately.
The MCMC chain holds a unique pointer (or equivalent) to the selected observation model, constructed from a runtime
observation_modelparameter ("binary"|"counts"), default"binary".Data loading module
Extend long-form loading with:
aggregate:"binary"(default) |"count".readscolumn on long-form input; default 1 per row when absent.Binary aggregation: allele present → 1, absent → 0 (current behavior).
Count aggregation: sum
readsper (sample_id, locus, allele).Wide delimited loading passes aggregation through to long-form loader.
Validation:
aggregate = "binary"and duplicate (sample, locus, allele) rows exist without aggregation (or warn and sum—decision: error with clear message preferred for binary mode).observation_model = "binary".observation_model = "counts"and data are strictly 0/1 without documented threshold—at minimum, allow 0/1 counts but document that counts > 1 require count mode.Internal data structure remains list with
data(locus-indexed list of per-sample barcode vectors),sample_ids,loci,is_missing; barcode integers become non-negative counts in count mode.Genotyping data / observed COI
Genotyping data container continues to store per-(sample, locus, allele) integers. Update naive observed COI computation: for count mode, count alleles with n_a > 0 per locus, take max over loci (not sum of read depths).
Clustering initialization and Jaccard similarity: allele present if count > 0.
Chain and sampler integration
calculate_observation_likelihooddelegates to observation model log-likelihood.All call sites of
sample_latent_genotypepass through observation model (or observation model provides latent proposal).calc_transmission_process, transmission updates, population frequency updates, COI updates structure unchanged; only observation likelihood and latent resampling paths differ by model.Latent genotype storage format unchanged (sorted allele indices, -1 padding).
Parameters and R API
Add
observation_modelto MCMC arguments and C++ Parameters, default"binary".Add
aggregateto data loading functions, default"binary".Existing
eps_pos,eps_neg, Beta priors,max_eps_pos,max_eps_negretained; no new hyperparameters in v1.run_mcmcpassesobservation_modelthrough to Rcpp unchanged.Phased delivery (recommended PR sequence)
Each phase independently mergeable; phases 3–4 depend on 2.
Numerical details
Testing Decisions
What makes a good test: Exercise externally observable behavior—loaded data shapes, likelihood values on fixed inputs, MCMC smoke runs, simulation–recovery trends—not internal class names or private helpers. Prefer hand-computed likelihood cases where closed forms are easy. Keep MCMC tests minimal (very low burnin/samples) to avoid CI timeouts; skip on CRAN consistent with existing
test-mcmc.R.Modules to test:
cpp/tests/multivector_tests.cpp,cpp/tests/zeta_mobius_tests.cppstyleobservation_modeltest fileaggregate,reads)tests/testthat/test-utils.Rtests/testthat/test-mcmc.RR/simulator.R/ existing simulator tests if anyRegression: After binary refactor, verify binary observation likelihood matches pre-refactor values on fixed fixtures.
Optional: Simulation–recovery test with short MCMC on simulated count data; qualitative check that posteriors are sensible—not strict tolerance on stochastic output.
Out of Scope
Further Notes
The count observation model is designed around preserving the latent genotype as a support set of distinct alleles bounded by COI, because transmission likelihood already marginalizes strain assignments and relatedness given that set. Read counts inform membership in S through the observation layer only; they do not require abandoning the COI-friendly transmission formulation.
Future v2 extensions (not this PRD): NegBin emissions, per-locus noise/dropout hierarchies, empirical noise calibration from mono-infection controls, optional
min_readsfor observed-COI initialization.Default remains
observation_model = "binary"andaggregate = "binary"so manuscript-era analyses and existing scripts are unaffected.