This repository estimates copy number variation (CNV) from mad4hatter sequencing output. Use estCNV.qmd to run the analysis on your data.
- mad4hatter results: Process your sequencing data with mad4hatter first. You need one or more run directories, each containing `allele_data.txt`.
- R: Install R.
- R packages: Install the packages loaded at the top of `estCNV.qmd`: `tidyverse`, `mgcv`, `ggbeeswarm`, `gridExtra`, `readxl`.
- Clone or download this repository.
- Open `estCNV.qmd` in RStudio, VS Code, or your preferred editor.
- Set the user variables at the top of the document (see Configuration below).
- Render the document or run the chunks interactively in R.
- Review the HTML report and output CSV files.
Set allele.data.path to the parent directory that contains your mad4hatter run folders. The script searches recursively for allele_data.txt files.
Each run folder name becomes the Run identifier and must match the Run column in your manifest.
allele_data.txt must include these columns:
| Column | Description |
|---|---|
| `sample_name` | Sample identifier |
| `target_name` | Amplicon target name |
| `reads` | Read count per allele |
| `pool` | Primer pool used for the target |
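
As a minimal sketch of the lookup described above, the following base R snippet finds every `allele_data.txt` under a parent directory, takes the name of the containing folder as the Run identifier, and reports any required column that is missing. The helper names `find_allele_files` and `check_allele_columns` are hypothetical and are not defined in `estCNV.qmd`. Reading the file as tab-delimited is also an assumption. The demo builds two mock run folders in a temporary directory.

```r
# Sketch only: both helpers are hypothetical, not part of estCNV.qmd.
find_allele_files <- function(allele.data.path) {
  files <- list.files(allele.data.path, pattern = "^allele_data\\.txt$",
                      recursive = TRUE, full.names = TRUE)
  # The folder that directly contains each file is taken as the Run identifier.
  data.frame(Run = basename(dirname(files)), path = files,
             stringsAsFactors = FALSE)
}

check_allele_columns <- function(df) {
  required <- c("sample_name", "target_name", "reads", "pool")
  setdiff(required, names(df))  # empty when no required column is missing
}

# Demo: two mock runs in a temporary directory (assumed tab-delimited files).
root <- file.path(tempdir(), "m4h_demo")
for (run in c("run_A", "run_B")) {
  dir.create(file.path(root, run), recursive = TRUE, showWarnings = FALSE)
  write.table(
    data.frame(sample_name = "s1", target_name = "t1", reads = 10, pool = "D1.1"),
    file.path(root, run, "allele_data.txt"),
    sep = "\t", row.names = FALSE, quote = FALSE
  )
}

runs <- find_allele_files(root)
first_run <- read.delim(runs$path[1], stringsAsFactors = FALSE)
missing_cols <- check_allele_columns(first_run)
```

Comparing `runs$Run` against the `Run` column of the manifest before rendering can catch folder-name mismatches early.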
Provide a sample manifest as CSV, TSV, or Excel (.xlsx). An example is in madh_utilities_estCNV/manifest_example.csv.
Required columns:
| Column | Description |
|---|---|
| `sample_name` | Sample ID. May be repeated across batches. Must match `allele_data.txt` after optional name standardisation. |
| `CNVControl` | `TRUE` if the sample is a control known not to have a CNV; otherwise `FALSE`. |
| `Batch` | Library preparation batch. Samples prepped together share a batch. |
| `SuperBatch` | Group of batches where controls and samples are comparable. Correction factors are calculated per SuperBatch. |
| `Run` | Sequencing run ID. Must match the mad4hatter output folder name containing that sample's `allele_data.txt`. |
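
The manifest requirements above can be checked before running the analysis. The sketch below is not the validation performed by `estCNV.qmd`; `validate_manifest` is a hypothetical helper. The rule that every SuperBatch contains at least one control is an assumption, inferred from the fact that correction factors are calculated per SuperBatch.

```r
# Sketch only: validate_manifest() is a hypothetical helper.
# It returns a character vector of problems (empty when none are found).
validate_manifest <- function(manifest, run_names) {
  required <- c("sample_name", "CNVControl", "Batch", "SuperBatch", "Run")
  missing_cols <- setdiff(required, names(manifest))
  if (length(missing_cols) > 0) {
    return(paste("Missing column:", missing_cols))
  }
  problems <- character(0)
  if (!is.logical(manifest$CNVControl) || anyNA(manifest$CNVControl)) {
    problems <- c(problems, "CNVControl must be TRUE or FALSE for every sample")
  } else {
    # Assumption: each SuperBatch needs at least one CNV-negative control.
    has_control <- tapply(manifest$CNVControl,
                          as.character(manifest$SuperBatch), any)
    no_control <- names(has_control)[!has_control]
    if (length(no_control) > 0) {
      problems <- c(problems, paste("SuperBatch without a control:", no_control))
    }
  }
  # Every Run value must correspond to a mad4hatter output folder name.
  unknown_runs <- setdiff(unique(as.character(manifest$Run)), run_names)
  if (length(unknown_runs) > 0) {
    problems <- c(problems, paste("Run without a matching folder:", unknown_runs))
  }
  problems
}

manifest <- data.frame(
  sample_name = c("s1", "s2", "ctrl1"),
  CNVControl  = c(FALSE, FALSE, TRUE),
  Batch       = c("B1", "B1", "B1"),
  SuperBatch  = c("SB1", "SB1", "SB1"),
  Run         = c("run_A", "run_A", "run_B"),
  stringsAsFactors = FALSE
)
ok_result <- validate_manifest(manifest, run_names = c("run_A", "run_B"))

bad <- manifest
bad$Run[3] <- "run_C"  # no folder with this name exists
bad_result <- validate_manifest(bad, run_names = c("run_A", "run_B"))
```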
The analysis uses the panel_information/ directory bundled with this repository. Pool names in your allele_data.txt must match subdirectory names under panel_information/.
For each pool in your data, the script expects:
```
panel_information/
├── validated_target_filter.tsv          # Targets included in CNV analysis
├── D1.1/
│   ├── D1.1_amplicon_info.tsv           # Amplicon coordinates and target metadata
│   └── D1.1_target_of_interest.tsv      # CNV targets of interest, grouped by gene
├── R1.2/
│   ├── R1.2_amplicon_info.tsv
│   └── R1.2_target_of_interest.tsv
└── ...
```
- `{pool}_amplicon_info.tsv`: Defines amplicon coordinates and target names for each pool.
- `{pool}_target_of_interest.tsv`: Lists CNV targets of interest and their gene groups (e.g. `HRP2`, `HRP3`, `PM2`).
- `validated_target_filter.tsv`: Subset of targets used for filtering and normalisation. Only targets present in both this file and your pools' amplicon info are analysed.
All sequencing runs in a single analysis must use the same set of pools.
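
To confirm that the pool names in your data line up with this layout, a check along the following lines may help. It is a sketch, and `check_panel_files` is a hypothetical helper rather than a function in `estCNV.qmd`. It builds the two expected file paths for each pool and returns those that do not exist. The demo creates a mock panel directory that contains only `D1.1`.

```r
# Sketch only: check_panel_files() is a hypothetical helper.
check_panel_files <- function(pools, panel.info.dir) {
  expected <- as.character(unlist(lapply(pools, function(p) {
    file.path(panel.info.dir, p,
              paste0(p, c("_amplicon_info.tsv", "_target_of_interest.tsv")))
  })))
  expected[!file.exists(expected)]  # paths that are missing on disk
}

# Demo: a mock panel directory with files for D1.1 only.
panel_dir <- file.path(tempdir(), "panel_demo")
dir.create(file.path(panel_dir, "D1.1"), recursive = TRUE, showWarnings = FALSE)
created <- file.create(file.path(
  panel_dir, "D1.1",
  c("D1.1_amplicon_info.tsv", "D1.1_target_of_interest.tsv")
))

# Pools as they would appear in the pool column of allele_data.txt.
missing_files <- check_panel_files(c("D1.1", "R1.2"), panel_dir)
```

Here `missing_files` lists the two `R1.2` paths, which signals that the pool name either is misspelled in the data or has no matching subdirectory.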
Edit the parameter chunks near the top of estCNV.qmd. You will likely only need to change allele.data.path and manifest.path:
```r
allele.data.path = "/path/to/mad4hatter/outputs/"
manifest.path = "path/to/manifest.csv"
panel.info.dir = "panel_information/"
validated.target.filter.path = "panel_information/validated_target_filter.tsv"
standardise_sample_name = TRUE  # Trim sequencing suffixes from sample names
use.controls.only = FALSE       # Use only controls (TRUE) or all samples (FALSE) for median correction
```

Adjust these thresholds as needed:
| Parameter | Default | Description |
|---|---|---|
| `reaction.threshold.per.target` | 20 | Minimum mean reads per target within a reaction |
| `Pools.ratio` | 100 | Maximum fold difference allowed between the two reactions |
| `target.read.threshold` | 50 | Minimum reads per locus |
| `ntarget.proportion.threshold` | 0.8 | Minimum proportion of amplicons meeting `target.read.threshold` |
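
To illustrate how the last two thresholds might interact, the sketch below keeps a sample only when a large enough share of its targets reaches the read threshold. This is an interpretation of the parameter descriptions, not the filtering code in `estCNV.qmd`, which may apply the thresholds per reaction or in a different order. The read counts are made up for the example.

```r
# Sketch only: a simplified reading of the two thresholds, on mock data.
target.read.threshold <- 50
ntarget.proportion.threshold <- 0.8

reads <- data.frame(
  sample_name = rep(c("s1", "s2"), each = 5),
  target_name = rep(paste0("t", 1:5), times = 2),
  reads = c(120, 80, 60, 55, 51,   # s1: 5 of 5 targets reach 50 reads
            200, 30, 20, 70, 5),   # s2: 2 of 5 targets reach 50 reads
  stringsAsFactors = FALSE
)

# Proportion of targets per sample with enough reads.
prop_ok <- tapply(reads$reads >= target.read.threshold, reads$sample_name, mean)

# Samples that pass the proportion threshold.
keep <- names(prop_ok)[prop_ok >= ntarget.proportion.threshold]
filtered <- reads[reads$sample_name %in% keep, ]
```

Lowering `ntarget.proportion.threshold` retains more low-depth samples at the cost of noisier estimates, so it is worth inspecting the QC plots after any change.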
- Read and validate the manifest.
- Load `allele_data.txt` from all runs and detect primer pools.
- Load amplicon info and targets of interest from `panel_information/` for each pool.
- Filter samples and targets by read depth and validated target list.
- Estimate CNV fold changes using a GAM-based normalisation (`estCNV` function).
- Use CNV-negative controls within each SuperBatch to calculate correction factors.
- Apply correction factors and produce final per-gene fold changes.
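
The control-based correction in the last two steps can be pictured with a small sketch. It assumes that the correction factor is the median fold change of the reference samples within each SuperBatch and that correction means dividing by that factor. Both are assumptions based on the phrase "median correction" and may differ from what `estCNV.qmd` actually does. The values and column names are invented for illustration.

```r
# Sketch only: assumed median-based correction per SuperBatch, on mock data.
use.controls.only <- TRUE

fc <- data.frame(
  sample_name = c("ctrl1", "ctrl2", "s1", "ctrl3", "s2"),
  SuperBatch  = c("SB1", "SB1", "SB1", "SB2", "SB2"),
  CNVControl  = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  fold_change = c(0.9, 1.1, 2.0, 1.2, 1.2),
  stringsAsFactors = FALSE
)

# Reference set: controls only, or every sample.
ref <- if (use.controls.only) fc[fc$CNVControl, ] else fc

# One correction factor per SuperBatch (median of the reference fold changes).
factors <- tapply(ref$fold_change, ref$SuperBatch, median)

# Divide each fold change by the factor of its own SuperBatch.
fc$corrected <- as.numeric(fc$fold_change / factors[fc$SuperBatch])
```

Under these assumptions, a sample that merely shares a batch-wide shift with its controls (`s2`) is pulled back to about 1, while a sample that stands out from its controls (`s1`) keeps an elevated fold change.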
Review the control plots in the rendered report before interpreting sample results. You may need to adjust Batch or SuperBatch assignments in the manifest if controls cluster unexpectedly.
Results are written to allele.data.path:
| File | Description |
|---|---|
| `fold_changes_output.csv` | Fold changes per target group and sample |
| `fold_changes_final_output.csv` | Per-gene fold changes (maximum across target groups within each gene) |
The rendered HTML report includes QC plots, control diagnostics, and heatmaps of fold changes across samples.
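
The relationship between the two output files can be sketched as a per-gene maximum over target groups. The column names below (`gene`, `target_group`, `fold_change`) are hypothetical and may not match the headers in the actual CSV files. The values are invented.

```r
# Sketch only: hypothetical column names and mock values.
fold_changes <- data.frame(
  sample_name  = c("s1", "s1", "s1", "s2", "s2", "s2"),
  gene         = c("HRP2", "HRP2", "PM2", "HRP2", "HRP2", "PM2"),
  target_group = c("HRP2_a", "HRP2_b", "PM2_a", "HRP2_a", "HRP2_b", "PM2_a"),
  fold_change  = c(0.4, 0.6, 1.0, 1.0, 1.1, 2.1),
  stringsAsFactors = FALSE
)

# Keep the maximum fold change across target groups for each sample and gene.
per_gene <- aggregate(fold_change ~ sample_name + gene,
                      data = fold_changes, FUN = max)
```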
The panel_information/ directory includes amplicon and target-of-interest files for mad4hatter primer pools:
| Pool | Description |
|---|---|
| D1.1 (D1) | 165 high-diversity targets plus ldh loci for species identification |
| R1.1 | 82 resistance, species ID, and immune-related targets |
| R1.2 (R1) | Reduced R1 pool (47 targets) for improved sensitivity |
| R2.1 (R2) | 31 complementary resistance targets |
| M1.1 (M1) | PfPHAST minimal panel (41 targets) |
| M2.1 (M2) | PfPHAST complementary panel (15 targets) |
| M1.addon | Non-P. falciparum mitochondrial cytb targets for M1 |
For MADHatTeR, the recommended configuration is two mPCR reactions: one with D1 and R1.2 primers, and one with R2 primers. More protocol details are available from the EPPIcenter.
If your data uses different pool names (e.g. D1 instead of D1.1), ensure the pool column in allele_data.txt matches the directory names under panel_information/.