This repository contains datasets for the rDNA assembly and analysis, using the individual PAN027 (HG06807) from the washu pedigree. The project aims to assemble as complete diploid rDNA arrays as possible, given the current state of technologies and algorithms. The datasets below were predominantly generated in Miga lab at UCSC, in collaborations with WashU.
- Whole-genome sequencing (WGS) + Adaptive sampling (AS) Dataset (ONT): Access via Google Drive
- Whole-genome sequencing using high-throughput flow cells (HTFC) Dataset (ONT): Access via Google Drive
| Dataset | Number of reads | Coverage | >Q15 (QV) | >Q30 (QV) | Sequencing Date |
|---|---|---|---|---|---|
| WGS | 14,922 | 27x | 92.8% | 0.8% | 06/2025 |
| AS | 13,774 | 29x | 92.4% | 2.1% | 08/2025 |
| HTFC | 18,569 | 35x | 96.3% | 2.3% | 11/2025 |
| WGS + AS | 28,696 | 56x | 92.6% | 1.4% | - |
| WGS + AS + HTFC | 47,265 | 91x | 94.1% | 1.8% | - |
These datasets were basecalled using a new experimental ONT model called hyperbasecalling, which offers higher precision compared to the standard SUP (Super Accuracy) model. However, please note that this increased precision comes with a computational cost, as hyperbasecalling is approximately 10x slower than SUP.
To identify reads containing rDNA units, we aligned the sequencing data against an rDNA array reference (10x KY962518-ROT reference) using minimap2 version 2.30-r1287. This method outperformed single-unit reference mapping (one KY962518-ROT), yielding significantly higher coverage.
Chromosome 14 maternal (active) and paternal (inactive) rDNA arrays.
| maternal rDNA array | paternal rDNA array |
|---|---|
| chr14 maternal | chr14 paternal |
- rDNA HiFi: Access via Google drive
- rRNA ONT: Access via S3 bucket