Comprehensive test dataset registry for the OmniBioAI platform — covers 35+ bioinformatics domains across 613 workflow bundles, all validated at 100% pass rate on ARM64 (NVIDIA DGX Spark).
Actual data files are not stored in this repository — they are sourced from public databases (SRA, 10x Genomics, GENCODE, NCBI) via the provided download script. This repo tracks the metadata, bundle map, validation script, and test run results.
| Domain | Data Source | Size |
|---|---|---|
| RNA-seq | SRA (subsampled) | ~130 MB |
| Single Cell (scRNA-seq) | 10x PBMC 1k v3 | ~5 GB |
| Spatial Transcriptomics (Xenium) | 10x Genomics public | ~19 GB |
| Spatial Transcriptomics (Visium) | 10x Genomics public | ~1 MB |
| WGS | NA12878 (GIAB) | ~800 MB |
| WES | NA12878 (GIAB) | ~800 MB |
| Structural Variants | NA12878 (GIAB) | ~800 MB |
| ctDNA Analysis | NA12878 (GIAB) | ~800 MB |
| ATAC-seq | SRR5799393 | ~2.4 GB |
| ChIP-seq | SRR227563 | ~2.0 GB |
| Methylation | SRR3368180 | ~116 MB |
| Epigenomics | H3K27ac ChIP | ~5 MB |
| Long Read (Nanopore) | Synthetic/public | ~27 MB |
| Microbiome (16S) | SRR2822457 | ~11 GB |
| Metagenomics | Community reads | ~3 MB |
| Proteomics | TMT Erwinia mzXML | ~232 MB |
| CRISPR | Demo library | ~56 KB |
| Multimodal (CITE-seq) | Public | ~7 MB |
| Multiomics (RNA+ATAC) | Public | ~4 MB |
| Clinical | NA12878 variants | ~8 KB |
| Drug Discovery | Compounds CSV | ~8 KB |
| Foundation Models | Sequences FASTA | ~56 KB |
| Target Identification | Variants VCF | ~8 KB |
Additional domains using shared data: circRNA, lncRNA, miRNA-seq, Ribo-seq, ancient DNA, HLA typing, pharmacogenomics, immune deconvolution, Hi-C, pangenome, genome assembly, population genetics, polygenic risk score, variant ML, clinical trial matching, biological knowledge graph.
| Resource | Source | Size |
|---|---|---|
| hg38 chr21+22 FASTA | UCSC | ~90 MB |
| hg38 full genome | UCSC | ~900 MB |
| mm10 reference | UCSC/Ensembl | ~800 MB |
| GENCODE v44 GTF (human) | GENCODE | ~40 MB |
| GENCODE vM25 GTF (mouse) | GENCODE | ~40 MB |
| STAR index hg38 chr21+22 | Built locally | ~700 MB |
| BWA index hg38 chr21+22 | Built locally | ~250 MB |
| Bismark index hg38 chr21+22 | Built locally | ~90 MB |
| CellRanger GRCh38 2020-A | 10x Genomics | ~1 GB |
| Kraken2 standard DB | Kraken2 | ~8 GB |
| SILVA 138 16S sequences | SILVA | ~90 MB |
| dbSNP VCF | NCBI | ~15 GB |
| ClinVar VCF | NCBI | ~77 MB |
omnibioai-test-data/
├── bundle_data_map.json # Maps workflow bundles → input files and reference data
├── validate_test_data.sh # Validates local test data completeness and sizes
├── download_test_data.sh # Downloads all public datasets from SRA, 10x, NCBI, GENCODE
├── logs/
│ ├── master_report.md # Original test run — 613 bundles, 99.5% pass rate
│ ├── master_report_v2.md # Updated after all fixes — 613 bundles, 100% pass rate
│ ├── all_batches_run.log # Full execution log
│ ├── download_summary.json # Download provenance and checksums
│ ├── retest_results.json # Results of retested bundles
│ └── bundle_batches.json # Batch grouping configuration
└── README.md
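Since `bundle_data_map.json` ties each bundle to its input files and reference data, it can be used to check which bundles are runnable before starting a test run. The sketch below is illustrative only: the real schema of the file is not documented here, so the layout (a top-level object keyed by bundle name, with `inputs` and `references` lists of relative paths) and the function name `find_missing_inputs` are assumptions.

```python
import json
from pathlib import Path


def find_missing_inputs(data_map: dict, data_root: Path) -> dict:
    """Return {bundle_name: [missing relative paths]} for incomplete bundles.

    ASSUMPTION: each entry looks like
        {"inputs": ["path/a"], "references": ["path/b"]}
    The real bundle_data_map.json may use different keys.
    """
    missing = {}
    for bundle, entry in data_map.items():
        paths = list(entry.get("inputs", [])) + list(entry.get("references", []))
        absent = [p for p in paths if not (data_root / p).exists()]
        if absent:
            missing[bundle] = absent
    return missing


def load_data_map(path: Path) -> dict:
    """Read the bundle map from disk as a plain dictionary."""
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

A typical call would be `find_missing_inputs(load_data_map(Path("bundle_data_map.json")), Path("."))`, adjusting the data root to wherever the download script placed the files.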
| Metric | Value |
|---|---|
| Total bundles tested | 613 |
| Pass rate | 100% |
| Domains covered | 35+ |
| Batches | 31 |
| Platform | NVIDIA DGX Spark (Grace Blackwell, aarch64) |
| Execution backends | Local Docker, Singularity/Apptainer |
| Reference genome | hg38 chr21+22 (subset for speed) |
git clone https://github.com/man4ish/omnibioai-test-data.git
cd omnibioai-test-data
# Download everything (~65 GB total)
bash download_test_data.sh
# Download a specific domain only
bash download_test_data.sh --domain rnaseq
bash download_test_data.sh --domain wgs
bash download_test_data.sh --domain singlecell

bash validate_test_data.sh

Expected output:
=== SUMMARY ===
PASS: 35
FAIL/MISSING: 0
Total size: ~65G
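The summary block above is easy to check by machine, for example to gate a CI job on a clean validation. The following is a minimal sketch that parses exactly the three summary lines shown; the function name `parse_summary` is hypothetical, and it assumes the script prints those lines in this form.

```python
import re


def parse_summary(text: str) -> dict:
    """Extract the PASS and FAIL/MISSING counts and the total size string."""
    patterns = {
        "passed": r"^PASS:\s*(\d+)\s*$",
        "failed": r"^FAIL/MISSING:\s*(\d+)\s*$",
        "total_size": r"^Total size:\s*(\S+)\s*$",
    }
    result = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text, flags=re.MULTILINE)
        if match is None:
            raise ValueError(f"summary line for {key!r} not found")
        value = match.group(1)
        # Counts become integers; the size keeps its unit suffix as text.
        result[key] = value if key == "total_size" else int(value)
    return result


sample = """=== SUMMARY ===
PASS: 35
FAIL/MISSING: 0
Total size: ~65G
"""
summary = parse_summary(sample)
print(summary)  # {'passed': 35, 'failed': 0, 'total_size': '~65G'}
```

In practice the text would come from the captured output of `bash validate_test_data.sh`, and a nonzero `failed` count would fail the job.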
Once data is downloaded, run the full bundle test suite via OmniBioAI TES:
# Run all bundles
python omnibioai-workflow-bundles/scripts/run_bundle_tests.py \
--data-map bundle_data_map.json \
--output logs/
# Run a specific domain
python omnibioai-workflow-bundles/scripts/run_bundle_tests.py \
--domain rnaseq \
--data-map bundle_data_map.json

All test data is sourced from public repositories:
| Dataset | Accession / URL |
|---|---|
| NA12878 WGS/WES | GIAB |
| PBMC 1k v3 scRNA-seq | 10x Genomics |
| Xenium spatial | 10x Genomics |
| Visium brain | 10x Genomics |
| ATAC-seq | SRR5799393 |
| ChIP-seq | SRR227563 |
| Methylation | SRR3368180 |
| Microbiome 16S | SRR2822457 |
| hg38 / mm10 | UCSC |
| GENCODE v44 | GENCODE |
| SILVA 138 | SILVA |
| dbSNP | NCBI |
| ClinVar | NCBI |
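Returning to the bundle test runner invoked earlier: when several domains need to be tested in sequence, the command line can be assembled programmatically. This is a sketch only. It uses the `--domain`, `--data-map`, and `--output` flags shown in the commands above; the helper `build_command` is hypothetical, and it is an assumption that the runner accepts the same domain names as the download script (`wgs`, `singlecell`) in addition to `rnaseq`.

```python
RUNNER = "omnibioai-workflow-bundles/scripts/run_bundle_tests.py"


def build_command(domain=None, data_map="bundle_data_map.json", output=None):
    """Build the argument list for one run of the bundle test script."""
    cmd = ["python", RUNNER]
    if domain:
        cmd += ["--domain", domain]
    cmd += ["--data-map", data_map]
    if output:
        cmd += ["--output", output]
    return cmd


# Domain names other than "rnaseq" are assumed, not confirmed for this script.
for domain in ("rnaseq", "wgs", "singlecell"):
    print(" ".join(build_command(domain)))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` so that a failing domain stops the loop.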
- omnibioai-workflow-bundles: the 613 workflow bundles being tested
- omnibioai-tool-images: ARM64 Docker/Singularity images used by bundles
- omnibioai-tes: Task Execution Service that runs the bundles
- omnibioai-data: runtime data directory