omnibioai-test-data

Comprehensive test dataset registry for the OmniBioAI platform. It covers 35+ bioinformatics domains across 613 workflow bundles, all validated at a 100% pass rate on ARM64 (NVIDIA DGX Spark).

Actual data files are not stored in this repository; they are fetched from public databases (SRA, 10x Genomics, GENCODE, NCBI) by the provided download script. This repo tracks the metadata, bundle map, download and validation scripts, and test run results.


Test Coverage

| Domain | Data Source | Size |
|---|---|---|
| RNA-seq | SRA (subsampled) | ~130 MB |
| Single Cell (scRNA-seq) | 10x PBMC 1k v3 | ~5 GB |
| Spatial Transcriptomics (Xenium) | 10x Genomics public | ~19 GB |
| Spatial Transcriptomics (Visium) | 10x Genomics public | ~1 MB |
| WGS | NA12878 (GIAB) | ~800 MB |
| WES | NA12878 (GIAB) | ~800 MB |
| Structural Variants | NA12878 (GIAB) | ~800 MB |
| ctDNA Analysis | NA12878 (GIAB) | ~800 MB |
| ATAC-seq | SRR5799393 | ~2.4 GB |
| ChIP-seq | SRR227563 | ~2.0 GB |
| Methylation | SRR3368180 | ~116 MB |
| Epigenomics | H3K27ac ChIP | ~5 MB |
| Long Read (Nanopore) | Synthetic/public | ~27 MB |
| Microbiome (16S) | SRR2822457 | ~11 GB |
| Metagenomics | Community reads | ~3 MB |
| Proteomics | TMT Erwinia mzXML | ~232 MB |
| CRISPR | Demo library | ~56 KB |
| Multimodal (CITE-seq) | Public | ~7 MB |
| Multiomics (RNA+ATAC) | Public | ~4 MB |
| Clinical | NA12878 variants | ~8 KB |
| Drug Discovery | Compounds CSV | ~8 KB |
| Foundation Models | Sequences FASTA | ~56 KB |
| Target Identification | Variants VCF | ~8 KB |

Additional domains using shared data: circRNA, lncRNA, miRNA-seq, Ribo-seq, ancient DNA, HLA typing, pharmacogenomics, immune deconvolution, Hi-C, pangenome, genome assembly, population genetics, polygenic risk score, variant ML, clinical trial matching, biological knowledge graph.


Reference Data

| Resource | Source | Size |
|---|---|---|
| hg38 chr21+22 FASTA | UCSC | ~90 MB |
| hg38 full genome | UCSC | ~900 MB |
| mm10 reference | UCSC/Ensembl | ~800 MB |
| GENCODE v44 GTF (human) | GENCODE | ~40 MB |
| GENCODE vM25 GTF (mouse) | GENCODE | ~40 MB |
| STAR index hg38 chr21+22 | Built locally | ~700 MB |
| BWA index hg38 chr21+22 | Built locally | ~250 MB |
| Bismark index hg38 chr21+22 | Built locally | ~90 MB |
| CellRanger GRCh38 2020-A | 10x Genomics | ~1 GB |
| Kraken2 standard DB | Kraken2 | ~8 GB |
| SILVA 138 16S sequences | SILVA | ~90 MB |
| dbSNP VCF | NCBI | ~15 GB |
| ClinVar VCF | NCBI | ~77 MB |

Repository Contents

omnibioai-test-data/
├── bundle_data_map.json      # Maps workflow bundles → input files and reference data
├── validate_test_data.sh     # Validates local test data completeness and sizes
├── download_test_data.sh     # Downloads all public datasets from SRA, 10x, NCBI, GENCODE
├── logs/
│   ├── master_report.md      # Original test run — 613 bundles, 99.5% pass rate
│   ├── master_report_v2.md   # Updated after all fixes — 613 bundles, 100% pass rate
│   ├── all_batches_run.log   # Full execution log
│   ├── download_summary.json # Download provenance and checksums
│   ├── retest_results.json   # Results of retested bundles
│   └── bundle_batches.json   # Batch grouping configuration
└── README.md
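
The schema of `bundle_data_map.json` is not shown in this README, so the following is only a sketch under an assumed layout: a JSON object keyed by bundle name, where each entry has optional `inputs` and `references` lists of paths relative to a local data directory. Those two key names, and the helper names, are hypothetical. It illustrates how the map could be used to spot bundles whose files have not been downloaded yet.

```python
import json
from pathlib import Path


def load_data_map(path: Path) -> dict:
    """Read the bundle map from disk."""
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def find_missing_files(data_map: dict, data_root: Path) -> dict:
    """Return {bundle_name: [missing relative paths]} for bundles with gaps.

    Assumes (hypothetically) that each bundle entry carries optional
    "inputs" and "references" lists of paths relative to data_root.
    """
    missing = {}
    for bundle, entry in data_map.items():
        paths = list(entry.get("inputs", [])) + list(entry.get("references", []))
        absent = [p for p in paths if not (data_root / p).exists()]
        if absent:
            missing[bundle] = absent
    return missing
```

If the real map uses different keys, only the two `entry.get(...)` lookups need to change.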

Test Results

| Metric | Value |
|---|---|
| Total bundles tested | 613 |
| Pass rate | 100% |
| Domains covered | 35+ |
| Batches | 31 |
| Platform | NVIDIA DGX Spark (Grace Blackwell, aarch64) |
| Execution backends | Local Docker, Singularity/Apptainer |
| Reference genome | hg38 chr21+22 (subset for speed) |
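
The repository listing describes an original run at a 99.5% pass rate and an updated run at 100%, both over 613 bundles. As a small worked check of what those percentages imply, the sketch below lists the pass counts consistent with each figure. It assumes the rates are rounded to one decimal place; the resulting counts are an inference from the reported percentages, not numbers taken from the reports themselves.

```python
def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * passed / total, 1)


def passes_matching(rate: float, total: int) -> list[int]:
    """All pass counts whose rounded rate equals the reported rate."""
    return [n for n in range(total + 1) if pass_rate(n, total) == rate]


print(passes_matching(99.5, 613))   # [610]
print(passes_matching(100.0, 613))  # [613]
```

Under that rounding assumption, 99.5% corresponds to 610 passing bundles (3 failures) and 100% to all 613.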

Setup

1. Clone this repo

git clone https://github.com/man4ish/omnibioai-test-data.git
cd omnibioai-test-data

2. Download test data

# Download everything (~65 GB total)
bash download_test_data.sh

# Download a specific domain only
bash download_test_data.sh --domain rnaseq
bash download_test_data.sh --domain wgs
bash download_test_data.sh --domain singlecell

3. Validate your local data

bash validate_test_data.sh

Expected output:

=== SUMMARY ===
  PASS: 35
  FAIL/MISSING: 0
  Total size: ~65G
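
For scripted use (for example, failing a CI job when data is missing), the summary block can be parsed. This is a minimal sketch that assumes the output matches the format shown above exactly; `parse_summary` is a hypothetical helper, not part of this repository.

```python
import re


def parse_summary(text: str) -> dict:
    """Extract the PASS and FAIL/MISSING counts from the summary block."""
    counts = {}
    for label, key in (("PASS", "passed"), ("FAIL/MISSING", "failed")):
        # Match a whole line such as "  PASS: 35".
        pattern = rf"^\s*{re.escape(label)}:\s*(\d+)\s*$"
        match = re.search(pattern, text, re.MULTILINE)
        if match is None:
            raise ValueError(f"summary line for {label!r} not found")
        counts[key] = int(match.group(1))
    return counts
```

The text could come from capturing the script's standard output, e.g. `subprocess.run(["bash", "validate_test_data.sh"], capture_output=True, text=True).stdout`, and a non-zero `failed` count would then signal an incomplete download.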

Running Bundle Tests

Once data is downloaded, run the full bundle test suite via OmniBioAI TES:

# Run all bundles
python omnibioai-workflow-bundles/scripts/run_bundle_tests.py \
    --data-map bundle_data_map.json \
    --output logs/

# Run a specific domain
python omnibioai-workflow-bundles/scripts/run_bundle_tests.py \
    --domain rnaseq \
    --data-map bundle_data_map.json
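
To run several domains in sequence, the per-domain command can be wrapped in a small driver. The sketch below uses only the `--domain` and `--data-map` flags shown above; the helper names are hypothetical, and it assumes the script exits with code 0 on success, which this README does not state.

```python
import subprocess
import sys

SCRIPT = "omnibioai-workflow-bundles/scripts/run_bundle_tests.py"


def build_command(domain: str, data_map: str = "bundle_data_map.json") -> list[str]:
    """Argument list mirroring the per-domain invocation shown above."""
    return [sys.executable, SCRIPT, "--domain", domain, "--data-map", data_map]


def run_domains(domains: list[str]) -> dict[str, int]:
    """Run each domain in turn and collect exit codes (0 assumed to mean success)."""
    return {d: subprocess.run(build_command(d)).returncode for d in domains}
```

For example, `run_domains(["rnaseq", "wgs"])` would invoke the test script twice. Only `rnaseq`, `wgs`, and `singlecell` appear as domain names in this README (with the download script); other names would need to be checked against the scripts themselves.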

Data Sources

All test data is sourced from public repositories:

| Dataset | Accession / URL |
|---|---|
| NA12878 WGS/WES | GIAB |
| PBMC 1k v3 scRNA-seq | 10x Genomics |
| Xenium spatial | 10x Genomics |
| Visium brain | 10x Genomics |
| ATAC-seq | SRR5799393 |
| ChIP-seq | SRR227563 |
| Methylation | SRR3368180 |
| Microbiome 16S | SRR2822457 |
| hg38 / mm10 | UCSC |
| GENCODE v44 | GENCODE |
| SILVA 138 | SILVA |
| dbSNP | NCBI |
| ClinVar | NCBI |
