This repository provides a beginner-friendly, reproducible tutorial that demonstrates a typical SNP microarray genotyping flow:
IDAT (raw intensities) → GTC (genotype calling) → VCF (standard variant format) → PLINK (bed/bim/fam)
All steps are executed as Bash commands inside a Jupyter notebook (%%bash), so learners can clearly see inputs → commands → outputs → validation.
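As a rough orientation, the four stages can be sketched as a dry run that only prints the commands it would execute. The variable names and most paths below are placeholders, and the flags are a best-effort reading of the gtc2vcf and PLINK documentation rather than a copy of the notebook; check them against each tool's `--help` before running anything.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder inputs (hypothetical names; the real values live in params.sh).
BPM="data/chip/manifest.bpm"
CSV="data/chip/manifest.csv"
EGT="data/chip/cluster.egt"
REF="data/data-reference/human_v38.fasta"
IDAT_DIR="data/idat"
OUT="output"
VCF="$OUT/vcf/38/out_vcf.vcf.gz"

# Print a command line instead of executing it.
show() { printf '%s\n' "$*"; }

plan() {
  # Step 1, IDAT to GTC: genotype calling with the vendor tool.
  show iaap-cli.sh gencall "$BPM" "$EGT" "$OUT/gtc" --idat-folder "$IDAT_DIR" --output-gtc
  # Step 2, GTC to VCF: convert, sort, then normalize against the reference.
  show "bcftools +gtc2vcf --bpm $BPM --csv $CSV --egt $EGT --gtcs $OUT/gtc --fasta-ref $REF -Ou |" \
       "bcftools sort -Ou |" \
       "bcftools norm -c x -f $REF -Oz -o $VCF"
  # Step 3: index the compressed VCF.
  show bcftools index -f "$VCF"
  # Step 4, VCF to PLINK bed/bim/fam.
  show plink --vcf "$VCF" --make-bed --out "$OUT/plink/38/out_plink"
}

plan_text="$(plan)"
printf '%s\n' "$plan_text"
```

Printing the plan first makes it easy to compare against what the notebook cells actually run before any data is touched.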
- What each file type means (.idat, .gtc, .vcf.gz, .bed/.bim/.fam)
- Why each stage exists (calling, normalization, sorting, indexing, conversion)
- How to inspect outputs to confirm each step worked
- How to keep a pipeline reproducible via a single configuration file (params.sh)
Key files:
- genotyping_pipeline_bash_tutorial.ipynb: main notebook (bash-only tutorial)
- prs-genesis-env.yml: Conda environment (bcftools + gtc2vcf plugin + plink)
- data/: reference files and sample data location (paths configured in params.sh)
- tools/: vendor tool bundle (iaap-cli) + wrapper
- params.sh: pipeline configuration (created/edited from the notebook)
Recommended OS:
- Ubuntu Linux or WSL2 (Windows)
Required tools:
- git
- conda (Miniconda/Anaconda/Mamba)
The pipeline uses:
- iaap-cli (vendor tool for genotype calling; provided separately)
- bcftools (+ gtc2vcf plugin)
- plink
cd ~
git clone https://github.com/synmutanX/genesis.git
cd genesis
conda env create -f prs-genesis-env.yml
conda activate prs-genesis-env
Verify:
bcftools --version | head -n 2
plink --version | head -n 2
If JupyterLab is not installed yet:
conda install -c conda-forge -y jupyterlab
iaap-cli is not installed via Conda. You typically receive it as a .tar.gz bundle.
Assume your .tar.gz is placed under tools/.
cd ~/genesis/tools
tar -xzf iaap-cli-linux-x64-1.1.0-*.tar.gz
rm -rf iaap-cli-bin
mkdir -p iaap-cli-bin
# Copy EVERYTHING from the bundle (important: dll/json/etc.)
cp -a ./iaap-cli-linux-x64-1.1.0-sha.*/iaap-cli/* ./iaap-cli-bin/
chmod +x ./iaap-cli-bin/iaap-cli
On some Ubuntu/WSL setups, iaap-cli can fail with an ICU/globalization error.
Use this wrapper so learners don’t need to export env vars manually each time:
cat > ./iaap-cli-bin/iaap-cli.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
export DOTNET_SYSTEM_GLOBALIZATION_INVARIANT=1
DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
exec "$DIR/iaap-cli" "$@"
EOF
chmod +x ./iaap-cli-bin/iaap-cli.sh
Test:
./iaap-cli-bin/iaap-cli.sh --help
To run the wrapper from any directory, add it to your PATH:
echo 'export PATH="$HOME/genesis/tools/iaap-cli-bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
Test:
iaap-cli.sh --help
If you use zsh, replace ~/.bashrc with ~/.zshrc.
Configure all paths in params.sh (the notebook helps generate this file).
Minimum inputs:
- *.bpm: manifest (binary)
- *.csv: manifest (CSV)
- *.egt: cluster file
- *.fasta: reference genome (e.g., GRCh38)
If you see FASTA index errors, install samtools and index the FASTA:
conda install -c bioconda -y samtools
samtools faidx data/data-reference/human_v38.fasta
A folder containing paired IDAT files:
- *_Red.idat
- *_Grn.idat
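Because each sample needs both a red and a green IDAT file, a quick pairing check before genotype calling can save a failed run. The sketch below is one possible way to do it; the function name `check_idat_pairs` and the demo file names are made up for illustration and are not part of the notebook.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Report any *_Red.idat without a matching *_Grn.idat, and vice versa.
# Prints one line per unpaired file and returns non-zero if any are found.
check_idat_pairs() {
  local dir="$1" missing=0 f mate
  shopt -s nullglob
  for f in "$dir"/*_Red.idat "$dir"/*_Grn.idat; do
    case "$f" in
      *_Red.idat) mate="${f%_Red.idat}_Grn.idat" ;;
      *)          mate="${f%_Grn.idat}_Red.idat" ;;
    esac
    if [[ ! -f "$mate" ]]; then
      echo "unpaired: $f"
      missing=1
    fi
  done
  shopt -u nullglob
  return "$missing"
}

# Demo on a throwaway directory with one complete pair and one orphan.
demo="$(mktemp -d)"
touch "$demo/S1_Red.idat" "$demo/S1_Grn.idat" "$demo/S2_Red.idat"
report="$(check_idat_pairs "$demo" || true)"
echo "$report"
```

In a real run you would point the function at the IDAT folder configured in params.sh instead of the demo directory.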
- asa2rsid.map: chip variant ID → rsID mapping
- chr.map: chromosome rename mapping (e.g., chr1 → 1)
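If you need to build a chromosome rename map yourself, the sketch below writes one in the two-column "old name, new name" layout that `bcftools annotate --rename-chrs` accepts. Whether the notebook consumes chr.map in exactly this way is an assumption, and so is the choice to rename chrM to MT; adjust both to match your reference.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write "old<TAB>new" pairs: chr1 -> 1, ..., chr22 -> 22, chrX -> X, chrY -> Y.
chr_map="$(mktemp)"
for c in {1..22} X Y; do
  printf 'chr%s\t%s\n' "$c" "$c"
done > "$chr_map"

# Assumption: the mitochondrial contig is renamed to MT.
printf 'chrM\tMT\n' >> "$chr_map"

head -n 3 "$chr_map"

# Typical use (not executed here, file names are placeholders):
#   bcftools annotate --rename-chrs chr.map in.vcf.gz -Oz -o renamed.vcf.gz
```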
(Optional for sex-check)
- population frequency file under ${POP_DIR}/${POP}/... (used in the notebook if enabled)
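To illustrate the idea of a single configuration file, here is a minimal sketch of what a params.sh could look like, plus a check that fails early when a required variable is empty. Only `POP_DIR` and `POP` appear in this README; every other variable name, the value `EUR`, and most paths are hypothetical, so treat the notebook-generated file as the source of truth.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir="$(mktemp -d)"

# Hypothetical params.sh: all names except POP_DIR and POP are illustrative.
cat > "$workdir/params.sh" <<'EOF'
BPM="data/chip/manifest.bpm"
CSV="data/chip/manifest.csv"
EGT="data/chip/cluster.egt"
REF_FASTA="data/data-reference/human_v38.fasta"
IDAT_DIR="data/idat"
OUT_DIR="output"
POP_DIR="data/pop"
POP="EUR"
EOF

# Load the configuration, then collect any required variable that is empty.
source "$workdir/params.sh"
unset_vars=""
for name in BPM CSV EGT REF_FASTA IDAT_DIR OUT_DIR; do
  if [[ -z "${!name:-}" ]]; then
    unset_vars+=" $name"
  fi
done

if [[ -n "$unset_vars" ]]; then
  echo "params.sh is missing:$unset_vars" >&2
  exit 1
fi
echo "params.sh OK: reference is $REF_FASTA"
```

Sourcing one file at the top of every step keeps paths in a single place, so changing the reference or chip bundle does not require editing individual cells.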
From the repo root:
cd ~/genesis
conda activate prs-genesis-env
jupyter lab
In JupyterLab:
- Open genotyping_pipeline_bash_tutorial.ipynb
- Run cells top-to-bottom (Step 0 → Step 7)
Even though the kernel may show Python, the tutorial cells run Bash via %%bash.
After a successful run, outputs are written under output/:
- output/gtc/*.gtc
- output/vcf/38/out_vcf.vcf.gz
- output/vcf/38/out_vcf.vcf.csi
- output/vcf/38/out_vcf.tsv
- output/plink/38/out_plink.bed
- output/plink/38/out_plink.bim
- output/plink/38/out_plink.fam
- output/allel/38/variant.list
- output/vcf/38/out23.txt
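One simple way to confirm a run finished is to check that the main outputs exist and are not empty. The sketch below does this for the VCF and the PLINK trio only; the helper name `missing_outputs` is made up, and the demo builds a fake output tree rather than reading real results.

```shell
#!/usr/bin/env bash
set -euo pipefail

# List expected output files that are missing or empty under a given root.
missing_outputs() {
  local root="$1" rel
  for rel in \
    vcf/38/out_vcf.vcf.gz \
    plink/38/out_plink.bed \
    plink/38/out_plink.bim \
    plink/38/out_plink.fam; do
    [[ -s "$root/$rel" ]] || echo "$rel"
  done
}

# Demo on a fake output tree where the .fam file was never written.
root="$(mktemp -d)"
mkdir -p "$root/vcf/38" "$root/plink/38"
echo x > "$root/vcf/38/out_vcf.vcf.gz"
echo x > "$root/plink/38/out_plink.bed"
echo x > "$root/plink/38/out_plink.bim"
missing="$(missing_outputs "$root")"
echo "missing or empty: ${missing:-none}"
```

On real results, `wc -l` on the .fam and .bim files gives the number of samples and variants respectively, since each file has one line per record.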
If the shell cannot find iaap-cli.sh:
- Confirm wrapper exists:
ls -lh ~/genesis/tools/iaap-cli-bin/iaap-cli.sh
- Confirm PATH contains iaap-cli-bin:
echo "$PATH" | tr ':' '\n' | grep iaap-cli-bin
- Test:
iaap-cli.sh --help
If iaap-cli fails with an ICU/globalization error, use the wrapper iaap-cli.sh.
If running directly:
export DOTNET_SYSTEM_GLOBALIZATION_INVARIANT=1
./iaap-cli --help
If bcftools cannot find the gtc2vcf plugin, check:
bcftools plugin -l | grep gtc2vcf
Make sure bcftools-gtc2vcf-plugin is installed in the conda env.
If the FASTA index is missing, create it:
samtools faidx path/to/reference.fasta
To make results consistent across students:
- Use the same OS environment (Ubuntu/WSL recommended)
- Use the same conda environment (prs-genesis-env.yml)
- Use the same input files (chip bundle + reference FASTA + mappings)
- Run the notebook in order
- Keep outputs in the default output/ structure
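One way to verify that two students really use the same input files is to compare checksum manifests. The sketch below is a suggestion, not something the notebook does: the function name `write_manifest` is hypothetical, and it relies on GNU `sha256sum`, `sort -z`, and `xargs -r`, which are available on Ubuntu/WSL.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write a sorted SHA-256 manifest of every file under a directory.
write_manifest() {
  local dir="$1" out="$2"
  ( cd "$dir" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum ) > "$out"
}

# Demo: two copies of the same inputs should give identical manifests.
a="$(mktemp -d)"
b="$(mktemp -d)"
printf 'ACGT\n' > "$a/ref.fasta"
printf 'ACGT\n' > "$b/ref.fasta"
write_manifest "$a" "$a.sha256"
write_manifest "$b" "$b.sha256"

if cmp -s "$a.sha256" "$b.sha256"; then
  same="yes"
else
  same="no"
fi
echo "inputs identical: $same"
```

A manifest written this way can later be re-checked from inside the same directory with `sha256sum -c`.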
- iaap-cli is a vendor tool with its own license/distribution rules.
- Any chip/reference/sample data may have restrictions; only share if permitted.