AutoBioPipe is a Python CLI for sequence-file quality control that goes beyond basic FASTA/FASTQ validation.
It detects file structure, computes core QC metrics, adds lightweight biological interpretation, renders plots, and produces terminal, JSON, CSV, and PDF reports in one run.
AutoBioPipe is aimed at fast first-pass QC for researchers, student projects, prototypes, and command-line workflows where you want something more informative than "file parsed successfully" but lighter than a full downstream pipeline stack.
Most quick QC scripts stop at counts and averages. AutoBioPipe is designed to give you a fuller first-pass read on a dataset:
- format detection for FASTA, FASTQ, and .gz inputs
- streaming parsers with explicit error handling
- GC, length, quality, and N-content QC metrics
- lightweight biological heuristics for coding potential, GC context, duplication, low complexity, homopolymers, and GC skew
- generated plots for GC distribution, read lengths, and quality profiles
- a configurable decision engine with rule-based findings and severity scoring
- machine-readable and human-readable reports from the same run
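To make the per-sequence metrics in the list above concrete, here is a minimal sketch of GC content, N fraction, and GC skew. The function names and the choice of denominators are assumptions made for illustration; AutoBioPipe's own implementation may define them differently.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Percentage of G and C among unambiguous (A/C/G/T) bases."""
    counts = Counter(seq.upper())
    gc = counts["G"] + counts["C"]
    acgt = gc + counts["A"] + counts["T"]
    return 100.0 * gc / acgt if acgt else 0.0


def n_fraction(seq: str) -> float:
    """Fraction of all positions that are the ambiguous base N."""
    return seq.upper().count("N") / len(seq) if seq else 0.0


def gc_skew(seq: str) -> float:
    """Classic GC skew, (G - C) / (G + C); 0.0 when there is no G or C."""
    counts = Counter(seq.upper())
    g, c = counts["G"], counts["C"]
    return (g - c) / (g + c) if (g + c) else 0.0
```

Excluding N from the GC denominator is one common convention; counting every position instead would lower the reported GC percentage for N-rich reads.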
AutoBioPipe runs the following stages:
detect -> parse -> qc -> biology -> visualization -> decision -> report
That means a single command can tell you:
- what kind of file you gave it
- whether the records parse cleanly
- whether basic QC metrics look suspicious
- whether the sequences show biologically unusual patterns
- what concrete findings were triggered
AutoBioPipe currently supports:
- FASTA
- FASTQ
- gzipped FASTA and FASTQ
- single-end files
- paired-end filename hints inferred from names such as R1 and R2
Important detail:
- paired-end detection is heuristic and filename-based
- it is helpful metadata, not a guarantee about the experiment design
- AutoBioPipe currently analyzes one input file at a time rather than jointly processing read pairs
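A filename-based hint of this kind can be sketched with a regular expression. The pattern and the function name `paired_end_hint` below are hypothetical and only illustrate the idea; the actual heuristic in AutoBioPipe may accept other naming schemes.

```python
import os
import re
from typing import Optional

# Assumed convention: an R1/R2 token delimited by '_', '.', '-' or the
# start/end of the file name, e.g. sample_R1.fastq.gz or run_R2_001.fastq.
_READ_TOKEN = re.compile(r"(?:^|[._-])R([12])(?=[._-]|$)", re.IGNORECASE)


def paired_end_hint(path: str) -> Optional[str]:
    """Return "R1" or "R2" if the file name suggests a mate, else None."""
    match = _READ_TOKEN.search(os.path.basename(path))
    return f"R{match.group(1)}" if match else None
```

Because the hint comes only from the name, a file such as `SRR1.fastq` is deliberately not treated as a mate here, and a match says nothing about whether the matching partner file exists.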
git clone https://github.com/NischayMehta403/autobiopipe.git
cd autobiopipe
python -m venv .venv
source .venv/bin/activate
pip install -e .
For development:
pip install -e .[dev]
Using a virtual environment is recommended. On many Linux distributions, installing directly into the system Python environment is blocked by default under PEP 668.
Run the full pipeline on the bundled sample FASTQ:
autobio run examples/sample.fastq --output-dir results
Inspect detection only:
autobio detect examples/sample.fasta
Render a saved JSON report back to the terminal:
autobio report results/sample.fastq_report.json
Show the installed version:
autobio version
Run a FASTA example:
autobio run examples/sample.fasta --output-dir results_fasta
Run with more verbose logging:
autobio run examples/sample.fastq --output-dir results --verbose
A typical run writes:
- *_report.json
- *_report.csv
- *_report.pdf
- gc_distribution.png
- length_distribution.png
- quality_profile.png
The JSON report is the canonical structured output. CSV is useful for spreadsheets and downstream tabular review. PDF is intended for quick sharing and archival summaries.
Generated plots include:
- GC distribution
- sequence or read length distribution
- per-position quality profile for FASTQ inputs
One practical workflow looks like this:
- Run autobio detect to confirm the file type and compression handling look correct.
- Run autobio run to generate reports and plots.
- Inspect the terminal summary for immediate findings.
- Use the JSON output for programmatic downstream use.
- Share the PDF or CSV output with collaborators if needed.
autobio run <input_file> [--output-dir DIR] [--config FILE] [--verbose]
- runs the complete pipeline
- returns exit code 0 for PASS or WARNING
- returns exit code 1 for runtime or input errors
- returns exit code 2 when the pipeline completes but the overall decision is FAIL
autobio detect <input_file>
- prints detected file type, compression status, confidence, and inferred sequencing layout
autobio report <json_file>
- re-renders a previously generated JSON report in the terminal
autobio version
- prints the AutoBioPipe version and Python version
The decision engine currently includes rules QC001 through QC015.
Covered finding areas include:
- low average quality
- GC anomaly or extreme GC content
- minimum length and abnormal length spread
- high ambiguous-base fraction
- suspiciously small datasets
- low complexity and homopolymer-heavy sequences
- weak coding potential or ORF absence
- suspicious duplication
- GC skew imbalance
- unusually short sequences
Findings are reported with weighted severities:
- INFO = 1
- WARNING = 2
- CRITICAL = 3
This makes the final status more informative than a simple single-threshold pass/fail.
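One way weighted severities could roll up into an overall status is sketched below. Only the three weights come from this README; the `Finding` class, the `summarize` function, the example messages, and especially the mapping from worst severity to PASS, WARNING, or FAIL are assumptions for illustration, not the engine's documented behavior.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

# Weights as listed in this README.
SEVERITY_WEIGHTS = {"INFO": 1, "WARNING": 2, "CRITICAL": 3}


@dataclass(frozen=True)
class Finding:
    rule_id: str   # e.g. "QC001"; the pairing with a message is made up here
    severity: str  # one of the keys in SEVERITY_WEIGHTS
    message: str


def summarize(findings: Iterable[Finding]) -> Tuple[int, str]:
    """Return (total weighted score, overall status) for a set of findings."""
    weights = [SEVERITY_WEIGHTS[f.severity] for f in findings]
    score = sum(weights)
    worst = max(weights, default=0)
    # Assumed policy: any CRITICAL fails the run, any WARNING downgrades it.
    if worst >= SEVERITY_WEIGHTS["CRITICAL"]:
        return score, "FAIL"
    if worst == SEVERITY_WEIGHTS["WARNING"]:
        return score, "WARNING"
    return score, "PASS"
```

Keeping the total score alongside the status lets two WARNING runs be compared, which a single pass/fail flag cannot do.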
AutoBioPipe uses TOML configuration files.
Examples:
Main config sections:
- [qc] for core QC thresholds
- [biology] for biological heuristics and advanced rule thresholds
- [visualization] for plot settings
- [pipeline] for runtime and output behavior
Run with a custom config:
autobio run examples/sample.fastq --config examples/config.toml
Representative configuration knobs include:
- qc.min_avg_quality
- qc.gc_content_min
- qc.gc_content_max
- qc.min_read_length
- qc.small_dataset_threshold
- biology.low_complexity_fraction_threshold
- biology.homopolymer_fraction_threshold
- biology.duplicate_fraction_warning_threshold
- visualization.dpi
- pipeline.output_dir
- pipeline.max_records
Minimal example:
[qc]
min_avg_quality = 25.0
gc_content_min = 30.0
gc_content_max = 70.0
[pipeline]
output_dir = "results"
max_records = 5000
Use the stricter example config if you want a harsher QC posture for demos or experiments.
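As a rough illustration of how the minimal configuration above parses, the sketch below reads the same TOML with the standard library. The `load_config` helper and its GC range check are hypothetical and are not AutoBioPipe's own loader.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport for older interpreters

MINIMAL_CONFIG = """
[qc]
min_avg_quality = 25.0
gc_content_min = 30.0
gc_content_max = 70.0

[pipeline]
output_dir = "results"
max_records = 5000
"""


def load_config(text: str) -> dict:
    """Parse TOML text and apply one illustrative sanity check."""
    config = tomllib.loads(text)
    qc = config.get("qc", {})
    low, high = qc.get("gc_content_min"), qc.get("gc_content_max")
    if low is not None and high is not None and low > high:
        raise ValueError("qc.gc_content_min must not exceed qc.gc_content_max")
    return config


config = load_config(MINIMAL_CONFIG)
print(config["qc"]["min_avg_quality"], config["pipeline"]["max_records"])
```

Dotted knob names such as qc.min_avg_quality correspond to a key inside the matching TOML table, so they appear as nested dictionary lookups after parsing.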
The generated JSON report includes top-level sections for:
- detection
- qc
- decision
- visualizations
That makes it suitable for:
- scripting
- notebook analysis
- downstream dashboards
- regression checks in automated workflows
The terminal report is intended for fast review, while JSON is the best format for automation.
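For scripted use, a report can be loaded and checked for the top-level sections named above before anything downstream reads it. This is a sketch only: `load_report` is a hypothetical helper, and the keys inside each section are not documented here, so the code does not assume any.

```python
import json
from pathlib import Path

# Top-level sections listed in this README.
EXPECTED_SECTIONS = ("detection", "qc", "decision", "visualizations")


def load_report(path: str) -> dict:
    """Load a JSON report and fail early if a top-level section is missing."""
    report = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [name for name in EXPECTED_SECTIONS if name not in report]
    if missing:
        raise KeyError(f"report is missing sections: {missing}")
    return report
```

A call such as `load_report("results/sample.fastq_report.json")` would then return a plain dictionary that notebooks or regression checks can inspect.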
AutoBioPipe uses exit codes deliberately:
- 0 means the run succeeded and the overall status was PASS or WARNING
- 1 means there was an input or runtime error
- 2 means the pipeline completed, but the rule engine classified the result as FAIL
That makes it easy to integrate into shell scripts and CI pipelines.
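A wrapper script might translate these exit codes as follows. The command line uses only the arguments documented in this README; the helper names `describe_exit_code` and `run_autobio` are hypothetical.

```python
import subprocess

# Meanings as documented for the autobio CLI.
EXIT_MEANINGS = {
    0: "run succeeded with overall status PASS or WARNING",
    1: "input or runtime error",
    2: "pipeline completed but overall decision is FAIL",
}


def describe_exit_code(code: int) -> str:
    """Map an autobio exit code to a human-readable explanation."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_autobio(input_file: str, output_dir: str = "results") -> int:
    """Run the full pipeline and return its exit code without raising."""
    completed = subprocess.run(
        ["autobio", "run", input_file, "--output-dir", output_dir],
        check=False,  # codes 1 and 2 are meaningful results, not exceptions
    )
    return completed.returncode
```

In a CI job, treating code 2 separately from code 1 lets a pipeline distinguish "the data failed QC" from "the tool could not run".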
Some implementation choices are intentional:
- parsers are generator-based for memory efficiency
- configuration is TOML-based and human-editable
- decision rules are registry-driven so new checks can be added cleanly
- biological interpretation is heuristic and interpretable rather than black-box ML
- visualization uses a non-interactive matplotlib backend for CLI and CI safety
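The registry-driven rule design mentioned above can be illustrated with a decorator that records each check under an ID. Everything in this sketch is hypothetical, including the `DEMO` rule IDs, the metric names, and the thresholds; it shows the general pattern rather than AutoBioPipe's actual registry.

```python
from typing import Callable, Dict, List, Optional, Tuple

Rule = Callable[[dict], Optional[str]]
RULES: Dict[str, Rule] = {}


def rule(rule_id: str) -> Callable[[Rule], Rule]:
    """Register a check function under a unique rule ID."""
    def decorator(func: Rule) -> Rule:
        if rule_id in RULES:
            raise ValueError(f"duplicate rule id: {rule_id}")
        RULES[rule_id] = func
        return func
    return decorator


@rule("DEMO001")
def low_average_quality(metrics: dict) -> Optional[str]:
    return "average quality below 25" if metrics["avg_quality"] < 25.0 else None


@rule("DEMO002")
def high_n_fraction(metrics: dict) -> Optional[str]:
    return "N fraction above 5%" if metrics["n_fraction"] > 0.05 else None


def evaluate(metrics: dict) -> List[Tuple[str, str]]:
    """Run every registered rule and collect (rule_id, message) findings."""
    findings = []
    for rule_id, check in sorted(RULES.items()):
        message = check(metrics)
        if message is not None:
            findings.append((rule_id, message))
    return findings
```

With this pattern, adding a check means writing one decorated function; the evaluation loop does not change.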
- autobiopipe/ contains the CLI and pipeline modules
- tests/ contains parser, QC, decision, pipeline, visualization, biology, and CLI coverage
- docs/ARCHITECTURE.md explains the stage design
- docs/CONTRIBUTING.md documents contribution guidance
The test suite is exercised with pytest, and the repository includes GitHub Actions CI in .github/workflows/test.yml.
Recent local verification:
- 112 passed
- coverage threshold met at 90.56%
Run locally:
pytest
AutoBioPipe is maintained by NischayMehta403.
This project is released under the MIT License.
See LICENSE for the full text.
- Empty files, malformed FASTQ records, multiline FASTA sequences, gzip compression, and paired-end filename hints are handled explicitly.
- The bundled example files are intentionally small so the project stays lightweight for demos and tests.
- examples/sample.fastq is intentionally harsh under the advanced rule set and may return FAIL; treat it as a regression fixture, not as a biologically representative "good" sample.
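To show what explicit handling of multiline FASTA, gzip input, and empty files can look like in a streaming parser, here is a minimal generator-based sketch. The names `open_text` and `iter_fasta` and the error messages are assumptions for illustration, not the project's parser.

```python
import gzip
from typing import Iterable, Iterator, TextIO, Tuple


def open_text(path: str) -> TextIO:
    """Open a plain or gzip-compressed file in text mode, chosen by suffix."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def iter_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs one record at a time."""
    header = None
    chunks = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError(f"line {line_no}: sequence data before first '>' header")
        else:
            chunks.append(line)  # multiline sequences are joined per record
    if header is None:
        raise ValueError("no FASTA records found")
    yield header, "".join(chunks)
```

Because `iter_fasta` is a generator, only one record is held in memory at a time, and errors surface when the offending line is reached rather than up front.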