AutoBioPipe V 0.1.0

AutoBioPipe is a Python CLI for sequence-file quality control that goes beyond basic FASTA/FASTQ validation.

It detects file structure, computes core QC metrics, adds lightweight biological interpretation, renders plots, and produces terminal, JSON, CSV, and PDF reports in one run.

AutoBioPipe is aimed at fast first-pass QC for researchers, student projects, prototypes, and command-line workflows where you want something more informative than "file parsed successfully" but lighter than a full downstream pipeline stack.

Why AutoBioPipe

Most quick QC scripts stop at counts and averages. AutoBioPipe is designed to give you a fuller first-pass read on a dataset:

format detection for FASTA, FASTQ, and .gz inputs
streaming parsers with explicit error handling
GC, length, quality, and N-content QC metrics
lightweight biological heuristics for coding potential, GC context, duplication, low complexity, homopolymers, and GC skew
generated plots for GC distribution, read lengths, and quality profiles
a configurable decision engine with rule-based findings and severity scoring
machine-readable and human-readable reports from the same run
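
To make the metric names above concrete, here is a minimal sketch of how a few of them can be computed for a single sequence. It is not taken from AutoBioPipe's source: the function names are hypothetical, and the exact definitions (for example, counting N in the GC denominator) are assumptions that may differ from what the tool does.

```python
from itertools import groupby


def gc_fraction(seq: str) -> float:
    """Fraction of G and C among all bases (assumed definition: N counts in the denominator)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def n_fraction(seq: str) -> float:
    """Fraction of ambiguous 'N' bases in the sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return seq.count("N") / len(seq)


def gc_skew(seq: str) -> float:
    """Standard GC skew, (G - C) / (G + C); 0.0 when the sequence has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated base."""
    return max((len(list(run)) for _, run in groupby(seq.upper())), default=0)
```

For example, `gc_skew("GGGC")` is (3 - 1) / (3 + 1) = 0.5, and `longest_homopolymer("ACAAAAGT")` is 4.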

Pipeline At A Glance

AutoBioPipe runs the following stages:

detect -> parse -> qc -> biology -> visualization -> decision -> report

That means a single command can tell you:

what kind of file you gave it
whether the records parse cleanly
whether basic QC metrics look suspicious
whether the sequences show biologically unusual patterns
what concrete findings were triggered

Supported Inputs

AutoBioPipe currently supports:

FASTA
FASTQ
gzipped FASTA and FASTQ
single-end files
paired-end filename hints inferred from names such as R1 and R2

Important details:

paired-end detection is heuristic and filename-based
it is helpful metadata, not a guarantee about the experiment design
AutoBioPipe currently analyzes one input file at a time rather than jointly processing read pairs
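
A filename-based hint of this kind can be sketched as follows. This is an illustration only: the regular expression and the `infer_mate_hint` name are assumptions, not AutoBioPipe's actual detection logic, and the pattern simply looks for `R1` or `R2` set off by `_`, `.`, or `-`.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical pattern: R1/R2 preceded by a delimiter (or the start of the name)
# and followed by a delimiter (or the end of the name).
_MATE_PATTERN = re.compile(r"(?:^|[_.\-])R([12])(?=[_.\-]|$)", re.IGNORECASE)


def infer_mate_hint(filename: str) -> Optional[str]:
    """Return 'R1' or 'R2' if the file name suggests a paired-end mate, else None."""
    name = Path(filename).name  # ignore directories; only the file name is inspected
    match = _MATE_PATTERN.search(name)
    return f"R{match.group(1)}" if match else None
```

Because the check never opens the file, `sample_R1.fastq.gz` yields `R1` whether or not a matching `R2` file exists, which is why the result is best read as metadata rather than proof of a paired-end experiment.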

Installation

git clone https://github.com/NischayMehta403/autobiopipe.git
cd autobiopipe
python -m venv .venv
source .venv/bin/activate
pip install -e .

For development:

pip install -e ".[dev]"

Using a virtual environment is recommended. On many Linux distributions, installing directly into the system Python environment is blocked by default under PEP 668.

Quick Start

Run the full pipeline on the bundled sample FASTQ:

autobio run examples/sample.fastq --output-dir results

Inspect detection only:

autobio detect examples/sample.fasta

Render a saved JSON report back to the terminal:

autobio report results/sample.fastq_report.json

Show the installed version:

autobio version

Run a FASTA example:

autobio run examples/sample.fasta --output-dir results_fasta

Run with more verbose logging:

autobio run examples/sample.fastq --output-dir results --verbose

What You Get

A typical run writes:

*_report.json
*_report.csv
*_report.pdf
gc_distribution.png
length_distribution.png
quality_profile.png

The JSON report is the canonical structured output. CSV is useful for spreadsheets and downstream tabular review. PDF is intended for quick sharing and archival summaries.

Generated plots include:

GC distribution
sequence or read length distribution
per-position quality profile for FASTQ inputs

Example Workflow

One practical workflow looks like this:

Run autobio detect to confirm the file type and compression handling look correct.
Run autobio run to generate reports and plots.
Inspect the terminal summary for immediate findings.
Use the JSON output for programmatic downstream use.
Share the PDF or CSV output with collaborators if needed.

CLI

autobio run <input_file> [--output-dir DIR] [--config FILE] [--verbose]

runs the complete pipeline
returns exit code 0 for PASS or WARNING
returns exit code 1 for runtime or input errors
returns exit code 2 when the pipeline completes but the overall decision is FAIL

autobio detect <input_file>

prints detected file type, compression status, confidence, and inferred sequencing layout
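
The idea behind content-based detection can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the tool's implementation: `sniff_format` is a hypothetical name, it checks only the gzip magic bytes and the first non-blank character, and it does not compute the confidence value that autobio detect reports.

```python
import gzip
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def sniff_format(path):
    """Return (file_type, is_gzipped), where file_type is 'fasta', 'fastq', 'unknown', or 'empty'."""
    path = Path(path)
    with path.open("rb") as handle:
        is_gzipped = handle.read(2) == GZIP_MAGIC

    opener = gzip.open if is_gzipped else open
    with opener(path, "rt", encoding="ascii", errors="replace") as handle:
        for line in handle:
            stripped = line.strip()
            if not stripped:
                continue  # skip leading blank lines
            if stripped.startswith(">"):
                return "fasta", is_gzipped
            if stripped.startswith("@"):
                return "fastq", is_gzipped
            return "unknown", is_gzipped
    return "empty", is_gzipped
```

Only the first record is inspected, so a file that starts correctly and is malformed later would still need the parse stage to catch the problem.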

autobio report <json_file>

re-renders a previously generated JSON report in the terminal

autobio version

prints the AutoBioPipe version and Python version

Decision Rules

The decision engine currently includes rules QC001 through QC015.

Covered finding areas include:

low average quality
GC anomaly or extreme GC content
minimum length and abnormal length spread
high ambiguous-base fraction
suspiciously small datasets
low complexity and homopolymer-heavy sequences
weak coding potential or ORF absence
suspicious duplication
GC skew imbalance
unusually short sequences

Findings are reported with weighted severities:

INFO = 1
WARNING = 2
CRITICAL = 3

This makes the final status more informative than a simple single-threshold pass/fail.
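
A registry of rules with weighted severities could look roughly like the sketch below. The weights come from the list above; everything else is an assumption made for illustration. The rule IDs (`DEMO001`, `DEMO002`), the metric keys, the thresholds, and in particular the way findings are combined into a final status are hypothetical and may not match how AutoBioPipe aggregates its results.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

SEVERITY_WEIGHTS = {"INFO": 1, "WARNING": 2, "CRITICAL": 3}  # weights as documented above


@dataclass(frozen=True)
class Finding:
    rule_id: str
    severity: str
    message: str


Rule = Callable[[dict], Optional[Finding]]
RULES: Dict[str, Rule] = {}  # registry: rule ID -> rule function


def register(rule_id: str) -> Callable[[Rule], Rule]:
    """Decorator that adds a rule function to the registry."""
    def decorator(func: Rule) -> Rule:
        RULES[rule_id] = func
        return func
    return decorator


@register("DEMO001")  # hypothetical rule ID and threshold
def low_average_quality(metrics: dict) -> Optional[Finding]:
    quality = metrics.get("avg_quality")  # absent for inputs without quality scores
    if quality is not None and quality < 25.0:
        return Finding("DEMO001", "CRITICAL", "average quality below 25")
    return None


@register("DEMO002")  # hypothetical rule ID and threshold
def small_dataset(metrics: dict) -> Optional[Finding]:
    if metrics.get("record_count", 0) < 100:
        return Finding("DEMO002", "INFO", "fewer than 100 records")
    return None


def evaluate(metrics: dict) -> Tuple[str, int, List[Finding]]:
    """Run every registered rule and return (status, weighted score, findings)."""
    findings = [f for rule in RULES.values() if (f := rule(metrics)) is not None]
    score = sum(SEVERITY_WEIGHTS[f.severity] for f in findings)
    # Assumed aggregation: any CRITICAL fails the run, any WARNING downgrades it.
    severities = {f.severity for f in findings}
    if "CRITICAL" in severities:
        status = "FAIL"
    elif "WARNING" in severities:
        status = "WARNING"
    else:
        status = "PASS"
    return status, score, findings
```

With this layout, adding a check means writing one decorated function; the evaluation loop does not change.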

Configuration

AutoBioPipe uses TOML configuration files.

Example configs: examples/config.toml and examples/config_strict.toml

Main config sections:

[qc] for core QC thresholds
[biology] for biological heuristics and advanced rule thresholds
[visualization] for plot settings
[pipeline] for runtime and output behavior

Run with a custom config:

autobio run examples/sample.fastq --config examples/config.toml

Representative configuration knobs include:

qc.min_avg_quality
qc.gc_content_min
qc.gc_content_max
qc.min_read_length
qc.small_dataset_threshold
biology.low_complexity_fraction_threshold
biology.homopolymer_fraction_threshold
biology.duplicate_fraction_warning_threshold
visualization.dpi
pipeline.output_dir
pipeline.max_records

Minimal example:

[qc]
min_avg_quality = 25.0
gc_content_min = 30.0
gc_content_max = 70.0

[pipeline]
output_dir = "results"
max_records = 5000
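
If you want to sanity-check a config before handing it to the CLI, the minimal example above can be parsed with Python's standard TOML reader. This is a reader-side sketch, not AutoBioPipe's own config loader; `load_and_check` is a hypothetical helper and the range check is an assumption about what a sensible config looks like.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:  # older interpreters need the third-party 'tomli' package
    import tomli as tomllib

CONFIG_TEXT = """
[qc]
min_avg_quality = 25.0
gc_content_min = 30.0
gc_content_max = 70.0

[pipeline]
output_dir = "results"
max_records = 5000
"""


def load_and_check(text: str) -> dict:
    """Parse TOML text and verify the GC bounds are ordered sensibly."""
    config = tomllib.loads(text)
    qc = config.get("qc", {})
    low, high = qc.get("gc_content_min"), qc.get("gc_content_max")
    if low is not None and high is not None and low >= high:
        raise ValueError("qc.gc_content_min must be below qc.gc_content_max")
    return config


config = load_and_check(CONFIG_TEXT)
print(config["qc"]["min_avg_quality"], config["pipeline"]["max_records"])
```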

Use the stricter example config if you want a harsher QC posture for demos or experiments:

examples/config_strict.toml

Output Structure

The generated JSON report includes top-level sections for:

detection
qc
decision
visualizations

That makes it suitable for:

scripting
notebook analysis
downstream dashboards
regression checks in automated workflows

The terminal report is intended for fast review, while JSON is the best format for automation.
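
For automation, a script can load the JSON report and confirm the documented top-level sections are present before reading further. The sketch below checks only those four section names; the keys inside each section are not described here, so it makes no assumptions about them. The helper names are hypothetical.

```python
import json

EXPECTED_SECTIONS = ("detection", "qc", "decision", "visualizations")


def missing_sections(report: dict) -> list:
    """Return the documented top-level sections that are absent from the report."""
    return [name for name in EXPECTED_SECTIONS if name not in report]


def load_report(path: str) -> dict:
    """Load a JSON report and fail early if a top-level section is missing."""
    with open(path, encoding="utf-8") as handle:
        report = json.load(handle)
    missing = missing_sections(report)
    if missing:
        raise KeyError(f"report is missing sections: {', '.join(missing)}")
    return report
```

After a run such as the Quick Start example, `load_report("results/sample.fastq_report.json")` would return the parsed report as a dictionary.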

Exit Codes

AutoBioPipe uses exit codes deliberately:

0 means the run succeeded and the overall status was PASS or WARNING
1 means there was an input or runtime error
2 means the pipeline completed, but the rule engine classified the result as FAIL

That makes it easy to integrate into shell scripts and CI pipelines.
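
As a sketch of that integration, a Python wrapper can call the CLI and branch on the exit code. The command line uses only the documented `autobio run <input_file> --output-dir DIR` form; the function names and the wording of the messages are illustrative.

```python
import subprocess

# Meanings as documented in the Exit Codes section.
EXIT_MEANINGS = {
    0: "PASS or WARNING",
    1: "input or runtime error",
    2: "FAIL",
}


def describe_exit_code(code: int) -> str:
    """Translate an autobio exit code into a short description."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_qc(input_file: str, output_dir: str) -> int:
    """Run the full pipeline and return its exit code (requires autobio on PATH)."""
    completed = subprocess.run(
        ["autobio", "run", input_file, "--output-dir", output_dir],
        check=False,  # a non-zero code is a result to inspect, not an exception
    )
    return completed.returncode
```

A CI job could then call `run_qc("examples/sample.fastq", "results")` and treat code 2 differently from code 1, for instance by archiving the reports on FAIL but aborting on an input error.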

Design Notes

Some implementation choices are intentional:

parsers are generator-based for memory efficiency
configuration is TOML-based and human-editable
decision rules are registry-driven so new checks can be added cleanly
biological interpretation is heuristic and interpretable rather than black-box ML
visualization uses a non-interactive matplotlib backend for CLI and CI safety
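
To illustrate the generator-based parsing approach, here is a minimal FASTA reader that yields one record at a time and joins multiline sequences. It is a sketch under simple assumptions, not AutoBioPipe's parser: `iter_fasta` is a hypothetical name, and the only error it reports is sequence data appearing before the first header.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def iter_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs lazily from an iterable of text lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)  # multiline sequences are joined on yield
    if header is not None:
        yield header, "".join(chunks)
```

Because the function accepts any iterable of lines, it works directly on an open file handle, so only the current record is held in memory.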

Project Structure

autobiopipe/ contains the CLI and pipeline modules
tests/ contains parser, QC, decision, pipeline, visualization, biology, and CLI coverage
docs/ARCHITECTURE.md explains the stage design
docs/CONTRIBUTING.md documents contribution guidance

Verification

The test suite is exercised with pytest, and the repository includes GitHub Actions CI in .github/workflows/test.yml.

Recent local verification:

112 passed
coverage threshold met at 90.56%

Run locally:

pytest

Author

AutoBioPipe is maintained by NischayMehta403.

License

This project is released under the MIT License.

See LICENSE for the full text.

Notes

Empty files, malformed FASTQ records, multiline FASTA sequences, gzip compression, and paired-end filename hints are handled explicitly.
The bundled example files are intentionally small so the project stays lightweight for demos and tests.
examples/sample.fastq is intentionally harsh under the advanced rule set and may return FAIL; treat it as a regression fixture, not as a biologically representative "good" sample.
