biogen_analytics

Bioinformatics pipeline for genomic analysis of FASTA sequences: DNA/protein metrics calculation, NCBI homology search, statistical modeling with linear regression, and anomaly detection.

Description

This project analyzes human DNA sequences (FASTA format) by applying a 4-phase pipeline:

1. Genomic Processing: GC%, length, DNA→protein translation, molecular weight, aromaticity.
2. Online Homology (optional): BLAST search against the NCBI nt database and download of reference sequences.
3. AI Model: linear regression (length → GC%) with anomaly detection (>2σ).
4. Visualization: correlation heatmap generated with seaborn.
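
As a rough illustration of the genomic processing phase, the sketch below computes length, GC%, and aromaticity in plain Python. It is not the project's code: the Tech Stack lists BioPython for these metrics, and the function names `gc_percent` and `aromaticity` and the toy sequences are made up for this example. Aromaticity is taken here as the fraction of Phe, Trp, and Tyr residues.

```python
def gc_percent(dna: str) -> float:
    """Return the percentage of G and C bases in a DNA string."""
    dna = dna.upper()
    if not dna:
        return 0.0
    gc = sum(1 for base in dna if base in "GC")
    return 100.0 * gc / len(dna)


def aromaticity(protein: str) -> float:
    """Return the fraction of aromatic residues (F, W, Y) in a protein string."""
    protein = protein.upper()
    if not protein:
        return 0.0
    aromatic = sum(1 for residue in protein if residue in "FWY")
    return aromatic / len(protein)


# Toy inputs, not taken from the sample genes.
# "MAFW" is the translation of the DNA below with the stop codon dropped.
dna = "ATGGCCTTCTGGTAA"
protein = "MAFW"

metrics = {
    "length": len(dna),
    "gc_percent": gc_percent(dna),
    "aromaticity": aromaticity(protein),
}
print(metrics)
```

In the real pipeline these values would presumably be computed per FASTA record and written to features_genes.csv.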

The included sample genes are: TP53, BRCA1, APOE, and KCNJ1.

Project Structure

Biogen_analytics/
├── data/
│   ├── raw/                  # Input FASTAs (.fasta / .fa)
│   ├── processed/            # features_genes.csv (generated)
│   └── reference/            # References downloaded from NCBI (online mode)
├── graficos/                 # heatmap_correlacion.png (generated)
├── logs/                     # pipeline.log (generated)
├── resultados/               # Results CSV and PNG (generated)
├── src/
│   ├── __init__.py           # CLI entry point
│   ├── descarga_sec.py       # Bulk download of FASTAs from NCBI
│   ├── fetch_tools.py        # Reference FASTA downloading from NCBI
│   ├── homology_search.py    # BLAST + GenBank details extraction
│   ├── model_ia.py           # Linear regression + anomaly detector
│   ├── procesado_sec.py      # DNA/Protein metrics orchestrator
│   └── visualizer.py         # Correlation heatmap with seaborn
├── tests/
│   └── test_pipeline.py      # 29 unit tests (pytest)
├── .env.example              # Environment variables template
├── requirements.txt
├── main.py                   # Orchestrator: DNA + protein metrics
└── setup.py                  # Folder initialization script

Installation

git clone https://github.com/tu_usuario/BioGen-Predictive-Pipeline.git
cd BioGen-Predictive-Pipeline

python -m venv venv
# Windows:
venv\Scripts\activate
# Linux/Mac:
source venv/bin/activate

pip install -r requirements.txt

Configure NCBI Credentials

Copy .env.example to .env and add your email:

cp .env.example .env

ENTREZ_EMAIL=tu_email@ejemplo.com

An email address is required for NCBI API calls. Without it, the --online mode will fail.
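
A fail-fast check for this variable might look like the sketch below. It reads `ENTREZ_EMAIL` from a mapping (the process environment by default) and raises a clear error when it is missing. The function name `get_entrez_email` is hypothetical. In the project, python-dotenv would presumably load .env into the environment first and the value would then be assigned to `Bio.Entrez.email`, but that wiring is an assumption.

```python
import os


def get_entrez_email(env=os.environ) -> str:
    """Return ENTREZ_EMAIL from the given mapping, or raise if it is missing."""
    email = env.get("ENTREZ_EMAIL", "").strip()
    if not email:
        raise RuntimeError(
            "ENTREZ_EMAIL is not set; copy .env.example to .env and add your email."
        )
    return email


# Example with an explicit mapping instead of the real environment.
email = get_entrez_email({"ENTREZ_EMAIL": "tu_email@ejemplo.com"})
print(email)
```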

Usage

Offline Mode (Local analysis, no NCBI)

python main.py

Online Mode (BLAST + reference download)

python main.py --online
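
A boolean switch such as `--online` is commonly implemented with `argparse` and `action="store_true"`, so the pipeline stays offline unless the flag is passed. The sketch below shows that pattern. How the project actually parses its arguments is not stated here, so `build_parser` and the help text are illustrative only.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a single optional --online switch."""
    parser = argparse.ArgumentParser(description="Genomic analysis pipeline (sketch)")
    parser.add_argument(
        "--online",
        action="store_true",  # defaults to False when the flag is absent
        help="run the BLAST search and download references from NCBI",
    )
    return parser


# Parse explicit argument lists instead of sys.argv for demonstration.
offline_args = build_parser().parse_args([])
online_args = build_parser().parse_args(["--online"])
print(offline_args.online, online_args.online)
```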

Tests

pytest tests/ -v

Expected output: 29 passed.

Tech Stack

Library	Usage
BioPython ≥1.81	FASTA parsing, BLAST, Entrez, ProteinAnalysis
pandas ≥2.0	DataFrames and CSV export
scikit-learn ≥1.3	Linear regression
seaborn / matplotlib	Correlation heatmap
python-dotenv	Secure credentials management
pytest	Unit testing

Dataset

The sequences included in data/raw/ are public RefSeq records from NCBI:

Gene	Accession	Function
TP53	NM_000546	Tumor suppressor
BRCA1	NM_007294	DNA repair
APOE	NM_000041	Lipid metabolism
KCNJ1	NM_001301717	Potassium channel
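
For readers unfamiliar with the FASTA layout of these records, the sketch below reads header and sequence pairs from FASTA text using only the standard library. It is a simplified stand-in for `Bio.SeqIO.parse(path, "fasta")`, which the BioPython dependency provides. The function name `read_fasta` and the two records are invented for the example.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped; join them and normalize case.
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


# Made-up records; real RefSeq files have much longer sequences.
sample = io.StringIO(
    ">seq1 example record\nATGGCC\nTTCTGG\n>seq2 another record\nggccta\n"
)
records = list(read_fasta(sample))
print(records)
```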

Author

Carlos Garcia Corona
