Bioinformatics pipeline for genomic analysis of FASTA sequences: DNA/protein metrics calculation, NCBI homology search, statistical modeling with linear regression, and anomaly detection.
This project analyzes human DNA sequences (FASTA format) by applying a 4-phase pipeline:
- Genomic Processing — GC%, length, DNA→protein translation, molecular weight, aromaticity.
- Online Homology (optional) — BLAST search against the NCBI `nt` database and reference downloading.
- AI Model — Linear regression (length → GC%) with anomaly detection (>2σ).
- Visualization — Correlation heatmap generated with seaborn.
The included sample genes are: TP53, BRCA1, APOE, and KCNJ1.
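As an illustration of the Genomic Processing phase, the sketch below computes sequence length and GC% from FASTA text using only the standard library. It is not the project's code: the project lists BioPython for FASTA parsing, and the names `read_fasta`, `gc_percent`, and the toy record are hypothetical.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def gc_percent(seq):
    """Percentage of G and C bases; returns 0.0 for an empty sequence."""
    if not seq:
        return 0.0
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


# Made-up toy record, not one of the project's genes.
toy_fasta = StringIO(">toy_record\nATGCGC\nGGAT\n")
features = [
    {"id": header, "length": len(seq), "gc_percent": gc_percent(seq)}
    for header, seq in read_fasta(toy_fasta)
]
print(features)
```

Rows shaped like `features` could then be loaded into a pandas DataFrame and exported to CSV, which is the role the library table gives to pandas.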
```text
Biogen_analytics/
├── data/
│   ├── raw/                 # Input FASTAs (.fasta / .fa)
│   ├── processed/           # features_genes.csv (generated)
│   └── reference/           # References downloaded from NCBI (online mode)
├── graficos/                # heatmap_correlacion.png (generated)
├── logs/                    # pipeline.log (generated)
├── resultados/              # Results CSV and PNG (generated)
├── src/
│   ├── __init__.py          # CLI entry point
│   ├── descarga_sec.py      # Bulk download of FASTAs from NCBI
│   ├── fetch_tools.py       # Reference FASTA downloading from NCBI
│   ├── homology_search.py   # BLAST + GenBank details extraction
│   ├── model_ia.py          # Linear regression + anomaly detector
│   ├── procesado_sec.py     # DNA/Protein metrics orchestrator
│   └── visualizer.py        # Correlation heatmap with seaborn
├── tests/
│   └── test_pipeline.py     # 29 unit tests (pytest)
├── .env.example             # Environment variables template
├── requirements.txt
├── main.py                  # Orchestrator: DNA + protein metrics
└── setup.py                 # Folder initialization script
```
```bash
git clone https://github.com/tu_usuario/BioGen-Predictive-Pipeline.git
cd BioGen-Predictive-Pipeline
python -m venv venv
# Windows:
venv\Scripts\activate
# Linux/Mac:
source venv/bin/activate
pip install -r requirements.txt
```

Copy .env.example to .env and add your email:
```bash
cp .env.example .env
```

```text
ENTREZ_EMAIL=tu_email@ejemplo.com
```

An email address is required for NCBI API calls. Without it, the `--online` mode will fail.
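A minimal sketch of how the email check could fail early with a clear message, assuming the value is read from the `ENTREZ_EMAIL` environment variable. The helper name `require_entrez_email` is hypothetical and the project's actual handling is not shown here. With python-dotenv, calling `load_dotenv()` beforehand would copy the entries of `.env` into `os.environ`, and the returned value would typically be assigned to BioPython's `Entrez.email`.

```python
import os


def require_entrez_email(env=None):
    """Return the configured NCBI contact email or raise a clear error."""
    # Accept any mapping so the check can be exercised without touching os.environ.
    env = os.environ if env is None else env
    email = env.get("ENTREZ_EMAIL", "").strip()
    if not email or "@" not in email:
        raise RuntimeError(
            "ENTREZ_EMAIL is not set; copy .env.example to .env and add your "
            "email before using --online."
        )
    return email


# Demonstration with an explicit mapping instead of the real environment.
print(require_entrez_email({"ENTREZ_EMAIL": "tu_email@ejemplo.com"}))
```

Checking once at startup, before any BLAST or download call, keeps the failure close to its cause.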
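Run the pipeline:

```bash
python scripts/main_pipeline.py
```

Run it with the optional online homology phase:

```bash
python scripts/main_pipeline.py --online
```

Run the tests:

```bash
pytest tests/ -v
```

Expected output: 29 passed.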
| Library | Usage |
|---|---|
| BioPython ≥1.81 | FASTA parsing, BLAST, Entrez, ProteinAnalysis |
| pandas ≥2.0 | DataFrames and CSV export |
| scikit-learn ≥1.3 | Linear regression |
| seaborn / matplotlib | Correlation heatmap |
| python-dotenv | Secure credentials management |
| pytest | Unit testing |
The sequences included in data/raw/ are public RefSeq records from NCBI:
| Gene | Accession | Function |
|---|---|---|
| TP53 | NM_000546 | Tumor suppressor |
| BRCA1 | NM_007294 | DNA repair |
| APOE | NM_000041 | Lipid metabolism |
| KCNJ1 | NM_001301717 | Potassium channel |
Carlos Garcia Corona