This repository contains a beginner-friendly bioinformatics workflow for analyzing the SARS-CoV-2 reference genome and related protein sequence data using Python and Biopython.
The project includes basic genome sequence retrieval, nucleotide composition analysis, GC content calculation, transcription, translation, amino acid composition analysis, BLAST-based protein similarity search, and introductory protein structure visualization.
This project was developed as part of my early bioinformatics and computational biology learning journey and is maintained as part of my broader portfolio in computational research.
SARS-CoV-2 is the virus responsible for COVID-19. Genomic and protein sequence analysis provides important insights into viral structure, sequence composition, encoded proteins, and evolutionary or functional relationships.
This project demonstrates how Python-based bioinformatics tools can be used to retrieve, inspect, and analyze viral genome and protein sequence data.
The analysis is performed mainly using Biopython, a widely used Python library for computational biology and bioinformatics.
The main objectives of this project are to:
- retrieve SARS-CoV-2 genome data from NCBI
- inspect genome sequence records
- calculate genome length and molecular weight
- calculate GC content
- analyze nucleotide distribution
- transcribe DNA to RNA
- translate nucleotide sequence into amino acid sequence
- inspect amino acid composition
- identify protein sequences encoded in the genome
- perform BLAST-based protein similarity search
- explore introductory protein structure analysis and visualization
SARS-CoV-2-Genome-Analysis/
├── sars_cov_2_genome_analysis.ipynb
├── sars_cov_2_protein_blast_analysis.ipynb
├── README.md
├── requirements.txt
├── .gitignore
└── LICENSE
sars_cov_2_genome_analysis.ipynb contains the SARS-CoV-2 genome analysis workflow. It retrieves the SARS-CoV-2 reference genome from NCBI using Biopython and performs basic genome-level analysis.
Main analyses include:
- sequence retrieval from NCBI
- genome length calculation
- molecular weight calculation
- GC content calculation
- nucleotide count and distribution
- DNA-to-RNA transcription
- translation to amino acid sequence
- amino acid composition analysis
- identification of protein sequences
sars_cov_2_protein_blast_analysis.ipynb contains the protein sequence analysis. It reads a protein sequence from a FASTA file, runs a BLAST search against the PDB database, inspects the BLAST hits, and introduces protein structure parsing and visualization.
Main analyses include:
- reading protein sequence from FASTA
- protein sequence inspection
- BLASTP search using NCBI tools
- parsing BLAST results
- examining sequence alignments
- introductory protein structure analysis using Biopython
protein_seq.fasta contains the protein sequence used for the BLAST analysis. The protein BLAST notebook expects this file to be available in the repository directory.
requirements.txt lists the Python packages required to run the notebooks.
The project follows a simple bioinformatics workflow.
The SARS-CoV-2 reference genome is retrieved from the NCBI nucleotide database using Biopython's Entrez module.
The genome accession used in the notebook is:
MN908947
This GenBank accession is the Wuhan-Hu-1 isolate, which is widely used as the SARS-CoV-2 reference genome sequence (the corresponding RefSeq record is NC_045512).
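A minimal sketch of this retrieval step, assuming Biopython is installed and NCBI is reachable. The helper name `fetch_genome` is hypothetical and not taken from the notebook; `Entrez.efetch` and `SeqIO.read` are the Biopython calls commonly used for this kind of request.

```python
from Bio import Entrez, SeqIO


def fetch_genome(accession, email):
    """Download one GenBank record from the NCBI nucleotide database.

    Hypothetical helper for illustration; the notebook may structure this differently.
    """
    Entrez.email = email  # NCBI asks Entrez users to identify themselves
    handle = Entrez.efetch(
        db="nucleotide", id=accession, rettype="gb", retmode="text"
    )
    try:
        # SeqIO.read expects exactly one record in the handle
        return SeqIO.read(handle, "genbank")
    finally:
        handle.close()


# Example call (needs an internet connection, so it is left commented out):
# record = fetch_genome("MN908947", "your_email@example.com")
# print(record.id, record.description, len(record.seq))
```

The returned object is a `SeqRecord`, so the sequence itself is available as `record.seq` and the metadata through attributes such as `record.id` and `record.description`.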
After retrieval, the genome record is inspected to understand the sequence object, metadata, and genome length.
Basic sequence-level properties are calculated, including:
- sequence length
- molecular weight
- GC content
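GC content is the share of G and C bases in the sequence. The plain-Python sketch below shows the arithmetic on a short made-up sequence. The notebook itself may rely on Biopython's `Bio.SeqUtils` helpers instead, and the function name `gc_content` is only illustrative.

```python
def gc_content(sequence):
    """Return the percentage of G and C bases in a nucleotide sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0  # avoid dividing by zero for an empty sequence
    gc_count = sequence.count("G") + sequence.count("C")
    return 100 * gc_count / len(sequence)


toy_sequence = "ATGGCCATTGTAATGGGCCGCTGA"  # made-up example, not viral data
print("Length:", len(toy_sequence))
print("GC content (%):", round(gc_content(toy_sequence), 2))
```

Molecular weight is left out of this sketch because it depends on per-base weight tables, which is the kind of detail a library function handles more reliably than hand-written code.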
The distribution of the four DNA nucleotides is calculated:
A, T, C, G
This helps provide a simple overview of the nucleotide composition of the viral genome.
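One way to compute such a distribution is with `collections.Counter` from the standard library. This is a sketch under the assumption that only the four unambiguous bases matter; the function name `nucleotide_distribution` is hypothetical.

```python
from collections import Counter


def nucleotide_distribution(sequence):
    """Return {base: (count, percentage)} for A, T, C and G.

    Percentages are relative to the total of these four bases, so
    ambiguous symbols such as N are ignored.
    """
    counts = Counter(str(sequence).upper())
    total = sum(counts[base] for base in "ATCG")
    return {
        base: (counts[base], 100 * counts[base] / total if total else 0.0)
        for base in "ATCG"
    }


toy_sequence = "ATGGCCATTGTAATGGGCCGCTGA"  # made-up example, not viral data
for base, (count, percent) in nucleotide_distribution(toy_sequence).items():
    print(f"{base}: {count} ({percent:.1f}%)")
```

The resulting dictionary can be passed to matplotlib to draw a bar chart of the counts.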
The DNA sequence is transcribed into RNA and then translated into an amino acid sequence.
This demonstrates the central dogma workflow computationally:
DNA → RNA → Protein
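The two steps can be illustrated with Biopython's `Seq` object. The short sequence below is a made-up example rather than part of the viral genome, and the variable names are illustrative.

```python
from Bio.Seq import Seq

coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGA")  # made-up example sequence

# Transcription: T is replaced by U
messenger_rna = coding_dna.transcribe()

# Translation with the standard genetic code; "*" marks a stop codon
protein = messenger_rna.translate()

# Stop at the first stop codon and drop the "*" symbol
protein_to_stop = messenger_rna.translate(to_stop=True)

print(messenger_rna)
print(protein)
print(protein_to_stop)
```

When a whole genome is translated in a single reading frame, the result contains many `*` symbols, and Biopython emits a warning if the sequence length is not a multiple of three.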
The translated sequence is analyzed to count amino acid frequencies. This provides a basic view of the amino acid distribution in the translated viral sequence.
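A small sketch of the frequency count, assuming the translated sequence is available as a string in which `*` marks stop codons. The function name `amino_acid_frequencies` is hypothetical.

```python
from collections import Counter


def amino_acid_frequencies(protein):
    """Return {residue: fraction}, most common first, ignoring stop symbols."""
    residues = [aa for aa in str(protein) if aa != "*"]
    counts = Counter(residues)
    total = len(residues)
    # most_common() orders residues from most to least frequent
    return {aa: count / total for aa, count in counts.most_common()}


frequencies = amino_acid_frequencies("MAIVMGR*")  # made-up example
for residue, fraction in frequencies.items():
    print(f"{residue}: {fraction:.2%}")
```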
A protein sequence is read from a FASTA file and analyzed using Biopython.
The workflow includes:
- reading FASTA sequence
- inspecting sequence ID and description
- checking protein sequence length
- performing BLASTP search
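The FASTA reading and inspection steps might look like the sketch below. To keep it self-contained it parses an in-memory example instead of `protein_seq.fasta`; the record name and sequence are invented for illustration.

```python
from io import StringIO

from Bio import SeqIO

# Invented FASTA content standing in for protein_seq.fasta
example_fasta = StringIO(
    ">example_protein hypothetical demo sequence\n"
    "MAIVMGRKGAR\n"
)

# SeqIO.read expects exactly one record; use SeqIO.parse for several
record = SeqIO.read(example_fasta, "fasta")

print("ID:", record.id)
print("Description:", record.description)
print("Length:", len(record.seq))
```

With the real file, `SeqIO.read("protein_seq.fasta", "fasta")` takes the path directly.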
The protein sequence is searched against the Protein Data Bank database using BLASTP. The BLAST results are parsed and inspected to identify similar protein structures.
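A hedged sketch of the search and parsing step, using Biopython's `NCBIWWW.qblast` and `NCBIXML.read`. The function name `blast_against_pdb` and the `max_hits` parameter are hypothetical, and the notebook may inspect different fields.

```python
from Bio.Blast import NCBIWWW, NCBIXML


def blast_against_pdb(protein_sequence, max_hits=5):
    """Run BLASTP against the PDB database and summarize the top hits.

    Hypothetical helper for illustration. It sends the sequence to NCBI,
    so it needs an internet connection and can take a while to return.
    """
    result_handle = NCBIWWW.qblast("blastp", "pdb", protein_sequence)
    try:
        blast_record = NCBIXML.read(result_handle)
    finally:
        result_handle.close()

    hits = []
    for alignment in blast_record.alignments[:max_hits]:
        first_hsp = alignment.hsps[0]  # first reported high-scoring pair
        hits.append(
            (alignment.title, alignment.length, first_hsp.expect, first_hsp.identities)
        )
    return hits


# Example call (queries NCBI, so it is left commented out):
# for title, length, e_value, identities in blast_against_pdb(str(record.seq)):
#     print(f"{title[:60]}  length={length}  E={e_value:.2e}  identities={identities}")
```

Lower E-values indicate matches that are less likely to occur by chance, which makes them a common first filter when scanning hits.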
The workflow introduces protein structure handling using Biopython and visualization tools such as nglview.
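As a rough sketch of the structure-handling step, the function below parses a local PDB file with `Bio.PDB.PDBParser` and counts residues per chain. The function name `summarize_structure` and the file path in the example call are assumptions, not taken from the notebook.

```python
from Bio.PDB import PDBParser


def summarize_structure(pdb_path, structure_id="example"):
    """Return {chain_id: residue_count} for the first model in a PDB file."""
    parser = PDBParser(QUIET=True)  # QUIET suppresses parser warnings
    structure = parser.get_structure(structure_id, pdb_path)
    first_model = structure[0]
    return {chain.id: len(list(chain.get_residues())) for chain in first_model}


# Example call, assuming a PDB file has already been downloaded locally
# (the file name is a placeholder):
# print(summarize_structure("downloaded_structure.pdb"))
#
# In a Jupyter notebook, a parsed structure can typically be displayed with
# nglview, for example nglview.show_biopython(structure).
```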
Clone the repository:
git clone https://github.com/CodeeSam/SARS-CoV-2-Genome-Analysis.git
cd SARS-CoV-2-Genome-Analysis
If you keep the old repository name, use:
git clone https://github.com/CodeeSam/Covid_19_genome_Analysis.git
cd Covid_19_genome_Analysis
Install the required dependencies:
pip install -r requirements.txt
Open Jupyter Notebook:
jupyter notebook
Then open and run the notebooks in order.
The main Python packages used in this project include:
biopython
matplotlib
nglview
jupyter
A typical requirements.txt file may include:
biopython
matplotlib
nglview
jupyter
Depending on your local setup, nglview may require additional Jupyter widget configuration.
The notebook uses Biopython's Entrez module to retrieve data from NCBI. NCBI requires users to provide an email address when using Entrez.
Before running the notebook, update this line with your own email address:
Entrez.email = "your_email@example.com"
The genome retrieval and BLAST search steps require an internet connection because they query NCBI servers.
BLAST searches may take some time depending on NCBI server load and internet connection speed.
The genome sequence is retrieved from NCBI, so a separate genome file is not required for the first notebook.
However, the protein BLAST notebook expects a FASTA file named:
protein_seq.fasta
Make sure this file is present in the repository before running the protein analysis notebook.
NCBI Genome Retrieval → Genome Inspection → Nucleotide Analysis → Transcription → Translation → Amino Acid Analysis
For the protein notebook:
Protein FASTA File → BLASTP Search → BLAST Result Parsing → Protein Structure Exploration
This repository represents one of my early bioinformatics practice projects. It is maintained as part of my computational biology and bioinformatics learning archive.
The project demonstrates foundational skills in biological sequence analysis using Python and Biopython.
This type of project can be useful as a starting point for:
- beginner bioinformatics practice
- viral genome analysis
- biological sequence analysis
- Biopython learning
- computational biology training
- NCBI Entrez and BLAST workflow practice
For better organization, the repository could later be restructured as follows:
SARS-CoV-2-Genome-Analysis/
├── data/
│ └── protein_seq.fasta
├── notebooks/
│ ├── sars_cov_2_genome_analysis.ipynb
│ └── sars_cov_2_protein_blast_analysis.ipynb
├── results/
│ └── figures/
├── README.md
├── requirements.txt
├── .gitignore
└── LICENSE
This repository is for educational and computational biology learning purposes only. It is not intended for clinical, diagnostic, epidemiological, or public health decision-making.
This repository is released under the MIT License.