SARS-CoV-2 Genome and Protein Sequence Analysis Using Biopython

This repository contains a beginner-friendly bioinformatics workflow for analyzing the SARS-CoV-2 reference genome and related protein sequence data using Python and Biopython.

The project includes basic genome sequence retrieval, nucleotide composition analysis, GC content calculation, transcription, translation, amino acid composition analysis, BLAST-based protein similarity search, and introductory protein structure visualization.

I developed this project early in my bioinformatics and computational biology learning, and I maintain it as part of my broader computational research portfolio.

Project Overview

SARS-CoV-2 is the virus responsible for COVID-19. Genomic and protein sequence analysis provides important insights into viral structure, sequence composition, encoded proteins, and evolutionary or functional relationships.

This project demonstrates how Python-based bioinformatics tools can be used to retrieve, inspect, and analyze viral genome and protein sequence data.

The analysis is performed mainly using Biopython, a widely used Python library for computational biology and bioinformatics.

Objectives

The main objectives of this project are to:

retrieve SARS-CoV-2 genome data from NCBI
inspect genome sequence records
calculate genome length and molecular weight
calculate GC content
analyze nucleotide distribution
transcribe DNA to RNA
translate nucleotide sequence into amino acid sequence
inspect amino acid composition
identify protein sequences encoded in the genome
perform BLAST-based protein similarity search
explore introductory protein structure analysis and visualization

Repository Structure

SARS-CoV-2-Genome-Analysis/
├── sars_cov_2_genome_analysis.ipynb
├── sars_cov_2_protein_blast_analysis.ipynb
├── README.md
├── requirements.txt
├── .gitignore
└── LICENSE

File Descriptions

`sars_cov_2_genome_analysis.ipynb`

This notebook contains the SARS-CoV-2 genome analysis workflow. It retrieves the SARS-CoV-2 reference genome from NCBI using Biopython and performs basic genome-level analysis.

Main analyses include:

sequence retrieval from NCBI
genome length calculation
molecular weight calculation
GC content calculation
nucleotide count and distribution
DNA-to-RNA transcription
translation to amino acid sequence
amino acid composition analysis
identification of protein sequences

`sars_cov_2_protein_blast_analysis.ipynb`

This notebook contains the protein sequence analysis, also built on Biopython. It reads a protein sequence from a FASTA file, runs a BLAST search against the PDB database, inspects the BLAST hits, and introduces protein structure parsing and visualization.

Main analyses include:

reading protein sequence from FASTA
protein sequence inspection
BLASTP search using NCBI tools
parsing BLAST results
examining sequence alignments
introductory protein structure analysis using Biopython

`protein_seq.fasta`

This FASTA file contains the protein sequence used for the BLAST analysis. The protein BLAST notebook expects this file to be available in the repository directory.

`requirements.txt`

This file lists the Python packages required to run the notebooks.

Methodology

The project follows a simple bioinformatics workflow.

1. Genome Sequence Retrieval

The SARS-CoV-2 reference genome is retrieved from the NCBI nucleotide database using Biopython's Entrez module.

The genome accession used in the notebook is:

MN908947

This accession is the GenBank record for the Wuhan-Hu-1 isolate, which is widely used as the SARS-CoV-2 reference genome.
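As a minimal sketch of this retrieval step, Biopython's `Entrez.efetch` can download the record and `SeqIO.read` can parse it. The notebook's own code is not reproduced in this README, so the helper name `fetch_genome_record` and the choice of GenBank format are assumptions made for illustration.

```python
ACCESSION = "MN908947"  # accession named in this README


def fetch_genome_record(accession, email):
    """Download one GenBank record from NCBI and return it as a SeqRecord.

    Hypothetical helper for illustration; requires Biopython and network access.
    """
    # Imported here so the function can be defined without Biopython installed.
    from Bio import Entrez, SeqIO

    Entrez.email = email  # NCBI asks every Entrez user to identify themselves
    handle = Entrez.efetch(
        db="nucleotide", id=accession, rettype="gb", retmode="text"
    )
    try:
        return SeqIO.read(handle, "genbank")
    finally:
        handle.close()
```

Calling `fetch_genome_record(ACCESSION, "your_email@example.com")` contacts NCBI. The returned record exposes the sequence as `record.seq` and metadata such as `record.id` and `record.description`.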

2. Genome Inspection

After retrieval, the genome record is inspected to understand the sequence object, metadata, and genome length.

Basic sequence-level properties are calculated, including:

sequence length
molecular weight
GC content
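To make the GC content calculation explicit, here is a dependency-free sketch on a made-up toy sequence. The function name `gc_content` is hypothetical. The notebook presumably relies on Biopython helpers instead (for example `gc_fraction` and `molecular_weight` in `Bio.SeqUtils`), which is an assumption. Molecular weight is not reimplemented here because it depends on a table of monomer masses.

```python
def gc_content(sequence):
    """Return the fraction (0 to 1) of G and C among the A, C, G, T/U bases."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T") + seq.count("U")
    total = gc + at  # ambiguous symbols such as N are ignored
    return gc / total if total else 0.0


toy_sequence = "ATGGCGTACGCTTAA"  # made-up 15-nt example, not viral data
print("length:", len(toy_sequence))                                   # length: 15
print("GC content: {:.1f}%".format(100 * gc_content(toy_sequence)))   # GC content: 46.7%
```

For the real genome, the same function can be applied to `str(record.seq)`, where `record` is the retrieved sequence record.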

3. Nucleotide Composition Analysis

The distribution of the four DNA nucleotides is calculated:

A, T, C, G

This gives a simple overview of the viral genome's nucleotide composition.
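The counting step can be sketched with the standard library's `collections.Counter`. The function name `nucleotide_distribution` and the toy sequence are illustrative assumptions. With Biopython, `record.seq.count("A")` and similar calls would give the same counts.

```python
from collections import Counter


def nucleotide_distribution(sequence):
    """Return {base: (count, fraction)} for A, T, C and G."""
    counts = Counter(sequence.upper())
    total = sum(counts[base] for base in "ATCG")
    return {
        base: (counts[base], counts[base] / total if total else 0.0)
        for base in "ATCG"
    }


toy_sequence = "ATGGCGTACGCTTAA"  # made-up example, not viral data
for base, (count, fraction) in nucleotide_distribution(toy_sequence).items():
    print(base, count, "{:.1%}".format(fraction))
```

The resulting dictionary can be passed straight to a plotting library such as matplotlib to draw a bar chart of the four counts.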

4. Transcription and Translation

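Although SARS-CoV-2 has an RNA genome, NCBI stores the sequence in the DNA alphabet. The notebook transcribes this DNA sequence into RNA and translates it into an amino acid sequence.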

This demonstrates the central dogma workflow computationally:

DNA → RNA → Protein
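The two steps can be illustrated without any dependencies. The sketch below builds the standard genetic code (NCBI translation table 1) and applies it to a made-up toy sequence. The function names are hypothetical. In Biopython the equivalent calls are the `transcribe()` and `translate()` methods of a `Seq` object.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Rewrite a coding-strand DNA sequence in the RNA alphabet (T -> U)."""
    return dna.upper().replace("T", "U")


def translate(nucleotides):
    """Translate DNA or RNA in reading frame 1; '*' marks a stop codon."""
    dna = nucleotides.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3  # drop a trailing partial codon
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # X for codons with ambiguous bases
        for i in range(0, usable, 3)
    )


toy_dna = "ATGGCGTACGCTTAA"  # made-up example, not viral data
toy_rna = transcribe(toy_dna)
print(toy_rna)             # AUGGCGUACGCUUAA
print(translate(toy_rna))  # MAYA*
```

Translating a whole genome in a single reading frame, as in this sketch, yields one long string in which `*` symbols mark stop codons. It does not by itself identify the individual viral proteins.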

5. Amino Acid Composition Analysis

The translated sequence is analyzed to count amino acid frequencies. This provides a basic view of the amino acid distribution in the translated viral sequence.
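A short sketch of the counting step, again using `collections.Counter`. The second helper splits the translation at stop symbols, which is one common way to obtain candidate protein fragments. Whether the notebook does exactly this is an assumption, and both function names are hypothetical.

```python
from collections import Counter


def amino_acid_counts(protein):
    """Count each amino acid letter, ignoring the '*' stop symbol."""
    return Counter(residue for residue in protein if residue != "*")


def split_at_stops(protein, min_length=1):
    """Split a translation at stop symbols; keep fragments of at least min_length."""
    return [frag for frag in protein.split("*") if len(frag) >= min_length]


toy_translation = "MAYA*MK*"  # made-up example, not viral data
print(amino_acid_counts(toy_translation))
print(split_at_stops(toy_translation, min_length=3))
```

Raising `min_length` filters out the many very short fragments that a single-frame translation of a full genome typically produces.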

6. Protein Sequence Analysis

A protein sequence is read from a FASTA file and analyzed using Biopython.

The workflow includes:

reading FASTA sequence
inspecting sequence ID and description
checking protein sequence length
performing BLASTP search
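With Biopython, a single-record file is typically read with `SeqIO.read("protein_seq.fasta", "fasta")`, which returns a record with `id`, `description`, and `seq` attributes. To show what that parsing involves, here is a dependency-free sketch. The function name `read_single_fasta` and the toy record are assumptions made for illustration.

```python
import io


def read_single_fasta(handle):
    """Parse exactly one FASTA record; return (id, description, sequence)."""
    header = None
    chunks = []
    for raw_line in handle:
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                raise ValueError("expected exactly one FASTA record")
            header = line[1:].strip()
        elif header is None:
            raise ValueError("sequence data found before a '>' header line")
        else:
            chunks.append(line)
    if header is None:
        raise ValueError("no FASTA record found")
    tokens = header.split(None, 1)
    seq_id = tokens[0] if tokens else ""  # the ID is the first word of the header
    return seq_id, header, "".join(chunks)


# Made-up record standing in for protein_seq.fasta.
toy_fasta = io.StringIO(">toy_protein hypothetical example record\nMAYAMK\nLLV\n")
seq_id, description, sequence = read_single_fasta(toy_fasta)
print(seq_id, len(sequence))  # toy_protein 9
```

For the real file, pass an open file handle instead of the `StringIO` object, for example inside `with open("protein_seq.fasta") as handle:`.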

7. BLAST Analysis

The protein sequence is searched against the Protein Data Bank (PDB) database using BLASTP. The BLAST results are then parsed and inspected to identify proteins of known structure whose sequences are similar to the query.
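A hedged sketch of this step using Biopython's classic `NCBIWWW.qblast` and `NCBIXML.read` interface. The helper names, the number of hits kept, and the E-value cutoff of 0.001 are illustrative assumptions, not values taken from the notebook.

```python
def percent_identity(identities, align_length):
    """Percentage of identical positions within an aligned region."""
    return 100.0 * identities / align_length if align_length else 0.0


def blast_against_pdb(protein_sequence, max_hits=5, e_value_cutoff=0.001):
    """Run BLASTP against PDB at NCBI and summarize the top hits.

    Hypothetical helper; requires Biopython and network access.
    """
    # Imported here so the helpers can be defined without Biopython installed.
    from Bio.Blast import NCBIWWW, NCBIXML

    result_handle = NCBIWWW.qblast("blastp", "pdb", protein_sequence)
    try:
        blast_record = NCBIXML.read(result_handle)  # one query, one record
    finally:
        result_handle.close()

    summaries = []
    for alignment in blast_record.alignments[:max_hits]:
        first_hsp = alignment.hsps[0]  # first reported segment pair of this hit
        if first_hsp.expect <= e_value_cutoff:
            summaries.append({
                "title": alignment.title,
                "e_value": first_hsp.expect,
                "identity_percent": percent_identity(
                    first_hsp.identities, first_hsp.align_length
                ),
            })
    return summaries
```

Each summary pairs a hit title with its E-value and percent identity, which is usually enough to decide which PDB entry to inspect further.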

8. Protein Structure Visualization

The workflow introduces protein structure handling using Biopython and visualization tools such as nglview.
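As a minimal sketch, assuming a PDB-format file for one of the BLAST hits has already been downloaded locally, Biopython's `PDBParser` can load it and `nglview.show_biopython` can display it in a notebook. The helper names below are hypothetical.

```python
def load_structure(pdb_path, structure_id="blast_hit"):
    """Parse a local PDB-format file into a Biopython Structure object."""
    # Imported here so the helpers can be defined without Biopython installed.
    from Bio.PDB import PDBParser

    parser = PDBParser(QUIET=True)  # QUIET suppresses parser warnings
    return parser.get_structure(structure_id, pdb_path)


def residues_per_chain(structure):
    """Count residues in each chain of the first model.

    The count includes waters and other hetero residues stored in the chain.
    """
    first_model = structure[0]
    return {
        chain.id: sum(1 for _ in chain.get_residues())
        for chain in first_model
    }


def show_structure(structure):
    """Return an interactive nglview widget (Jupyter only)."""
    import nglview

    return nglview.show_biopython(structure)
```

In a notebook cell, `show_structure(load_structure("some_hit.pdb"))` would render the widget, where `some_hit.pdb` is a placeholder file name.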

How to Run the Project

Clone the repository:

git clone https://github.com/CodeeSam/SARS-CoV-2-Genome-Analysis.git
cd SARS-CoV-2-Genome-Analysis

If the repository is still published under its old name, use:

git clone https://github.com/CodeeSam/Covid_19_genome_Analysis.git
cd Covid_19_genome_Analysis

Install the required dependencies:

pip install -r requirements.txt

Open Jupyter Notebook:

jupyter notebook

Then open and run the notebooks in order: `sars_cov_2_genome_analysis.ipynb` first, then `sars_cov_2_protein_blast_analysis.ipynb`.

Requirements

The main Python packages used in this project include:

biopython
matplotlib
nglview
jupyter

A minimal requirements.txt lists the same four packages, one per line.

Depending on your local setup, nglview may require additional Jupyter widget configuration.

Important Notes

NCBI Entrez Email

The notebook uses Biopython's Entrez module to retrieve data from NCBI. NCBI requires users to provide an email address when using Entrez.

Before running the notebook, update this line with your own email address:

Entrez.email = "your_email@example.com"
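One optional way to avoid hard-coding an address in a shared notebook is to read it from an environment variable. This is a sketch only: the variable name `NCBI_EMAIL` and the helper `resolve_entrez_email` are assumptions, not part of Biopython or of this repository.

```python
import os

PLACEHOLDER_EMAIL = "your_email@example.com"  # placeholder shown in this README


def resolve_entrez_email(environ=os.environ, variable="NCBI_EMAIL"):
    """Return the email stored in an environment variable, or raise if unusable."""
    email = environ.get(variable, "").strip()
    if "@" not in email or email == PLACEHOLDER_EMAIL:
        raise ValueError("set {} to your own email address".format(variable))
    return email
```

In the notebook this would be used as `Entrez.email = resolve_entrez_email()`, after exporting the variable in the shell that starts Jupyter.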

Internet Connection Required

The genome retrieval and BLAST search steps require an internet connection because they query NCBI servers.

BLAST Runtime

BLAST searches may take some time depending on NCBI server load and internet connection speed.

Dataset Availability

The genome sequence is retrieved from NCBI, so a separate genome file is not required for the first notebook.

However, the protein BLAST notebook expects a FASTA file named:

protein_seq.fasta

Make sure this file is present in the repository before running the protein analysis notebook.

Example Workflow

NCBI Genome Retrieval → Genome Inspection → Nucleotide Analysis → Transcription → Translation → Amino Acid Analysis

For the protein notebook:

Protein FASTA File → BLASTP Search → BLAST Result Parsing → Protein Structure Exploration

Project Note

This repository represents one of my early bioinformatics practice projects. It is maintained as part of my computational biology and bioinformatics learning archive.

The project demonstrates foundational skills in biological sequence analysis using Python and Biopython.

Applications

This type of project can be useful as a starting point for:

beginner bioinformatics practice
viral genome analysis
biological sequence analysis
Biopython learning
computational biology training
NCBI Entrez and BLAST workflow practice

Recommended Repository Improvements

For better organization, the repository could later be restructured as follows:

SARS-CoV-2-Genome-Analysis/
├── data/
│   └── protein_seq.fasta
├── notebooks/
│   ├── sars_cov_2_genome_analysis.ipynb
│   └── sars_cov_2_protein_blast_analysis.ipynb
├── results/
│   └── figures/
├── README.md
├── requirements.txt
├── .gitignore
└── LICENSE

Disclaimer

This repository is for educational and computational biology learning purposes only. It is not intended for clinical, diagnostic, epidemiological, or public health decision-making.

License

This repository is released under the MIT License.
