CodeeSam/SARS-CoV-2-Genome-Analysis

SARS-CoV-2 Genome and Protein Sequence Analysis Using Biopython

This repository contains a beginner-friendly bioinformatics workflow for analyzing the SARS-CoV-2 reference genome and related protein sequence data using Python and Biopython.

The project includes basic genome sequence retrieval, nucleotide composition analysis, GC content calculation, transcription, translation, amino acid composition analysis, BLAST-based protein similarity search, and introductory protein structure visualization.

This project was developed as part of my early bioinformatics and computational biology learning journey and is maintained as part of my broader portfolio in computational research.

Project Overview

SARS-CoV-2 is the virus responsible for COVID-19. Genomic and protein sequence analysis provides important insights into viral structure, sequence composition, encoded proteins, and evolutionary or functional relationships.

This project demonstrates how Python-based bioinformatics tools can be used to retrieve, inspect, and analyze viral genome and protein sequence data.

The analysis is performed mainly using Biopython, a widely used Python library for computational biology and bioinformatics.

Objectives

The main objectives of this project are to:

  • retrieve SARS-CoV-2 genome data from NCBI
  • inspect genome sequence records
  • calculate genome length and molecular weight
  • calculate GC content
  • analyze nucleotide distribution
  • transcribe DNA to RNA
  • translate nucleotide sequence into amino acid sequence
  • inspect amino acid composition
  • identify protein sequences encoded in the genome
  • perform BLAST-based protein similarity search
  • explore introductory protein structure analysis and visualization

Repository Structure

SARS-CoV-2-Genome-Analysis/
├── sars_cov_2_genome_analysis.ipynb
├── sars_cov_2_protein_blast_analysis.ipynb
├── README.md
├── requirements.txt
├── .gitignore
└── LICENSE

Files Description

sars_cov_2_genome_analysis.ipynb

This notebook contains the SARS-CoV-2 genome analysis workflow. It retrieves the SARS-CoV-2 reference genome from NCBI using Biopython and performs basic genome-level analysis.

Main analyses include:

  • sequence retrieval from NCBI
  • genome length calculation
  • molecular weight calculation
  • GC content calculation
  • nucleotide count and distribution
  • DNA-to-RNA transcription
  • translation to amino acid sequence
  • amino acid composition analysis
  • identification of protein sequences

sars_cov_2_protein_blast_analysis.ipynb

This notebook contains the protein sequence analysis workflow. It reads a protein sequence from a FASTA file, runs a BLAST search against the Protein Data Bank (PDB) database, inspects the BLAST hits, and introduces protein structure parsing and visualization.

Main analyses include:

  • reading protein sequence from FASTA
  • protein sequence inspection
  • BLASTP search using NCBI tools
  • parsing BLAST results
  • examining sequence alignments
  • introductory protein structure analysis using Biopython

protein_seq.fasta

This FASTA file contains the protein sequence used for the BLAST analysis. The protein BLAST notebook expects this file to be available in the repository directory.

requirements.txt

This file lists the Python packages required to run the notebooks.

Methodology

The project follows a simple bioinformatics workflow.

1. Genome Sequence Retrieval

The SARS-CoV-2 reference genome is retrieved from the NCBI nucleotide database using Biopython's Entrez module.

The genome accession used in the notebook is:

MN908947

This GenBank accession corresponds to the Wuhan-Hu-1 isolate, which is widely used as the SARS-CoV-2 reference genome sequence (its RefSeq counterpart is NC_045512).
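As a minimal sketch of this retrieval step, assuming the notebook calls `Entrez.efetch` and requests the GenBank format (the README names the Entrez module but not the exact call), a helper could look like the following. The function name `fetch_genome` is hypothetical and not taken from the notebook.

```python
from Bio import Entrez, SeqIO


def fetch_genome(accession, email):
    """Fetch one record from the NCBI nucleotide database as a SeqRecord.

    Needs an internet connection. `email` is the contact address that
    NCBI asks Entrez users to supply.
    """
    Entrez.email = email
    handle = Entrez.efetch(
        db="nucleotide", id=accession, rettype="gb", retmode="text"
    )
    try:
        # SeqIO.read expects exactly one record in the handle.
        return SeqIO.read(handle, "genbank")
    finally:
        handle.close()
```

Calling `fetch_genome("MN908947", "your_email@example.com")` should return a record whose `seq` attribute holds the genome and whose `id` and `description` attributes carry the metadata.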

2. Genome Inspection

After retrieval, the genome record is inspected to understand the sequence object, metadata, and genome length.

Basic sequence-level properties are calculated, including:

  • sequence length
  • molecular weight
  • GC content
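The three properties above can be sketched as follows. This is an illustration under assumptions: the helper name `summarize_sequence` is hypothetical, GC content is computed by hand so the example does not depend on a particular Biopython version, and the short test sequence is not taken from the SARS-CoV-2 genome.

```python
from Bio.Seq import Seq
from Bio.SeqUtils import molecular_weight


def summarize_sequence(seq):
    """Return length, GC percentage, and molecular weight of a DNA sequence."""
    text = str(seq).upper()
    gc_count = text.count("G") + text.count("C")
    return {
        "length": len(text),
        "gc_percent": 100.0 * gc_count / len(text) if text else 0.0,
        # Treated as single-stranded DNA by default; ambiguous letters
        # such as N make molecular_weight raise a ValueError.
        "molecular_weight": molecular_weight(seq, seq_type="DNA"),
    }


summary = summarize_sequence(Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"))
print(summary["length"], round(summary["gc_percent"], 2))
```

With a retrieved record, the same helper would be called as `summarize_sequence(record.seq)`.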

3. Nucleotide Composition Analysis

The distribution of the four DNA nucleotides is calculated:

A, T, C, G

This helps provide a simple overview of the nucleotide composition of the viral genome.
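A dependency-free sketch of this count is shown below. The function name `nucleotide_distribution` is hypothetical, and the notebook may instead use the sequence object's own `count` method.

```python
from collections import Counter


def nucleotide_distribution(sequence):
    """Return {base: (count, fraction)} for A, T, C, and G in a DNA string."""
    text = str(sequence).upper()
    counts = Counter(text)
    total = len(text)
    return {
        base: (counts[base], counts[base] / total if total else 0.0)
        for base in "ATCG"
    }


distribution = nucleotide_distribution("ATGGCCATTGTAATGGGCCGC")
for base, (count, fraction) in distribution.items():
    print(f"{base}: {count} ({fraction:.1%})")
```

The counts can then be passed to a plotting call such as `matplotlib.pyplot.bar` to draw the distribution as a bar chart.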

4. Transcription and Translation

The sequence, which the NCBI record stores in the DNA alphabet even though SARS-CoV-2 has an RNA genome, is transcribed into RNA and then translated into an amino acid sequence.

This demonstrates the central dogma workflow computationally:

DNA → RNA → Protein
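The two steps can be illustrated with Biopython's `Seq` object. The short coding sequence below is an illustrative example and is not taken from the SARS-CoV-2 genome.

```python
from Bio.Seq import Seq

# Short illustrative coding sequence (not from the SARS-CoV-2 genome).
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# transcribe() replaces T with U; it does not reverse-complement.
messenger_rna = coding_dna.transcribe()

# translate() uses the standard genetic code by default and writes
# stop codons as "*".
protein = messenger_rna.translate()

print(messenger_rna)
print(protein)
```

When a whole genome is translated in one pass, its length may not be a multiple of three. Biopython then warns about a partial codon, which can be avoided by trimming the trailing one or two bases before translating.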

5. Amino Acid Composition Analysis

The translated sequence is analyzed to count amino acid frequencies. This provides a basic view of the amino acid distribution in the translated viral sequence.
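A minimal sketch of the frequency count follows. It also shows one possible way to pick out candidate protein sequences, which the README lists as an analysis step without describing the method: splitting the translation at stop symbols and keeping the longer fragments. That approach, the function names, and the `min_length` default are assumptions made for illustration.

```python
from collections import Counter


def amino_acid_counts(protein):
    """Count residues in a translated sequence, ignoring stop symbols."""
    return Counter(residue for residue in str(protein) if residue != "*")


def candidate_proteins(protein, min_length=20):
    """Split a translation at "*" and keep fragments of at least min_length."""
    fragments = str(protein).split("*")
    return [fragment for fragment in fragments if len(fragment) >= min_length]


translation = "MAIVMGR*KGAR*"
print(amino_acid_counts(translation).most_common(3))
print(candidate_proteins(translation, min_length=4))
```

A larger `min_length` filters out the many short fragments that a single-frame translation of a whole genome tends to produce.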

6. Protein Sequence Analysis

A protein sequence is read from a FASTA file and analyzed using Biopython.

The workflow includes:

  • reading FASTA sequence
  • inspecting sequence ID and description
  • checking protein sequence length
  • performing BLASTP search
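The reading and inspection steps above might look like the sketch below. It assumes the FASTA file holds exactly one record, as `SeqIO.read` requires. The helper names `load_protein` and `describe` are hypothetical.

```python
from pathlib import Path

from Bio import SeqIO


def load_protein(path="protein_seq.fasta"):
    """Read a single-record protein FASTA file and return the SeqRecord."""
    fasta_path = Path(path)
    if not fasta_path.exists():
        raise FileNotFoundError(
            f"{fasta_path} not found; place it where the notebook expects it"
        )
    return SeqIO.read(str(fasta_path), "fasta")


def describe(record):
    """Return the ID, description, and length inspected in this step."""
    return {
        "id": record.id,
        "description": record.description,
        "length": len(record.seq),
    }
```

For example, `describe(load_protein())` returns a small dictionary that can be printed before the sequence is sent to BLAST.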

7. BLAST Analysis

The protein sequence is searched against the Protein Data Bank database using BLASTP. The BLAST results are parsed and inspected to identify similar protein structures.
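The search and parsing steps can be sketched as below, assuming the notebook uses Biopython's `NCBIWWW.qblast` and `NCBIXML.read` interface (the README does not name the exact calls). The helper names, the E-value cutoff, and the hit limit are hypothetical choices for illustration.

```python
from Bio.Blast import NCBIWWW, NCBIXML


def blast_against_pdb(protein_sequence):
    """Submit a BLASTP query against the pdb database (needs internet)."""
    result_handle = NCBIWWW.qblast("blastp", "pdb", str(protein_sequence))
    try:
        return NCBIXML.read(result_handle)
    finally:
        result_handle.close()


def top_hits(blast_record, e_value_cutoff=0.001, limit=5):
    """Return (title, e-value, score, aligned length) per hit, best first.

    Only the best-scoring HSP (lowest e-value) of each hit is reported.
    """
    hits = []
    for alignment in blast_record.alignments:
        best = min(alignment.hsps, key=lambda hsp: hsp.expect)
        if best.expect <= e_value_cutoff:
            hits.append(
                (alignment.title, best.expect, best.score, best.align_length)
            )
    return sorted(hits, key=lambda hit: hit[1])[:limit]
```

A typical call would be `top_hits(blast_against_pdb(record.seq))`. The hit titles include the PDB identifiers that can be used in the structure step.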

8. Protein Structure Visualization

The workflow introduces protein structure handling using Biopython and visualization tools such as nglview.
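One possible shape for this step is sketched below, assuming the structure is downloaded in the legacy PDB format with `Bio.PDB` and displayed with `nglview.show_biopython`. The helper names are hypothetical, and the PDB identifier would be chosen by the reader from the BLAST hits.

```python
from Bio.PDB import PDBList, PDBParser


def load_structure(pdb_id, directory="."):
    """Download a PDB entry (needs internet) and parse it into a Structure."""
    pdb_path = PDBList().retrieve_pdb_file(
        pdb_id, pdir=directory, file_format="pdb"
    )
    # QUIET=True suppresses warnings raised while parsing the file.
    parser = PDBParser(QUIET=True)
    return parser.get_structure(pdb_id, pdb_path)


def show_structure(structure):
    """Return an nglview widget for display in a Jupyter notebook cell."""
    import nglview  # imported lazily: only useful inside Jupyter

    return nglview.show_biopython(structure)
```

In a notebook, evaluating `show_structure(load_structure(pdb_id))` as the last expression of a cell should render an interactive 3D view, provided the Jupyter widget support for nglview is set up.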

How to Run the Project

Clone the repository:

git clone https://github.com/CodeeSam/SARS-CoV-2-Genome-Analysis.git
cd SARS-CoV-2-Genome-Analysis

If the repository is still published under its old name, use:

git clone https://github.com/CodeeSam/Covid_19_genome_Analysis.git
cd Covid_19_genome_Analysis

Install the required dependencies:

pip install -r requirements.txt

Open Jupyter Notebook:

jupyter notebook

Then open and run the notebooks in order: sars_cov_2_genome_analysis.ipynb first, then sars_cov_2_protein_blast_analysis.ipynb.

Requirements

The main Python packages used in this project, which are also the entries a typical requirements.txt file for it would include, are:

biopython
matplotlib
nglview
jupyter

Depending on your local setup, nglview may require additional Jupyter widget configuration.

Important Notes

NCBI Entrez Email

The notebook uses Biopython's Entrez module to retrieve data from NCBI. NCBI requires users to provide an email address when using Entrez.

Before running the notebook, update this line with your own email address:

Entrez.email = "your_email@example.com"

Internet Connection Required

The genome retrieval and BLAST search steps require an internet connection because they query NCBI servers.

BLAST Runtime

BLAST searches may take some time depending on NCBI server load and internet connection speed.

Dataset Availability

The genome sequence is retrieved from NCBI, so a separate genome file is not required for the first notebook.

However, the protein BLAST notebook expects a FASTA file named:

protein_seq.fasta

Make sure this file is present in the repository before running the protein analysis notebook.

Example Workflow

NCBI Genome Retrieval → Genome Inspection → Nucleotide Analysis → Transcription → Translation → Amino Acid Analysis

For the protein notebook:

Protein FASTA File → BLASTP Search → BLAST Result Parsing → Protein Structure Exploration

Project Note

This repository represents one of my early bioinformatics practice projects. It is maintained as part of my computational biology and bioinformatics learning archive.

The project demonstrates foundational skills in biological sequence analysis using Python and Biopython.

Applications

This type of project can be useful as a starting point for:

  • beginner bioinformatics practice
  • viral genome analysis
  • biological sequence analysis
  • Biopython learning
  • computational biology training
  • NCBI Entrez and BLAST workflow practice

Recommended Repository Improvements

For better organization, the repository could later be restructured as follows:

SARS-CoV-2-Genome-Analysis/
├── data/
│   └── protein_seq.fasta
├── notebooks/
│   ├── sars_cov_2_genome_analysis.ipynb
│   └── sars_cov_2_protein_blast_analysis.ipynb
├── results/
│   └── figures/
├── README.md
├── requirements.txt
├── .gitignore
└── LICENSE

Disclaimer

This repository is for educational and computational biology learning purposes only. It is not intended for clinical, diagnostic, epidemiological, or public health decision-making.

License

This repository is released under the MIT License.
