This repository contains machine learning projects applied to different biological domains.
- AlphaGenome: Gene Expression Regression
- MergeDNA: Masked Language Modelling (MLM) Pre-training
- NucleotideTransformer: Gene Family Classification
A notebook evaluating AlphaGenome's RNA-seq scores for cancer-associated gene variants against experimentally observed RNA-seq data from lung cancer tissue samples. The aim is to determine whether AlphaGenome can be used to identify cancer vaccine targets that are likely to be poorly expressed.
See here: alpha_genome_performance.ipynb
A notebook demonstrating an implementation of the MergeDNA paper.
See here: merge_dna_demo.ipynb
This implementation is written entirely in PyTorch and includes:
- The MergeDNA architecture:
  - Local Encoder with windowed attention and windowed DTEM.
  - Latent Encoder with global attention and BSM for the latent reconstruction task.
  - Latent Decoder with global attention.
  - Local Decoder with windowed attention.
- Multi-objective pretraining on the NT Genome Multi-Species dataset:
  - Full sequence reconstruction from local encoder compressed embeddings.
  - Full sequence reconstruction from latent encoder compressed embeddings (frozen local encoder).
  - Adaptive masked token modelling derived from latent encoder source matrix.
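
To illustrate the difference between the windowed attention used in the local encoder/decoder and the global attention used in the latent encoder/decoder, here is a minimal, dependency-free sketch of a window mask. It assumes non-overlapping, fixed-size windows; the function name `window_attention_mask` and both parameters are hypothetical, and the notebook's actual windowing scheme may differ.

```python
def window_attention_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Build a boolean mask for non-overlapping windowed attention.

    mask[i][j] is True when position i may attend to position j,
    i.e. when both positions fall inside the same window.
    Assumption: windows are fixed-size and do not overlap.
    """
    if window <= 0:
        raise ValueError("window must be a positive integer")
    return [
        [(i // window) == (j // window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Six positions split into two windows of three positions each.
mask = window_attention_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Global attention corresponds to a mask that is True everywhere. In PyTorch, a boolean mask of this shape (converted to a tensor) can be passed as `attn_mask` to `torch.nn.functional.scaled_dot_product_attention`, where True marks positions that take part in attention.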
We assess how well two models classify a given gene’s DNA sequence into the correct gene family: a “naive” k-mer count logistic regression (KCLR) model and the “refined” Nucleotide Transformer 50M (NT50M) model fine-tuned with IA3.
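
As a rough illustration of the k-mer count features behind a model like KCLR, the sketch below turns a DNA sequence into a fixed-length count vector. The function name `kmer_counts`, the choice of `k`, and the `ACGT` alphabet are assumptions made for this example; the scripts in `gene_family/scripts` may build their features differently.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq: str, k: int = 3, alphabet: str = "ACGT") -> list[int]:
    """Count overlapping k-mers in `seq` and return a fixed-length vector.

    The vector has one entry per possible k-mer over `alphabet`, in
    lexicographic order. k-mers containing other symbols (e.g. N) are ignored.
    """
    seq = seq.upper()
    vocab = ["".join(p) for p in product(alphabet, repeat=k)]
    # Slide a window of width k over the sequence, one base at a time.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts.get(kmer, 0) for kmer in vocab]


features = kmer_counts("ACGTAC", k=2)
print(len(features))  # 16 possible 2-mers over ACGT
print(sum(features))  # 5 overlapping 2-mers in a 6-base sequence
```

Vectors like these, one per gene, could then be passed to any standard logistic regression implementation, for example scikit-learn's `LogisticRegression`.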
See data analysis: dna_seq_families_analysis.ipynb
See code: gene_family/scripts
See report: gene_family_report.pdf
🚧 Coming Soon 🚧