AsiaSzych/Tree_of_Life

Tree_of_Life

This project is dedicated to solving one of the Nifty Assignments from 2025 - Building The Tree of Life from Scratch, created by Christopher Tralie, PhD.

This assignment explores how to reconstruct and compare phylogenetic trees from DNA sequences, simulating how evolutionary relationships are inferred. The assignment is divided into three logical parts:

  • computing pairwise genetic distances using the Needleman-Wunsch algorithm
  • building phylogenetic trees using single linkage clustering
  • identifying clusters that represent closely related groups
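As an illustration of the first step, here is a minimal sketch of Needleman-Wunsch global alignment scoring. It is not the repository's implementation: the function names, the `simple_score` scoring rule (+1 match, -1 mismatch) and the gap penalty of -2 are assumptions chosen for the example, whereas the project itself scores pairs with BLOSUM tables.

```python
def needleman_wunsch_score(a: str, b: str, score, gap: int = -2) -> int:
    """Return the optimal global alignment score of sequences a and b."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            curr[j] = max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align two symbols
                prev[j] + gap,                            # gap in b
                curr[j - 1] + gap,                        # gap in a
            )
        prev = curr
    return prev[-1]


def simple_score(x: str, y: str) -> int:
    # Hypothetical scoring rule standing in for a BLOSUM lookup.
    return 1 if x == y else -1


print(needleman_wunsch_score("AC", "AGC", simple_score))  # 0
```

Only two rows of the dynamic programming table are kept in memory, which is enough when only the score (and not the alignment itself) is needed.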

In addition to the official scope, I added a task to export the created tree to Newick format and save the found clusters to a text file.
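The tree-building and cluster-identification steps can be illustrated with a small single linkage sketch. It makes several assumptions that are not taken from the repository: the input is a dictionary of pairwise distances where smaller means more closely related (with similarity scores the comparisons would be reversed), the function name is hypothetical, and merging simply stops once the closest pair of clusters is farther apart than the threshold.

```python
from itertools import combinations


def single_linkage_clusters(names, dist, threshold):
    """Merge clusters until the closest pair is farther apart than threshold."""
    clusters = [frozenset([n]) for n in names]

    def linkage(c1, c2):
        # Single linkage: distance between the two closest members.
        return min(dist[frozenset((x, y))] for x in c1 for y in c2)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        if linkage(c1, c2) > threshold:
            break
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return sorted(sorted(c) for c in clusters)


# Toy distances between four hypothetical organisms.
pairs = {("A", "B"): 1, ("C", "D"): 2, ("A", "C"): 5,
         ("A", "D"): 6, ("B", "C"): 4, ("B", "D"): 7}
dist = {frozenset(k): v for k, v in pairs.items()}

print(single_linkage_clusters("ABCD", dist, threshold=3))
# [['A', 'B'], ['C', 'D']]
```

Recording the order of merges, instead of stopping at a threshold, would give the full tree. This brute-force version recomputes linkages on every pass, so it is meant for reading rather than for 71 organisms at speed.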

Project assumptions

The project aims to find out which of the selected Large Language Models is best at solving the aforementioned assignment in the Python and Java programming languages. To that end, a set of experiments was performed.

Each experiment consisted of running a predefined set of prompts on a chosen LLM via its API, using the langchain framework to create the run flow. This flow can be found in the LLM_experiments/experiments_run.py file. Running this file assumes that the same folder contains a file called api_keys.json, containing the API keys for each of the APIs used.
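The layout of api_keys.json is not documented here, so the following is only a sketch under an assumed format: a flat JSON object mapping a provider name to its key. The function name `load_api_keys` and the provider names in the usage note are hypothetical and do not come from experiments_run.py.

```python
import json
from pathlib import Path


def load_api_keys(path, required):
    """Load API keys from a JSON file and fail early if any are missing."""
    # Assumed format: {"provider_name": "key-value", ...}
    keys = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [name for name in required if not keys.get(name)]
    if missing:
        raise KeyError(f"Missing API keys: {', '.join(missing)}")
    return keys
```

A call such as `load_api_keys("api_keys.json", ["provider_a"])` would return the parsed dictionary, or raise `KeyError` before any model is contacted if an expected key is absent or empty.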

To give each model a fair chance at solving the assignment, there are a few restrictions:

  • each model is given exactly the same set of previously prepared prompts
    • LLM_experiments/prompts_final_python.yaml file contains prompts for obtaining a solution in Python 3.11
    • LLM_experiments/prompts_final_java.yaml file contains prompts for obtaining a solution in Java 21
  • each model has temperature set to 0.2 and max number of tokens set to 4096
  • each model was run 3 times and provided 3 different solutions; the best solution was chosen based on its correctness (percentage of test cases passed) and its cleanliness (SonarQube result)

Models tested in the current version of this project:

Folder structure

  • LLM_experiments - folder containing logs from runs on the selected LLMs and the code solutions generated by those LLMs
  • reference_implementation - folder containing the solution prepared by the repository owner
  • starter_code - folder containing data and starter code from the nifty page, with additional saving of the BLOSUM table to a json file
  • tests - folder containing test definitions for comparing an LLM solution with the reference solution

Inside the LLM_experiments folder is the data gathered from the experiments. It follows a predefined structure:

|-- python/java - folder indicating whether the solution was prepared in python or java
|---- results_{llm_name} - folder containing all solutions for the given language from the selected LLM
|------ solution_try_{number} - folder containing logs and code for one of the runs, each run has a different number
|-------- conversation_{llm_name}_{python/java}try{number}.md - markdown file containing all prompts and responses
|-------- code files with solution
|-------- readMe.md file generated by the LLM
|-------- requirements.txt/pom.xml
|-------- input data files
|-------- output files from code run

Input data files are always the same:

  • blosum50.json - BLOSUM50 table in json format
  • blosum62.json - BLOSUM62 table in json format
  • organisms.json - DNA strings for 71 organisms
  • thresholds.txt - threshold values for the clustering task

Each piece of code generated by an LLM was run using the BLOSUM62 table, so each folder should contain 5 output files:

  • organisms_scores_blosum62.json - file containing Needleman-Wunsch distances for all species pairs
  • tree_blosum62_newick.nw - file containing the created tree exported to Newick format with only the leaf nodes' names, for example: "(A,B,(C,D));"
  • tree_blosum62_newick_with_distance.nw - file containing the created tree exported to Newick format with both the leaf nodes' names and the distances between branches in the tree, for example: "(A:1,B:2,(C:3,D:4):5);"
  • phylogenetic_tree_blosum62.png - file containing a visualisation of the created phylogenetic tree
  • clusters_for_blosum62.json - file containing the clusters found for each of the thresholds from the thresholds.txt file
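The two Newick variants above can be reproduced with a short recursive sketch. The tree representations used here are assumptions made for the example, not the data structures of the reference implementation: a leaf is a string, an internal node is a tuple of children, and in the variant with distances each child is paired with its branch length.

```python
def to_newick(node) -> str:
    """Serialise a tree of nested tuples, with leaf names only."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


def to_newick_with_lengths(node) -> str:
    """Serialise a tree whose internal nodes hold (child, branch_length) pairs."""
    if isinstance(node, str):
        return node
    parts = [f"{to_newick_with_lengths(child)}:{length}" for child, length in node]
    return "(" + ",".join(parts) + ")"


plain_tree = ("A", "B", ("C", "D"))
print(to_newick(plain_tree) + ";")  # (A,B,(C,D));

weighted_tree = (("A", 1), ("B", 2), ((("C", 3), ("D", 4)), 5))
print(to_newick_with_lengths(weighted_tree) + ";")  # (A:1,B:2,(C:3,D:4):5);
```

The terminating semicolon is appended once at the root, so the recursive helpers can be reused on any subtree.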
