KorkinLab/LLM-Complexes

LLM-guided Large Protein Complexes Data Retrieval Pipeline

This repository contains a data processing and evaluation pipeline for analyzing protein complexes generated by various Large Language Models (LLMs) such as ChatGPT, Perplexity, Claude, Deepseek, Llama, and Gemini.

The end-to-end pipeline is orchestrated by pipeline.py, which automates data integration, consensus voting, metric calculation, and plotting.

How to Run the Pipeline

To run the full pipeline, execute:

python pipeline.py

Command Line Arguments

The pipeline supports skipping specific steps if you only want to run parts of the analysis:

  • --skip-steps 1 2: Skips the listed step numbers (in this example, Step 1 and Step 2).
  • --skip-first-three: A shortcut flag that skips data integration, consensus, and metric calculation (Steps 1, 2, and 3). This is useful for regenerating plots from previously computed data.
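
The two flags above could be parsed along the following lines. This is a minimal sketch using argparse, not the actual code in pipeline.py: the names ALL_STEPS, build_parser, and steps_to_run are hypothetical, and the sketch assumes the five steps are numbered 1 through 5 as in the section below.

```python
import argparse

# Assumption: the pipeline has five steps, numbered as in "Pipeline Steps Details".
ALL_STEPS = [1, 2, 3, 4, 5]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the two skip flags described in this README."""
    parser = argparse.ArgumentParser(description="Run the analysis pipeline.")
    parser.add_argument(
        "--skip-steps",
        type=int,
        nargs="+",
        default=[],
        help="Step numbers to skip, e.g. --skip-steps 1 2",
    )
    parser.add_argument(
        "--skip-first-three",
        action="store_true",
        help="Shortcut for skipping Steps 1, 2, and 3",
    )
    return parser


def steps_to_run(argv=None):
    """Return the step numbers that remain after applying both flags."""
    args = build_parser().parse_args(argv)
    skipped = set(args.skip_steps)
    if args.skip_first_three:
        skipped |= {1, 2, 3}
    return [step for step in ALL_STEPS if step not in skipped]


if __name__ == "__main__":
    print("Steps to run:", steps_to_run())
```

Under these assumptions, passing --skip-first-three leaves only Steps 4 and 5, which matches the plot-regeneration use case described above.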

Pipeline Steps Details

Step 1: Integrating Complexes

  • Scripts: integrating_complexes_f1.py, integrating_complexes_bridges.py
  • Description: Processes the raw complex files generated by the LLMs, located in the sources/ directory. This step evaluates and refines them into base complexes (True Positives) and extracts "bridges" connecting different complexes. The refined outputs are saved to the integrated_tp/ and integrated_bridges/ directories.

Step 2: Consensus Code (Voting)

  • Scripts: integrating_complexes_voting.py
  • Description: Applies a consensus (voting) mechanism across the outputs of the different LLMs, both raw and bridged. This aggregates the individual models' predictions into consensus datasets and saves them to the integrated_voting/ directory.
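
To illustrate the general idea of voting across models, the sketch below keeps a complex only if at least min_votes models proposed it. The exact rule used by integrating_complexes_voting.py is not described here, so the threshold, the function name vote_consensus, and the representation of a complex as a set of protein identifiers are all assumptions.

```python
from collections import Counter


def vote_consensus(model_outputs, min_votes):
    """Keep complexes proposed by at least `min_votes` models.

    model_outputs maps a model name to a list of complexes, where each
    complex is an iterable of protein identifiers (hypothetical layout).
    """
    votes = Counter()
    for complexes in model_outputs.values():
        # frozenset ignores member order; the set comprehension ensures
        # each model contributes at most one vote per complex.
        votes.update({frozenset(members) for members in complexes})
    return {cplx for cplx, count in votes.items() if count >= min_votes}


if __name__ == "__main__":
    # Toy identifiers, not real data from the repository.
    outputs = {
        "model_a": [["P1", "P2"]],
        "model_b": [["P2", "P1"], ["P3", "P4"]],
        "model_c": [["P3", "P4"], ["P1", "P2"]],
    }
    for cplx in vote_consensus(outputs, min_votes=3):
        print(sorted(cplx))
```

Exact set equality is the strictest way to decide that two models agree; a real implementation might instead merge complexes that overlap heavily.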

Step 3: Metrics Calculation (F1 and Graph Density)

  • Scripts: calculate_f1.py, calculate_graph_density.py, calculate_graph_density_stringdb.py
  • Description: Evaluates all generated datasets (raw outputs, base complexes, bridged complexes, and consensus models) against ground truth data such as verified_complexes.json and STRING DB. It computes key performance metrics including F1 scores and Graph Density, outputting the results to their respective directories (results_f1/, results_graph_density/, and results_graph_density_sdb/).
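
For reference, the two metrics have standard definitions: F1 is the harmonic mean of precision and recall, and the density of an undirected simple graph with N nodes and E edges is 2E / (N(N-1)). The sketch below implements these textbook formulas. It is not taken from calculate_f1.py or calculate_graph_density.py; in particular, treating predictions and ground truth as sets compared by exact membership is an assumption.

```python
def f1_score(predicted: set, reference: set) -> float:
    """Harmonic mean of precision and recall over two sets of items."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0  # also covers empty inputs, avoiding division by zero
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def graph_density(num_nodes: int, num_edges: int) -> float:
    """Density of an undirected simple graph: 2E / (N * (N - 1))."""
    if num_nodes < 2:
        return 0.0  # no edges are possible with fewer than two nodes
    return 2 * num_edges / (num_nodes * (num_nodes - 1))


if __name__ == "__main__":
    # Toy identifiers, not real data from the repository.
    print(f1_score({"A", "B", "C"}, {"B", "C", "D"}))
    print(graph_density(num_nodes=4, num_edges=3))
```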

Step 4: Regenerate Main Plots

  • Scripts: figure_plotting/_regenerate_plots.py
  • Description: Reads the metric results calculated in Step 3 to automatically generate the primary figures and charts used for analysis and presentation.

Step 5: Supplemental Scripts

  • Scripts: plotting/wordcloud.py and various scripts in supplemental_plotting/ (e.g., heatmap_consensus_comparison.py, llm_comparison.py, scatterplots)
  • Description: Generates supplementary visualizations to provide deeper insight into the data. These include word clouds, heatmaps comparing consensus methods, LLM head-to-head comparisons, and scatter plots correlating F1 scores with graph densities.
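
A scatter plot of F1 against graph density is often summarized by a correlation coefficient. The sketch below computes the Pearson coefficient for two equally long lists of values. Whether the supplemental scripts report this particular statistic is not stated here, and the function name pearson is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy values, not results from the repository.
    f1_scores = [0.2, 0.4, 0.6, 0.8]
    densities = [0.1, 0.3, 0.2, 0.5]
    print(round(pearson(f1_scores, densities), 3))
```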
