LLM-guided Large Protein Complexes Data Retrieval Pipeline

This repository contains a data processing and evaluation pipeline for analyzing protein complexes generated by Large Language Models (LLMs) such as ChatGPT, Perplexity, Claude, DeepSeek, Llama, and Gemini.

The end-to-end pipeline is orchestrated by pipeline.py, which automates data integration, consensus tracking, metric calculation, and plotting.

How to Run the Pipeline

To run the full pipeline, execute:

python pipeline.py

Command Line Arguments

The pipeline supports skipping specific steps if you only want to run parts of the analysis:

--skip-steps 1 2: Skips the listed step numbers (in this example, Step 1 and Step 2).
--skip-first-three: A shortcut flag to skip the data integration, consensus, and metric calculation steps (Steps 1, 2, and 3). This is useful if you just want to quickly regenerate plots using previously computed data.
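
The sketch below shows one way the two flags above could be parsed and combined into a list of steps to run. It is an illustration only: the actual argument handling in pipeline.py is not shown in this README, and the function names parse_args and steps_to_run are hypothetical. Only the flag names and the five step numbers come from this document.

```python
import argparse


def parse_args(argv=None):
    """Parse the two skip flags described in this README (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Run the analysis pipeline.")
    parser.add_argument(
        "--skip-steps",
        nargs="*",
        type=int,
        default=[],
        help="Step numbers to skip, e.g. --skip-steps 1 2",
    )
    parser.add_argument(
        "--skip-first-three",
        action="store_true",
        help="Shortcut for skipping Steps 1, 2, and 3",
    )
    return parser.parse_args(argv)


def steps_to_run(args, all_steps=(1, 2, 3, 4, 5)):
    """Return the step numbers that remain after applying both flags."""
    skipped = set(args.skip_steps)
    if args.skip_first_three:
        skipped |= {1, 2, 3}
    return [step for step in all_steps if step not in skipped]


# Passing an explicit argument list keeps the example independent of sys.argv.
print(steps_to_run(parse_args(["--skip-first-three"])))  # prints [4, 5]
```

With this design, the two flags can be combined: skipped steps are collected into one set, so passing --skip-first-three together with --skip-steps 5 would leave only Step 4.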

Pipeline Steps Details

Step 1: Integrating Complexes

Scripts: integrating_complexes_f1.py, integrating_complexes_bridges.py
Description: Processes the raw LLM-generated complex files located in the sources/ directory. These scripts evaluate and refine the raw files into base complexes (True Positives) and extract "bridges" connecting different complexes. The refined outputs are saved into the integrated_tp/ and integrated_bridges/ directories.

Step 2: Consensus Code (Voting)

Scripts: integrating_complexes_voting.py
Description: Applies a consensus (voting) mechanism across the different LLMs' outputs (both raw and bridged). This aggregates the individual models' predictions into consensus datasets and saves them into the integrated_voting/ directory.
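
As a rough illustration of what a voting step can look like, the sketch below keeps every protein that at least min_votes models propose for the same complex. This is an assumption, not the rule implemented in integrating_complexes_voting.py, which this README does not specify. The function name vote_consensus, the min_votes parameter, and the placeholder identifiers P1 to P4 are all hypothetical.

```python
from collections import Counter


def vote_consensus(predictions, min_votes=2):
    """Keep proteins proposed by at least `min_votes` models.

    `predictions` maps a model name to the collection of protein
    identifiers that model proposed for one complex (assumed layout).
    """
    counts = Counter()
    for proteins in predictions.values():
        # set() ensures a model that repeats a protein still casts one vote.
        counts.update(set(proteins))
    return {protein for protein, votes in counts.items() if votes >= min_votes}


# Placeholder identifiers, not real data.
example = {
    "model_a": ["P1", "P2", "P3"],
    "model_b": ["P2", "P3", "P4"],
    "model_c": ["P3"],
}
print(sorted(vote_consensus(example, min_votes=2)))  # prints ['P2', 'P3']
```

Raising min_votes makes the consensus stricter (fewer proteins, each supported by more models), while min_votes=1 reduces to a plain union of all predictions.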

Step 3: Metrics Calculation (F1 and Graph Density)

Scripts: calculate_f1.py, calculate_graph_density.py, calculate_graph_density_stringdb.py
Description: Evaluates all generated datasets (raw outputs, base complexes, bridged complexes, and consensus models) against ground truth data such as verified_complexes.json and STRING DB. It computes key performance metrics including F1 scores and Graph Density, outputting the results to their respective directories (results_f1/, results_graph_density/, and results_graph_density_sdb/).
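
For readers unfamiliar with the two metrics, the sketch below uses their standard textbook definitions: F1 is the harmonic mean of precision and recall over sets of proteins, and the density of an undirected graph with N nodes and E edges is 2E / (N(N - 1)). How calculate_f1.py and the graph density scripts build their sets and edges is not described in this README, so the function names, inputs, and edge handling here are assumptions.

```python
def f1_score(predicted, truth):
    """F1 between a predicted and a reference set of protein identifiers."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    if true_positives == 0:
        # Covers empty inputs and disjoint sets without dividing by zero.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def graph_density(nodes, edges):
    """Density of an undirected simple graph: 2E / (N * (N - 1))."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    # Treat (a, b) and (b, a) as one edge; drop self-loops and unknown nodes.
    unique_edges = {
        frozenset(edge)
        for edge in edges
        if len(set(edge)) == 2 and set(edge) <= nodes
    }
    return 2 * len(unique_edges) / (n * (n - 1))
```

Under these definitions both metrics fall between 0 and 1: an F1 of 1 means the predicted set matches the reference exactly, and a density of 1 means every pair of proteins in the complex is connected by an edge.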

Step 4: Regenerate Main Plots

Scripts: figure_plotting/_regenerate_plots.py
Description: Reads the metric results calculated in Step 3 to automatically generate the primary figures and charts used for analysis and presentation.

Step 5: Supplemental Scripts

Scripts: plotting/wordcloud.py and various scripts in supplemental_plotting/ (e.g., heatmap_consensus_comparison.py, llm_comparison.py, scatterplots)
Description: Generates supplementary visualizations to provide deeper insight into the data. These include word clouds, heatmaps comparing consensus methods, LLM head-to-head comparisons, and scatter plots correlating F1 scores with graph densities.
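
One common way to quantify the relationship shown in such scatter plots is the Pearson correlation coefficient. The sketch below computes it in plain Python. Whether the supplemental scripts report this statistic, or any statistic at all, is not stated in this README, so treat the choice of Pearson correlation and the function name pearson_r as assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined when one variable is constant.
        return float("nan")
    return covariance / math.sqrt(var_x * var_y)
```

Here xs would hold one F1 score per dataset and ys the matching graph density, so a value near 1 indicates that datasets with higher F1 also tend to have denser graphs.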

