Explore_matrices_and_get_data_for_multi_model.ipynb - in this notebook we explore the UniBind+Remap matrix and collect data to train a 50 TF multi-model;
Train_multi_model.ipynb - in this notebook we train a CNN multi-model with 50 TFs and generate the results;
Train_multi_model_DanQ.ipynb - in this notebook we train a DanQ multi-model with 50 TFs and generate the results;
TL_exploring_pentad_TFs.ipynb - in this notebook we explore the results of training with biologically relevant groups (with TF pentad) using a CNN model;
TL_exploring_pentad_TFs_DanQ.ipynb - in this notebook we explore the results of training with biologically relevant groups (with TF pentad) using a DanQ model;
TL_effect_of_multimodels.ipynb - effect of different multi-models (5 vs 50) on the individual model performance (for pentad TFs);
TL_individual_models.ipynb - plotting the results for the individual models vs multi-model; plotting the box plots for the effect of TL;
TL_with_freezed_layers.ipynb - inspecting the TL performance when the convolutional layers are frozen;
Exploring_effect_of_BM_on_15_TFs.ipynb - in this notebook we plot the effect of BM on TL for 15 TFs (3 TFs for each of the 5 pentad families);
Model_interpretation.ipynb - here we interpret the multi-model and the individual models by converting first-layer convolutional filters into PWMs;
Interpretation_of_models_finetuned_with_cofactors.ipynb - interpreting individual models that were preinitialized with weights from multi-models trained on cofactors;
Data_size_effect_on_TL.ipynb - exploring how TL affects the performance for different sub-sampled data sets;
UMAP_Binding_heatmap_and_selecting_groups.ipynb - in this notebook we analyze the binding matrix by plotting a UMAP projection and a heatmap of binding pattern similarities. We also select biologically relevant groups for TL;
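The filter-to-PWM conversion mentioned for Model_interpretation.ipynb is not spelled out in this listing. A common way to do it, sketched below under assumptions, is to scan sequences with one first-layer filter, keep the windows whose activation exceeds a fraction of the maximum, and count the nucleotides at each position. The function names, the 0.5 threshold, and the plain-list filter layout (k rows of A, C, G, T weights) are all hypothetical and may differ from what the notebook does. The result is a position frequency matrix, from which a PWM can be derived.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of [A, C, G, T]; unknown bases become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def filter_to_pfm(filter_weights, sequences, threshold_fraction=0.5):
    """Build a position frequency matrix from the windows that activate a filter.

    filter_weights: k rows, each holding 4 weights in A, C, G, T order (assumed layout).
    threshold_fraction: hypothetical cutoff, as a fraction of the maximum activation.
    """
    k = len(filter_weights)
    scored = []
    for seq in sequences:
        enc = one_hot(seq)
        for start in range(len(enc) - k + 1):
            window = enc[start:start + k]
            activation = sum(
                filter_weights[i][j] * window[i][j]
                for i in range(k)
                for j in range(4)
            )
            scored.append((activation, window))
    if not scored:
        raise ValueError("no sequence is long enough for this filter")

    cutoff = threshold_fraction * max(a for a, _ in scored)
    counts = [[0.0] * 4 for _ in range(k)]
    for activation, window in scored:
        # Keep only positively activated windows above the cutoff.
        if activation > 0 and activation >= cutoff:
            for i in range(k):
                for j in range(4):
                    counts[i][j] += window[i][j]

    # Normalize each position to frequencies; fall back to uniform if empty.
    pfm = []
    for row in counts:
        total = sum(row)
        pfm.append([c / total for c in row] if total > 0 else [0.25] * 4)
    return pfm


# Toy filter that rewards the motif "ACG" and penalizes everything else.
toy_filter = [[1, -1, -1, -1], [-1, 1, -1, -1], [-1, -1, 1, -1]]
toy_pfm = filter_to_pfm(toy_filter, ["TTACGTT", "GGACGAA"])
```

In practice the filter weights would come from the trained model's first convolutional layer, and the sequences from the test set.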
Bash scripts
run_training_of_individual_models.sh - run to train 148 individual TF models from scratch or using 50 TF multi-model to initialize weights;
run_training_of_individual_models_FREEZING_LAYERS.sh - run to train 148 individual TF models from scratch or using 50 TF multi-model to initialize weights; this time the convolutional layers are frozen;
run_single_tf_refined_subsample_experiment.sh - run to test TL boundaries by sub-sampling different numbers of positive regions;
run_BM_real_TFs_last_exp.sh (runs run_BM_tl_last_exp_corrected_remove.sh) - trains models with TL for 15 TFs;
run_BM_multimodel_TFs.sh (runs run_BM_multimodel.sh) - performs TL using either the 50 or the 5 TF multi-model;
run_BM_real_TFs.sh (runs run_BM_tl_subsample.sh) - trains individual models using TL with TFs from the same BM; for speed, subsamples data sets to 1000 positives/negatives; also runs run_BM_tl_subsample_DanQ.sh - same as above but for DanQ;
run_cofactors_real_TFs.sh (runs run_cofactor_tl_subsample.sh) - trains individual models using TL with TFs that are cofactors/STRING partners/low-correlated TFs with the same BM; for speed, subsamples data sets to 1000 positives/negatives;
get_indiv_data_for_each_TF.sh - get data splits for individual TFs;
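Several of the scripts above subsample data sets to 1000 positives/negatives. A minimal sketch of such a step is shown below. It assumes records are (sequence, label) pairs with label 1 for bound and 0 for unbound regions, and that the same number is drawn from each class; the function name, record layout, and seed handling are hypothetical and not taken from the repository.

```python
import random


def subsample_balanced(records, n_per_class=1000, seed=0):
    """Draw up to n_per_class positives and the same number of negatives.

    records: iterable of (sequence, label) pairs, label 1 = positive, 0 = negative
    (assumed layout). If a class is smaller than n_per_class, both classes are
    reduced to the size of the smaller one so the sample stays balanced.
    """
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    positives = [r for r in records if r[1] == 1]
    negatives = [r for r in records if r[1] == 0]
    n = min(n_per_class, len(positives), len(negatives))
    sample = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(sample)
    return sample


# Toy data: 3000 positive and 5000 negative placeholder regions.
toy_records = [("ACGT", 1)] * 3000 + [("TTTT", 0)] * 5000
toy_sample = subsample_balanced(toy_records)
```

Fixing the seed makes runs comparable across TFs; varying `n_per_class` gives the kind of sweep used to probe how TL behaves at different data sizes.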
Python scripts
split_the_dataset.py - script that takes as input fasta files and labels and splits the data into train/validation/test sets; one-hot encodes (and reverse complements if required) the sequences;
Run_Analysis_Training.py - trains a multi-model;
Run_String_Analysis_Training.py - same as above, but saves class labels in a separate file;
remove_training_data.py - script for removing data used to train a multi-model from the original TF binding matrix;
Run_Analysis_String_Transfer_Learning.py - same as above, but accepts class labels to use during the testing;
Run_Analysis_Transfer_Learning_Subsampling.py - same as above, but takes as input a specified test data set (used during testing TL boundaries);
get_data_for_TF.py - script to build fasta and labels files for a specific TF;
get_data_for_TF_subsample_positives_old.py - same as above, but subsamples data to a certain number of positives/negatives;
get_data_for_TF_subsample_positives.py - same as above, but also subsamples a certain number of test sequences;
Run_BM_Analysis.py - generates fasta and labels files for a specific TF and binding mode; the final subsampled data cannot be less than 70,000 sequences;
Run_BM_Analysis_LE.py - same as above, but no restriction on subsampled data size;
Run_BM_Multimodel_Analysis.py - same as above, but randomly samples 40,000 regions, and saves labels for 50 and 5 classes multi-model;
Run_Cofactor_Analysis.py - generates fasta and labels files for a specific TF and its biological group (cofactors, STRING partners, low-correlated BM); the final subsampled data cannot be less than 70,000 sequences;
Run_Cofactor_Multimodel_Analysis.py - same as above, but randomly samples 40,000 regions, and saves labels for 50 and 5 classes multi-model;
split_the_dataset_bm_multimodel.py - splits data set for different multimodels;
models.py - python script with model architectures;
tools.py - python script with functions used to analyze the data;
deeplift_scores.py - python script to compute DeepLIFT importance scores using the Captum library;
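The preprocessing described for split_the_dataset.py (train/validation/test split, one-hot encoding, optional reverse complementing) can be sketched as follows. This is an illustration under assumptions, not the script itself: the 80/10/10 fractions, the A, C, G, T channel order, adding reverse complements to the training set only, and all function names are hypothetical.

```python
import random

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def one_hot_encode(seq):
    """Encode a DNA string as rows of [A, C, G, T]; other characters (e.g. N) become zeros."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def split_dataset(sequences, labels, fractions=(0.8, 0.1, 0.1), add_revcomp=False, seed=0):
    """Shuffle, split into train/valid/test, and one-hot encode each part.

    fractions: assumed split sizes; the test set takes whatever remains.
    add_revcomp: if True, each training sequence is followed by its reverse
    complement with the same label (an assumed augmentation policy).
    """
    n = len(sequences)
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(fractions[0] * n)
    n_valid = int(fractions[1] * n)
    parts = {
        "train": indices[:n_train],
        "valid": indices[n_train:n_train + n_valid],
        "test": indices[n_train + n_valid:],
    }
    encoded = {}
    for name, idx in parts.items():
        xs, ys = [], []
        for i in idx:
            xs.append(one_hot_encode(sequences[i]))
            ys.append(labels[i])
            if add_revcomp and name == "train":
                xs.append(one_hot_encode(reverse_complement(sequences[i])))
                ys.append(labels[i])
        encoded[name] = (xs, ys)
    return encoded


# Toy input: ten short sequences with alternating labels.
toy_seqs = ["AACG", "TTGC", "ACGT", "GGCA", "CATG", "TGCA", "AAAA", "CCCC", "GGGG", "TTTT"]
toy_labels = [i % 2 for i in range(10)]
toy_splits = split_dataset(toy_seqs, toy_labels, add_revcomp=True)
```

Restricting the reverse-complement copies to the training set keeps the validation and test sets the same size as in the unaugmented case, which makes metrics comparable between the two settings.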
About
IPython notebooks and scripts for performing the different transfer learning experiments