Comprehensive IRES Identification, Directed Mutation, and De Novo Generation Using AI-Driven Methods
Internal ribosome entry sites (IRESs) within RNA enable ribosome recruitment and cap-independent translation initiation. IRESs are potential therapeutic targets for regulating viral protein expression. Our study presents an innovative framework for IRES research. IRES-LM, a language model transferred from UTR-LM and RNA-FM, is trained on 46,774 sequences to predict linear mRNA IRESs, outperforming existing methods with a 15% improvement in AUC and F1 score. Additionally, IRES-LM demonstrates remarkable cross-applicability to circRNA IRESs, correctly identifying all 21 experimentally validated IRESs and outperforming benchmark methods. Next, IRES-EA combines IRES-LM with an evolutionary algorithm for directed mutation. We tested 37,293 non-IRES sequences, successfully mutating 22,243 into IRES sequences. Moreover, experimental validation shows that 1 out of 7 model-generated EMCV IRES variants achieved higher activity than wild-type EMCV IRES, and secondary structure comparison identifies potential key elements for future engineering. For CVB3 IRES variants, 1 out of 10 model-generated mutants achieved activity comparable to the wild type, with all mutated sequences exhibiting detectable activity above the negative control. Further, IRES-DM employs a conditional diffusion model to generate novel IRES sequences de novo that resemble natural IRESs in structure and function, exemplified by a generated sequence sharing only 27.6% sequence similarity with BiP IRES while maintaining a similar secondary structure. Experimental testing of two generated sequences similar to VCIP IRES demonstrated substantial IRES activity despite their being only 95 nt in length. This comprehensive framework advances IRES identification, optimization, and design, highlighting its potential for enhancing RNA vaccine efficacy and broader therapeutic applications.
This framework consists of three main components:
- IRES Language Model for IRES Identification (IRES-LM)
  - IRES-UTRLM: A language model fine-tuned on untranslated regions (UTRs)
  - IRES-RNAFM: A language model based on non-coding RNA sequences
- IRES Evolutionary Algorithm (IRES-EA)
  - Enhancing IRES activity through guided mutations
  - Supporting multiple mutation strategies
- IRES Diffusion Model (IRES-DM)
  - De novo generation of functional IRES sequences
  - Two training strategies: reward-guided and directed training
- ./Data/v2_dataset_with_unified_stratified_shuffle_train_test_split.csv: Contains 9,072 IRES and 37,602 non-IRES sequences
- ./Data/circires_case.fa: 21 experimentally validated circRNA IRESs collected from DeepCIP
Typical install time on a standard desktop computer: less than 1 hour.
git clone https://github.com/a96123155/IRES_Prediction_Design.git
cd IRES_Prediction_Design
conda create -n ires python=3.9 # or 3.12
conda activate ires
pip install -r requirements.txt
IRES-LM combines two fine-tuned language models: IRES-UTRLM (based on untranslated regions) and IRES-RNAFM (based on non-coding RNA sequences).
cd ./Script
CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 --master_port 8890 IRES_RNAFM.py --device_ids 0 --bos_emb --truncate --finetune_esm --prefix IRES_RNAFM
Namespace(device_ids='0', local_rank=0, seed=1337, prefix='IRES_RNAFM', epochs=10, nodes=40, dropout3=0.5, lr=0.0001, folds=10, random_init=False, avg_emb=False, bos_emb=True, epoch_without_improvement=10, pos_label_weight=-1, cls_loss_weight=2.0, mlm_loss_weight=1, finetune_esm=True, finetune_sixth_layer_esm=False, truncate=True, truncate_num=1024, mask_prob=0.15, batch_toks=4096)
./results/IRES_RNAFM_CLS2_10folds_AvgEmbFalse_BosEmbTrue_epoch10_nodes40_dropout30.5_finetuneESMTrue_finetuneLastLayerESMFalse_lr0.0001_metrics.csv
./models/IRES_RNAFM_best_model_fold[0-9].pt (available from the link)
../models/UTRLM_pretrained_model.pkl (available from the link)
cd ./Script
CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 --master_port 8889 IRES_UTRLM.py --device_ids 0 --bos_emb --finetune_esm --prefix IRES_UTRLM
Namespace(device_ids='0', local_rank=3, seed=1337, prefix='IRES_UTRLM', epochs=100, nodes=40, dropout3=0.5, lr=0.0001, random_init=False, modelfile='UTRLM_pretrained_model.pkl', avg_emb=False, bos_emb=True, epoch_without_improvement=10, pos_label_weight=-1, cls_loss_weight=20.0, mlm_loss_weight=1, finetune_esm=True, finetune_sixth_layer_esm=False, truncate=False, truncate_num=50, mask_prob=0.15, batch_toks=2048)
./results/IRES_UTRLM_FinetuneESM_lr1e-4_dr5_bos_CLS20_10folds_F0_AvgEmbFalse_BosEmbTrue_epoch100_nodes40_dropout30.5_finetuneESMTrue_finetuneLastLayerESMFalse_lr0.0001_metrics.csv
./models/IRES_UTRLM_best_model_fold[0-9].pt (available from the link)
- ./Results/IRESLM_experimentally_validated_circRNA_IRES_n21.csv: Performance comparison in predicting 21 experimentally validated circRNA IRESs. IRES-LM, which averages predictions from IRES-UTRLM and IRES-RNAFM, identified all 21 IRESs. In this file, "U" represents IRES-UTRLM and "R" represents IRES-RNAFM.
- ./IRESLM_UMAP_Representation_MFE: UMAP visualization of IRES-LM representations colored by minimum free energy (MFE) values.
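The averaging step mentioned above can be sketched as follows. This is a minimal illustration rather than the repository's code: the function name `ensemble_probability`, the choice to average per-fold probabilities within each model first, and the 0.5 decision threshold are all assumptions.

```python
from statistics import mean


def ensemble_probability(utrlm_fold_probs, rnafm_fold_probs):
    """Average per-fold IRES probabilities within each model, then across the two models."""
    u = mean(utrlm_fold_probs)  # "U": IRES-UTRLM score (assumed to be a fold average)
    r = mean(rnafm_fold_probs)  # "R": IRES-RNAFM score (assumed to be a fold average)
    return (u + r) / 2


# Hypothetical per-fold probabilities for one sequence (three folds shown for brevity)
utrlm = [0.75, 1.0, 0.5]
rnafm = [0.25, 0.5, 0.0]
prob = ensemble_probability(utrlm, rnafm)
is_ires = prob >= 0.5  # assumed threshold, not taken from the repository
```

Averaging the two models equally means a sequence is called an IRES only when their combined evidence crosses the threshold, so one confident model can compensate for a less confident one.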
Training was conducted on NVIDIA Tesla GPUs with 32GB memory per GPU, with IRES-UTRLM requiring approximately 355 seconds per epoch across 4 GPUs, while IRES-RNAFM required approximately 828 seconds per epoch.
IRES-EA combines IRES-LM with an evolutionary algorithm to enhance IRES activity through guided mutations.
- --prefix: Sets the output file prefix (default: 'VCIP_gene_editing_AtoG')
- --wt_seq: Specifies the wild-type sequence to be mutated (default: VCIP sequence)
- --nb_folds: Number of cross-validation models to use (default: 10)
- --model_filename: Path to the model file (default: 'IRES_RNAFM_CLS2_best_model')
- --start_nt_position: Starting position for mutations (0-based indexing)
- --end_nt_position: Ending position for mutations (-1 means the last position)
- --n_mut: Maximum number of mutations allowed for each sequence (default: 3)
- --mutation_type: Mutation strategy to use:
  - mut_random: Random mutations
  - mut_attn: Attention-guided mutations
  - mut_gene_editing: Base-specific editing (e.g., A→G)
- --nt_replacements: For gene editing mode, defines nucleotide substitutions (format: "A=G,C=T")
- --mutate2stronger: When enabled, directs mutations toward stronger IRES activity
- --mut_by_prob: When enabled, uses probability-based mutation instead of the transformed probability selected by transform_type
- --transform_type: Probability transformation method:
  - logit: Logistic transformation
  - sigmoid: Sigmoid function
  - power_law: Power law transformation
  - tanh: Hyperbolic tangent
- --mlm_tok_num: Number of masked tokens per sequence in each iteration (default: 1)
- --n_designs_ep: Number of candidate mutations generated per iteration (default: 10)
- --n_sampling_designs_ep: Number of mutations sampled from candidates (default: 5)
- --n_mlm_recovery_sampling: Number of sampling attempts for masked language model recovery (default: 1)
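To make the mutation options above more concrete, here is a small sketch of how the random and gene-editing strategies might behave. It is not the repository's implementation: all function names are hypothetical, the end position is assumed to be inclusive, and attention-guided mutation and model scoring are left out.

```python
import random


def parse_nt_replacements(spec):
    """Parse a string such as "A=G,C=T" into {"A": "G", "C": "T"}."""
    return dict(pair.split("=") for pair in spec.split(",") if pair)


def resolve_window(seq, start, end):
    """Return the positions eligible for mutation; end == -1 means the last position."""
    stop = len(seq) if end == -1 else end + 1  # assumption: end is inclusive
    return range(start, stop)


def mutate_random(seq, n_mut, start=0, end=-1, rng=random):
    """Substitute up to n_mut positions in the window with a different nucleotide."""
    positions = list(resolve_window(seq, start, end))
    chosen = rng.sample(positions, min(n_mut, len(positions)))
    out = list(seq)
    for i in chosen:
        out[i] = rng.choice([nt for nt in "ACGT" if nt != out[i]])
    return "".join(out)


def mutate_gene_editing(seq, replacements, n_mut, start=0, end=-1, rng=random):
    """Apply base-specific edits (e.g., A→G) at up to n_mut eligible positions."""
    eligible = [i for i in resolve_window(seq, start, end) if seq[i] in replacements]
    chosen = rng.sample(eligible, min(n_mut, len(eligible)))
    out = list(seq)
    for i in chosen:
        out[i] = replacements[seq[i]]
    return "".join(out)


wt = "CAGCCTCGGCCAGGAGGCGA"  # first 20 nt of the VCIP example sequence
rng = random.Random(0)       # seeded only to make the example reproducible
edited = mutate_gene_editing(wt, parse_nt_replacements("A=G"), n_mut=3, rng=rng)
shuffled = mutate_random(wt, n_mut=2, start=5, end=9, rng=rng)
```

In a full evolutionary loop, candidates produced this way would be scored by IRES-LM and the best ones carried into the next iteration; that selection step is omitted here because its details are specific to the repository scripts.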
CUDA_VISIBLE_DEVICES=0 python3 IRES_UTRLM_EA.py --prefix VCIP --wt_seq CAGCCTCGGCCAGGAGGCGACCCGGGCGCCTGGGTGTGTGGCTGCTGTTGCGGGACGTCTTCGCGGGGCGGGAGGCTCGCGCCGCAGCCAGCGCC --n_mut 10 --mutate2stronger --mutation_type mut_random
CUDA_VISIBLE_DEVICES=0 python3 IRES_RNAFM_EA.py --prefix VCIP --wt_seq CAGCCTCGGCCAGGAGGCGACCCGGGCGCCTGGGTGTGTGGCTGCTGTTGCGGGACGTCTTCGCGGGGCGGGAGGCTCGCGCCGCAGCCAGCGCC --n_mut 10 --mutate2stronger --mutation_type mut_random
cd ./Script
CUDA_VISIBLE_DEVICES=0 python3 IRESLM_Predict_IRESEA_mutations.py --data_file ../IRES_EA/vJul3_UTRLM_Design_3619_random_Mut10Sites_StrongerTrue_MutByProbFalse.csv --column_name MT --prefix vJul12_vJul3_UTRLM_Design_3619_random_Mut10Sites_StrongerTrue_MutByProbFalse
IRES-DM employs a conditional diffusion model for de novo generation of functional IRES sequences.
cd ./Script
CUDA_VISIBLE_DEVICES=0 python3 IRES_DM.py --only_ires --prefix IRESDM_RewardGuided --epochs 150 --cls_model utrlm
Namespace(device='0', prefix='IRESDM_RewardGuided', epochs=150, image_size=200, only_ires=True, batch_size=64, timesteps=50, beta_scheduler='linear', learning_rate=0.0001, global_seed=42, ema_beta=0.995, n_steps=10, n_samples=1000, save_and_sample_every=1, epochs_loss_show=5, time_warping=True, truncate_remain_right=True, utrlm_modelfile='../models/IRES_UTRLM_best_model', rnafm_modelfile='../models/IRES_RNAFM_best_model', cls_model='utrlm')
./results/IRESDM_RewardGuided_CLSutrlm_Gene_Class1_correctNum6178_ep149.csv
./models/IRESDM_RewardGuided_CLSutrlm_best_model.pt (available from the link)
cd ./Script
CUDA_VISIBLE_DEVICES=0 python3 IRES_DM.py --only_ires --prefix IRESDM_DirectedTraining --epochs 150 --cls_model ''
Namespace(device='0', prefix='IRESDM_DirectedTraining', epochs=150, image_size=200, only_ires=True, batch_size=64, timesteps=50, beta_scheduler='linear', learning_rate=0.0001, global_seed=42, ema_beta=0.995, n_steps=10, n_samples=1000, save_and_sample_every=1, epochs_loss_show=5, time_warping=True, truncate_remain_right=True, utrlm_modelfile='../models/IRES_UTRLM_best_model', rnafm_modelfile='../models/IRES_RNAFM_best_model', cls_model='')
./results/IRESDM_DirectedTraining_Gene_Class1_correctNum6076_ep149.csv
./models/IRESDM_DirectedTraining_best_model.pt (available from the link)
Run the following notebooks to filter generated sequences:
./Script/IRESDM_Directed_Training_Filtering_Generated_Sequences.ipynb
./Script/IRESDM_Reward_Guided_Filtering_Generated_Sequences.ipynb
Filtering criteria:
- G/C content between 40% and 60%
- Absence of consecutive repeat nucleotides
- Lengths ranging from 160 nt to 200 nt
- IRES predicted probability exceeding 90%
./results/FilterV1_IRESDM_RewardGuided_CLSutrlm_Gene_Class1_correctNum6178_ep149.csv
./results/FilterV1_IRESDM_DirectedTraining_Gene_Class1_correctNum6076_ep149.csv
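The four filtering criteria listed above could be expressed roughly as in the sketch below. It is an illustration, not the notebooks' code: the function names are hypothetical, and because the criteria do not state how long a repeat must be to count as "consecutive repeat nucleotides", the `max_run` parameter (default 3) is an assumption to adjust as needed.

```python
import re


def gc_content(seq):
    """Fraction of G or C nucleotides in the sequence."""
    return sum(nt in "GC" for nt in seq.upper()) / len(seq)


def has_long_run(seq, max_run=3):
    """True if any nucleotide repeats more than max_run times in a row."""
    # (.)\1{max_run,} matches one character followed by at least max_run copies of it
    return re.search(r"(.)\1{%d,}" % max_run, seq.upper()) is not None


def passes_filters(seq, ires_prob, max_run=3):
    """Apply the length, G/C content, repeat, and predicted-probability criteria."""
    return (
        160 <= len(seq) <= 200                 # length between 160 nt and 200 nt
        and 0.40 <= gc_content(seq) <= 0.60    # G/C content between 40% and 60%
        and not has_long_run(seq, max_run)     # no long homopolymer runs (assumed meaning)
        and ires_prob > 0.90                   # IRES predicted probability above 90%
    )


candidate = "ACGT" * 45  # toy 180-nt sequence with 50% G/C content
keep = passes_filters(candidate, ires_prob=0.95)
```

The length check comes first so that an empty sequence is rejected before `gc_content` would divide by zero.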
Run the following notebook to select 2,000 sequences with minimal structural similarity to natural IRESs:
./Script/Combined_Filtered_Generated_Sequences_From_Two_Strategies.ipynb
Combined sequence file:
./results/IRESDM_TwoStrategyCombined_Class1_GeneNum2000.csv
Run the following notebook to find, for each experimentally validated IRES in the IRESsite database, the most similar IRES-DM-generated sequence:
./Script/FindSimilarSS_AmongIRESFrom_IRESsite_IRESDM.ipynb
Results file:
IRESsite_shorter300bp_similarGenSeq.csv
The indices in the following columns:
- RewardGuided_NearestSS_IRESsite_IRES
- RewardGuided_NearestSeq_IRESsite_IRES
- RewardGuided_MinSS_IRESsite_IRES
- RewardGuided_MinSeq_IRESsite_IRES
- DirectedTraining_NearestSS_IRESsite_IRES
- DirectedTraining_NearestSeq_IRESsite_IRES
- DirectedTraining_MinSS_IRESsite_IRES
- DirectedTraining_MinSeq_IRESsite_IRES
can be linked to the first column of each of these files:
IRESDM_DirectedTraining_Gene_Class1_correctNum6076_Length25+_ep149_SSMFE.csv
IRESDM_RewardGuided_CLSutrlm_Gene_Class1_correctNum6178_Length25+_ep149_SSMFE.csv
to find detailed information, including sequences.
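The lookup described above might be done along these lines. This sketch uses only the standard library and toy in-memory data; the column names `IRES_name`, `idx`, `sequence`, and `MFE` are hypothetical stand-ins, since only the index column names are documented here.

```python
import csv
import io


def index_by_first_column(csv_text):
    """Map each row's first-column value to the full row (as a dict)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    key = reader.fieldnames[0]  # the first column holds the generated-sequence index
    return {row[key]: row for row in reader}


# Toy stand-ins for the results file and one *_SSMFE.csv detail file
similar_csv = "IRES_name,RewardGuided_NearestSS_IRESsite_IRES\nBiP,12\n"
detail_csv = "idx,sequence,MFE\n12,ACGT,-1.5\n13,GGCC,-2.0\n"

details = index_by_first_column(detail_csv)
matches = []
for row in csv.DictReader(io.StringIO(similar_csv)):
    hit = details[row["RewardGuided_NearestSS_IRESsite_IRES"]]
    matches.append((row["IRES_name"], hit["sequence"]))
```

For the real files, the same function can be applied to the text read from disk; RewardGuided columns should be resolved against the RewardGuided detail file and DirectedTraining columns against the DirectedTraining one.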
The secondary structures are stored in *.zip files (available from the link).
For implementation, we trained both strategies for 150 epochs on a single NVIDIA Tesla GPU with 32GB memory. The reward-guided approach, which leveraged the IRES-UTRLM classifier for reward calculation, required approximately 352 seconds per epoch, while the directed training approach needed only 225 seconds per epoch. Both strategies initially generated 9,172 sequences to match the scale of our training dataset.
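As a rough guide to total training time, the per-epoch figures above can be multiplied out. The estimate below assumes a constant per-epoch time over all 150 epochs and ignores any extra overhead, so it is only approximate.

```python
EPOCHS = 150
SECONDS_PER_EPOCH = {"reward-guided": 352, "directed training": 225}

# Total wall-clock estimate in hours for each strategy
total_hours = {name: EPOCHS * sec / 3600 for name, sec in SECONDS_PER_EPOCH.items()}

for name, hours in total_hours.items():
    print(f"{name}: about {hours:.1f} h")
```

Under these assumptions, reward-guided training takes a little under 15 hours and directed training a little over 9 hours.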