TrPLet: Cancer Dependency Prediction from RNA-Seq

Transcriptional Prediction of Lethality (TrPLet)

If used, please cite the most recent version of the paper that is available on bioRxiv:

"A framework for target discovery in rare cancers" (B. Li* & Sadagopan* et al.), bioRxiv 2024

Summary

This repo provides the scripts and workflow to predict cancer dependency scores from tumor or cell-line RNA-seq data for a subset of highly predictable genes (N=6283). Although you can predict dependency scores for all genes, the accuracy will be substantially lower since most genetic dependencies are not predictable from RNA-seq data alone. The most general workflow involves:

1. Generate or download isoform-level* RNA count data (e.g. RNA fastq -> bam -> counts using STAR/RSEM)
2. Merge your data with a large RNA-seq dataset (cell lines: DepMap/CCLE; tumors: TCGA)
3. Batch correct your data if needed (read the Batch Correction section before deciding)
4. Normalize RNA-seq counts
   - Calculate transcripts per million (TPM) for each isoform
   - Calculate gene TPM by summing isoform TPM per gene
   - Convert TPM to log₂(TPM+1) so that expression values are approximately normally distributed
   - Z-score each feature (i.e. the expression of each gene)
5. Reduce dimensionality (subset the train+test data to the top M features with the highest |Pearson correlation coefficient| to the dependency being predicted in the train data; by default M=5000)
6. Predict dependencies on your sample's normalized RNA-seq data

*gene level count data can also be used
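
The normalization step of the workflow (TPM, gene-level sum, log₂ transform, z-score) can be sketched roughly as follows. This is a minimal illustration with pandas, not the repository's step4_normalize.py; the function names, the isoforms x samples layout, and the `lengths_bp` and `isoform_to_gene` inputs are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def isoform_tpm(counts, lengths_bp):
    """TPM per isoform. counts: isoforms x samples; lengths_bp: isoform lengths (assumed input)."""
    rate = counts.div(lengths_bp / 1e3, axis=0)        # reads per kilobase
    return rate.div(rate.sum(axis=0), axis=1) * 1e6    # scale each sample to one million

def gene_log_tpm(tpm, isoform_to_gene):
    """Sum isoform TPM per gene, then apply log2(TPM + 1)."""
    gene_tpm = tpm.groupby(isoform_to_gene).sum()      # Series maps isoform ID -> gene ID
    return np.log2(gene_tpm + 1)

def zscore_genes(log_tpm):
    """Z-score each gene (row) across samples; zero-variance genes become NaN."""
    mean = log_tpm.mean(axis=1)
    std = log_tpm.std(axis=1, ddof=0).replace(0, np.nan)
    return log_tpm.sub(mean, axis=0).div(std, axis=0)
```

Chaining the three functions on a merged (and, if needed, batch-corrected) count matrix gives a genes x samples matrix of z-scores.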

The model is trained on the entirety of DepMap and then applied to your dataset. To assess model performance, we used 5-fold cross-validation across DepMap.
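
The dimensionality-reduction step described in the workflow (keeping the top M features by |Pearson correlation| with the dependency in the training data) could look something like the sketch below. It assumes NumPy arrays with samples in rows and genes in columns; `top_m_features` is a hypothetical helper name, not a function from this repository.

```python
import numpy as np

def top_m_features(X_train, y_train, m=5000):
    """Indices of the m columns of X_train with the largest |Pearson r| to y_train."""
    Xc = X_train - X_train.mean(axis=0)   # center each feature
    yc = y_train - y_train.mean()         # center the dependency scores
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        r = (Xc.T @ yc) / denom
    r = np.nan_to_num(r)                  # constant features get r = 0
    m = min(m, X_train.shape[1])
    return np.argsort(-np.abs(r), kind="stable")[:m]
```

The correlations are computed on the training data only; the resulting indices are then used to subset both matrices, e.g. `X_train[:, idx]` and `X_test[:, idx]`.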

Workflow Considerations

The workflow differs slightly depending on input data. There are three major considerations:

Cell Line or Tumor RNA-Seq Data

If predicting on TCGA tumor RNA-seq, you can use gene log₂(TPM+1), which can be calculated from count data here: https://osf.io/gqrz9/files/osfstorage. Z-score the expression of each gene and continue at step 5.
If predicting on non-TCGA tumor RNA-seq, you should merge your data with TCGA (batch correction is likely necessary). If batch correcting, merge the external dataset with TCGA isoform-level count data (available here: https://osf.io/gqrz9/files/osfstorage) and continue at step 2 of the workflow.
If predicting on cell line RNA-seq, you can use gene log₂(TPM+1) (available here: https://depmap.org/portal/data_page/?tab=currentRelease). Z-score the expression of each gene using the mean/standard deviation from DepMap, and continue at step 5. For this use case, batch correction usually isn't necessary. If it is required, merge with CCLE isoform-level counts (available here: https://osf.io/gqrz9/files/osfstorage) and continue at step 2 of the workflow.
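
For the cell line case, z-scoring with the DepMap mean/standard deviation (rather than statistics from your own samples) might be sketched as follows. The samples x genes orientation and the helper name `zscore_with_reference` are assumptions for illustration.

```python
import pandas as pd

def zscore_with_reference(new_log_tpm, ref_log_tpm):
    """Z-score new samples using per-gene mean/std from a reference (e.g. DepMap).

    Both inputs are assumed to be samples x genes DataFrames of log2(TPM+1).
    Only genes present in both are kept; zero-variance genes yield NaN.
    """
    genes = ref_log_tpm.columns.intersection(new_log_tpm.columns)
    mean = ref_log_tpm[genes].mean(axis=0)
    std = ref_log_tpm[genes].std(axis=0, ddof=0)
    std = std.where(std > 0)              # avoid dividing by zero
    return (new_log_tpm[genes] - mean) / std
```

Using the reference statistics keeps the new samples on the same scale as the data the model was trained on.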

Batch Correction

In general, we noticed that cell line RNA-seq often does not require batch correction with CCLE, while tumor RNA-seq almost always requires batch correction with TCGA (based on t-SNE analysis). We recommend batch correcting with ComBat-seq, using lineage as a covariate. Choose the TCGA or CCLE lineage that most closely matches your sample. TCGA lineages are available here: https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations. CCLE lineages are given by the string following the underscore "_" in the cell line name, as listed in the sample_info.csv file (https://depmap.org/portal/data_page/?tab=allData).
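
As a small illustration of the CCLE naming convention described above, the lineage can be pulled out of a cell line name by splitting on the first underscore. The helper name `ccle_lineage` is hypothetical, and the names in the usage note are only illustrative.

```python
def ccle_lineage(ccle_name):
    """Return the lineage part of a CCLE-style name, or None if there is no underscore."""
    _, sep, lineage = ccle_name.partition("_")   # split on the first underscore only
    return lineage if sep else None
```

For example, `ccle_lineage("A549_LUNG")` gives `"LUNG"`; lineages that themselves contain underscores are returned whole because only the first underscore is used as the separator.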

Computing Isoform-Level Counts

We recommend using STAR/RSEM. We have not tested other quantification methods, though they may also work.
