SViswanathanLab/TrPLet

TrPLet: Cancer Dependency Prediction from RNA-Seq

Transcriptional Prediction of Lethality (TrPLet)

If used, please cite the most recent version of the paper that is available on bioRxiv:

"A framework for target discovery in rare cancers" (B. Li* & Sadagopan* et al.), bioRxiv 2024

Summary

This repo provides the scripts and workflow to predict cancer dependency scores from tumor or cell-line RNA-seq data for a subset of highly predictable genes (N=6283). Although you can predict dependency scores for all genes, the accuracy will be substantially lower since most genetic dependencies are not predictable from RNA-seq data alone. The most general workflow involves:

  1. Generate/download isoform-level* RNA count data (e.g. RNA fastq -> bam -> counts using STAR/RSEM)
  2. Merge your data with a large RNA-seq dataset (cell lines: DepMap/CCLE, tumors: TCGA)
  3. Batch correct your data if needed (read the Batch Correction section if you are considering this)
  4. Normalize RNA-seq counts
     • Calculate transcripts per kilobase million (TPM) for each isoform
     • Calculate gene TPM by summing isoform TPM per gene
     • Convert TPM to log2(TPM+1) to generate log-normal distributions
     • Z-score each feature (i.e. expression of each gene)
  5. Reduce dimensionality (subset the train+test data to the top M features with the highest |Pearson correlation coefficient| to the dependency being predicted in the train data; by default M=5000)
  6. Predict dependencies on your sample's normalized RNA-seq data

*gene level count data can also be used
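
The normalization in step 4 can be sketched as follows. This is a minimal illustration on toy data, not the repository's script: the function names, the isoform-to-gene mapping, and the isoform lengths are all hypothetical (in practice the lengths would come from your quantifier's output, e.g. the effective length reported by RSEM). Matrices here are isoforms/genes in rows and samples in columns.

```python
import numpy as np

def counts_to_tpm(counts, lengths_bp):
    """Convert isoform counts (n_isoforms x n_samples) to TPM.

    lengths_bp holds one length per isoform (assumed input, not from the README).
    """
    rate = counts / lengths_bp[:, None]               # reads per base, per isoform
    return rate / rate.sum(axis=0, keepdims=True) * 1e6

def sum_by_gene(isoform_tpm, gene_ids):
    """Sum isoform TPM rows that share a gene ID."""
    genes, inverse = np.unique(gene_ids, return_inverse=True)
    gene_tpm = np.zeros((genes.size, isoform_tpm.shape[1]))
    np.add.at(gene_tpm, inverse, isoform_tpm)         # accumulate rows per gene
    return genes, gene_tpm

def zscore_rows(x):
    """Z-score each gene (row) across samples; constant genes become 0."""
    mu = x.mean(axis=1, keepdims=True)
    sd = x.std(axis=1, keepdims=True)
    sd[sd == 0] = 1.0                                 # avoid division by zero
    return (x - mu) / sd

# Toy data: 3 isoforms, 2 samples (values are made up for illustration).
counts = np.array([[10.0, 0.0], [30.0, 20.0], [60.0, 80.0]])
lengths_bp = np.array([1000.0, 1500.0, 2000.0])
gene_ids = np.array(["G1", "G1", "G2"])

isoform_tpm = counts_to_tpm(counts, lengths_bp)
genes, gene_tpm = sum_by_gene(isoform_tpm, gene_ids)
log_tpm = np.log2(gene_tpm + 1.0)
z = zscore_rows(log_tpm)
```

In the full workflow the z-scoring would be applied to the merged (and, if needed, batch-corrected) matrix rather than to your samples alone.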

The model is trained on the entirety of DepMap and tested on your dataset. When assessing model performance, we used 5-fold cross-validation across DepMap.
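
Because the model is fit on DepMap and applied to your data, the feature selection in step 5 is computed on the training data only and then applied to both sets. A minimal sketch on synthetic data, with samples in rows and genes in columns; `top_m_features` and all variable names are hypothetical, and the predictive model itself is not shown:

```python
import numpy as np

def top_m_features(X_train, y_train, m=5000):
    """Return indices of the m columns of X_train with the largest
    |Pearson r| to y_train (one dependency score per training sample)."""
    Xc = X_train - X_train.mean(axis=0)
    yc = y_train - y_train.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        r = (Xc.T @ yc) / denom
    r = np.nan_to_num(r)                  # constant features get r = 0
    m = min(m, X_train.shape[1])
    return np.argsort(-np.abs(r))[:m]

# Synthetic example: the dependency is driven by genes 3 and 7.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 20))      # stand-in for DepMap expression z-scores
y_train = X_train[:, 3] + X_train[:, 7] + 0.1 * rng.normal(size=200)
X_test = rng.normal(size=(5, 20))         # stand-in for your samples

keep = top_m_features(X_train, y_train, m=2)
X_train_sel = X_train[:, keep]            # same columns for train and test
X_test_sel = X_test[:, keep]
```

Selecting features from the training data alone keeps information from the test samples out of the model.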

Workflow Considerations

The workflow differs slightly depending on input data. There are three major considerations:

Cell Line or Tumor RNA-Seq Data

  • If predicting on TCGA tumor RNA-seq, you can use gene log2(TPM+1), which can be calculated from count data here: https://osf.io/gqrz9/files/osfstorage. Z-score the expression of each gene and continue at step 5.
  • If predicting on Non-TCGA tumor RNA-seq, you should merge your data with TCGA (batch correction is likely necessary); if batch correcting, you will merge the external dataset with TCGA isoform-level count data (available here: https://osf.io/gqrz9/files/osfstorage) and continue at step 2 of the workflow.
  • If predicting on Cell Line RNA-seq, you can use gene log2(TPM+1) (available here: https://depmap.org/portal/data_page/?tab=currentRelease). Z-score the expression of each gene using the mean/standard deviation from DepMap, and continue at step 5. For this use case, batch correction usually isn't necessary. If it is required, merge with CCLE isoform-level counts (available here: https://osf.io/gqrz9/files/osfstorage) and continue at step 2 of the workflow.
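
For the cell line case, z-scoring against the DepMap mean/standard deviation might look like the sketch below. It assumes both matrices hold gene log2(TPM+1) values with samples in rows and the same genes in the same column order; the function name and toy values are hypothetical.

```python
import numpy as np

def zscore_with_reference(sample_log_tpm, ref_log_tpm):
    """Z-score new samples using per-gene mean/SD from a reference matrix."""
    mu = ref_log_tpm.mean(axis=0)
    sd = ref_log_tpm.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)       # guard against constant genes
    return (sample_log_tpm - mu) / sd

# Toy reference (3 cell lines x 2 genes) and one new sample.
ref = np.array([[1.0, 4.0], [3.0, 4.0], [5.0, 4.0]])
new = np.array([[3.0, 5.0]])
z_new = zscore_with_reference(new, ref)
```

Using the reference statistics, rather than the new sample's own, keeps the new data on the scale the model was trained on.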

Batch Correction

In general, we noticed that cell line RNA-seq often does not require batch correction with CCLE, while tumor RNA-seq almost always requires batch correction with TCGA (based on tSNE analysis). We recommend batch correcting with ComBat-seq, using lineage as a covariate. Choose the TCGA or CCLE lineage that most closely matches your sample. TCGA lineages are available here: https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations; CCLE lineages are indicated by the string following the underscore "_" in the cell line name, as listed in the sample_info.csv file (https://depmap.org/portal/data_page/?tab=allData).
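
As a small illustration of the naming convention described above, the lineage can be pulled from a CCLE-style cell line name by splitting at the first underscore. The helper name is hypothetical, the example names are illustrative, and splitting at the first underscore is an assumption about how multi-word lineages are encoded.

```python
def ccle_lineage(cell_line_name):
    """Return the part of a CCLE-style name after the first underscore."""
    _, sep, lineage = cell_line_name.partition("_")
    if not sep or not lineage:
        raise ValueError(f"no lineage suffix in {cell_line_name!r}")
    return lineage

lineages = [ccle_lineage(n) for n in
            ["A549_LUNG", "HEL_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE"]]
```

The resulting labels can then serve as the lineage covariate when setting up batch correction.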

Computing Isoform-Level Counts

We recommend using STAR/RSEM. We have not tested other quantification methods, though they may also work.
