Predicting Impact of Coding Variants

For scoring the impact of coding variants, we devised a metapredictor model that utilizes impact prediction scores from 12 different tools: SIFT, PolyPhen-2 (HumDiv scores), LRT, MutationTaster, Mutation Assessor, FATHMM, GERP++, PhyloP, CADD, VEST, SiPhy and DANN. Annotations for all these tools are available in dbNFSP via ANNOVAR. We provide with the package 2 example (shortened) ANNOVAR outputs (see next sections):

library(driveR)
path2annovar_csv <- system.file("extdata/example.hg19_multianno.csv",
                                 package = "driveR")

We can calculate impact scores for all coding variants in this ANNOVAR file via predict_coding_impact():

metaprediction_df <- predict_coding_impact(annovar_csv_path = path2annovar_csv)
head(metaprediction_df)
#>       gene_symbol metaprediction_score
#> 531         MUTYH                1.000
#> 2144        AMPD3                1.000
#> 6609         RPIA                1.000
#> 8510        GPAT3                1.000
#> 10040        GUSB                1.000
#> 854      SELENBP1                0.998

By default, predict_coding_impact() keeps only the maximal score per gene. This behavior can be altered by setting keep_highest_score = FALSE:

metaprediction_df <- predict_coding_impact(annovar_csv_path = path2annovar_csv, 
                                           keep_highest_score = FALSE)

Also by default, predict_coding_impact() keeps only the first gene symbol for genes where multiple symbols were annotated. This behavior can be altered by setting keep_single_symbol = FALSE:

metaprediction_df <- predict_coding_impact(annovar_csv_path = path2annovar_csv, 
                                           keep_single_symbol = FALSE)

The default na.string (string that was used to indicate when a score is not available during annotation with ANNOVAR) is “.”, this can be altered via:

metaprediction_df <- predict_coding_impact(annovar_csv_path = path2annovar_csv, 
                                           na.string = "NA")

Driver Gene Prioritization - Personalized Analysis

Below, a step-by-step work flow for driveR for an individual tumor sample is provided. Note that some steps require operations outside of R. The example data provided within the package are for a lung adenocarcinoma patient studied in Imielinski et al. ¹

Load the package and prepare input data

library(driveR)

As input, create_features_df(), the function used to create the features table for driver gene prioritization, requires the following:

annovar_csv_path: the /path/to/ANNOVAR/csv/output/file
scna_df: a data frame containing SCNA segments, containing the columns “chr”, “start”, “end” and “log2ratio”
phenolyzer_annotated_gene_list_path: the /path/to/phenolyzer/annotated_gene_list/output/file

ANNOVAR CSV output

We first run ANNOVAR. An example command provided below:

table_annovar.pl example.avinput ~/annovar/humandb/ -buildver hg19 -out /path/to/output -remove -protocol refGene,cytoBand,exac03,avsnp150,dbnsfp30a,cosmic92_coding,cosmic92_noncoding -operation gx,r,f,f,f,f,f -nastring . -csvout -polish

The required filters are, as listed in the above command, refGene,cytoBand,exac03,avsnp147,dbnsfp30a,cosmic91_coding,cosmic91_noncoding.

With the package, an example (modified) ANNOVAR csv output file is available:

path2annovar_csv <- system.file("extdata/example.hg19_multianno.csv",
                                 package = "driveR")

SCNA table

Next, we prepare a SCNA data frame (GRCh37 is required). Again, an example data frame is provided within the package:

head(example_scna_table)
#>   chr   start     end log2ratio
#> 1   1       1    9001 0.8875253
#> 2   1    9002   10001 0.6959938
#> 3   1   10002 3695599 0.3785116
#> 4   1 3695600 4715896 0.4646683
#> 5   1 4715897 4717788 0.4005379
#> 6   1 4717789 5798840 0.4854268

phenolyzer output

Finally, for the phenolyzer annotated_gene_list output file, we first obtain the genes to be scored using create_features_df and setting prep_phenolyzer_input=TRUE:

phenolyzer_genes <- create_features_df(annovar_csv_path = path2annovar_csv,
                                       scna_df = example_scna_table,
                                       prep_phenolyzer_input = TRUE)

Next, we save these genes to be used as input for phenolyzer:

write.table(x = data.frame(gene = phenolyzer_genes), 
            file = "input_genes.txt", 
            row.names = FALSE, col.names = FALSE, quote = FALSE)

We create another text file, named phenolyzer_disease.txt, containing the phenotype (i.e. cancer type. in this case “lung adenocarcinoma”) to be used for scoring with phenolyzer. An example command for running phenolyzer is provided below:

perl ~/phenolyzer/disease_annotation.pl -f phenolyzer_disease.txt -prediction -phenotype -logistic --gene input_genes.txt -out phenolyzer_out/example

This should produce, among other outputs, the annotated_gene_list output file. An example annotated_gene_list output file is provided within the package:

path2phenolyzer_out <- system.file("extdata/example.annotated_gene_list",
                                    package = "driveR")

Create the features data frame

After creating the necessary input data (as detailed above), we simply run create_features_df() to obtain the features data frame:

features_df <- create_features_df(annovar_csv_path = path2annovar_csv,
                                  scna_df = example_scna_table, 
                                  phenolyzer_annotated_gene_list_path = path2phenolyzer_out)

Prioritize cancer driver genes

Finally, we can prioritize cancer driver genes using prioritize_driver_genes(). For this function, cancer_type can be of the short names for the 21 different cancer types that was used to train the multi-task learning model utilized in this package:

knitr::kable(MTL_submodel_descriptions)

short_name	description
BLCA	Bladder Urothelial Cancer
BRCA	Breast Cancer
CESC	Cervical Squamous Cell Carcinoma
COAD	Colon Adenocarcinoma
GBM	Brain Glioblastoma Multiforme
HNSC	Head and Neck Squamous Cell Carcinoma
KIRC	Kidney Renal Clear Cell Carcinoma
KIRP	Kidney Renal Papillary Cell Carcinoma
LAML	Acute Myeloid Leukemia
LGG	Brain Lower Grade Glioma
LIHC	Liver Hepatocellular carcinoma
LUAD	Lung Adenocarcinoma
LUSC	Lung Squamous Cell Carcinoma
OV	Ovarian Serous Cystadenocarcinoma
PAAD	Pancreatic Cancer
PRAD	Prostate Adenocarcinoma
READ	Rectum Adenocarcinoma
SKCM	Skin Cutaneous melanoma
STAD	Gastric Adenocarcinoma
THCA	Head and Neck Thyroid Carcinoma
UCEC	Uterine Corpus Endometrial Carcinoma

Below is the example run for the lung adenocarcinoma patient:

driver_prob_df <- prioritize_driver_genes(features_df = features_df,
                                          cancer_type = "LUAD")
head(driver_prob_df, 10)
#>      gene_symbol driverness_prob prediction
#> 842         TP53       0.9614544     driver
#> 3211       CCND3       0.8604087     driver
#> 3654        EGFR       0.7485849 non-driver
#> 1966       EP300       0.5799741 non-driver
#> 679          ATM       0.5753917 non-driver
#> 510          KDR       0.5012796 non-driver
#> 386       IFNA10       0.4526894 non-driver
#> 740         IL7R       0.4271572 non-driver
#> 3572      PIK3R2       0.3780803 non-driver
#> 2300         ATR       0.3666499 non-driver

The function returns a data frame of genes with probabilities of being cancer driver genes (using the selected sub-model of a multi-task learning model trained using 21 different cancer types) and the prediction of driver genes based on a cancer type specific probability threshold. In the above example, the top 10 genes contain recognizable cancer driver genes.

Driver Gene Prioritization - Batch Analysis

Below, a step-by-step work flow for driveR for a cohort of tumor samples is provided. Note that some steps require operations outside of R. The example data provided within the package are for 10 randomly-selected samples from The Cancer Genome Atlas (TCGA) program’s LAML (Acute Myeloid Leukemia) cohort.

Load the package and prepare input data

library(driveR)

As before, as input, create_features_df(), the function used to create the features table for driver gene prioritization, requires the following:

annovar_csv_path: the /path/to/ANNOVAR/csv/output/file
scna_df: a data frame containing SCNA segments, containing the columns “chr”, “start”, “end”, “log2ratio” and “tumor_id”
phenolyzer_annotated_gene_list_path: the /path/to/phenolyzer/annotated_gene_list/output/file

ANNOVAR CSV output

The required filters are again, as listed in the example ANNOVAR command above, refGene,cytoBand,exac03,avsnp147,dbnsfp30a,cosmic91_coding,cosmic91_noncoding. Additionally, for cohort-level analysis, a column named tumor_id is required, containing tumor id of each variant.

With the package, an example cohort-level (modified) ANNOVAR csv output file is available:

path2annovar_csv <- system.file("extdata/example_cohort.hg19_multianno.csv",
                                 package = "driveR")

SCNA table

Next, we prepare a SCNA data frame (GRCh37 is required). Again, an example data frame is provided within the package:

head(example_cohort_scna_table)
#>       chr     start       end log2ratio tumor_id
#> 29597  22  25922342  42892582   -0.0085  DO21131
#> 29598  15  20581451  20586957    0.8688  DO21131
#> 29599   1  72768418  72809431    0.3540  DO21131
#> 29600  13  58814227  62806981    0.0023  DO21131
#> 29601   2 208354955 208359336   -0.8954  DO21131
#> 29602   5 117388963 120417166   -0.0114  DO21131

Again, for cohort-level analysis, it’s crucial to have a column named tumor_id is required, containing tumor id for each SCNA segment.

phenolyzer output

Finally, for the phenolyzer annotated_gene_list output file, we first obtain the genes to be scored using create_features_df and setting prep_phenolyzer_input=TRUE:

phenolyzer_genes <- create_features_df(annovar_csv_path = path2annovar_csv,
                                       scna_df = example_cohort_scna_table,
                                       prep_phenolyzer_input = TRUE,
                                       batch_analysis = TRUE)

Next, we save these genes to be used as input for phenolyzer:

write.table(x = data.frame(gene = phenolyzer_genes), 
            file = "input_genes.txt", 
            row.names = FALSE, col.names = FALSE, quote = FALSE)

We create another text file, named phenolyzer_disease.txt, containing the phenotype (i.e. cancer type. in this case “acute myeloid leukemia”) to be used for scoring with phenolyzer. An example command for running phenolyzer is provided below:

perl ~/phenolyzer/disease_annotation.pl -f phenolyzer_disease.txt -prediction -phenotype -logistic --gene input_genes.txt -out phenolyzer_out/example

This should produce, among other outputs, the annotated_gene_list output file. An example annotated_gene_list output file for the cohort-level data is provided within the package:

path2phenolyzer_out <- system.file("extdata/example_cohort.annotated_gene_list",
                                    package = "driveR")

Create the features data frame

After creating the necessary input data (as detailed above), we simply run create_features_df() to obtain the features data frame. Notice that we set batch_analysis = TRUE. When batch_analysis = TRUE, both the ANNOVAR output and the SCNA table should have a column named ‘tumor_id’.

features_df <- create_features_df(annovar_csv_path = path2annovar_csv,
                                  scna_df = example_cohort_scna_table, 
                                  phenolyzer_annotated_gene_list_path = path2phenolyzer_out,
                                  batch_analysis = TRUE)

Prioritize cancer driver genes

Finally, we can prioritize cancer driver genes using prioritize_driver_genes().

Below is the example run for the acute myeloid leukemia cohort:

driver_prob_df <- prioritize_driver_genes(features_df = features_df,
                                          cancer_type = "LAML")
head(driver_prob_df, 10)
#>     gene_symbol driverness_prob prediction
#> 49         FLT3       0.9922470     driver
#> 5          TP53       0.9569072     driver
#> 291       EP300       0.9451127     driver
#> 338      CDKN2A       0.5800416 non-driver
#> 11        SMC1A       0.3474567 non-driver
#> 258      ANAPC4       0.2596510 non-driver
#> 311      IFNGR1       0.2538748 non-driver
#> 67         MDM4       0.2396303 non-driver
#> 324       TRAF3       0.2383674 non-driver
#> 359      PIEZO1       0.2371796 non-driver

Once again, in the above example, the top 10 genes contain recognizable cancer driver genes.

Imielinski M, Greulich H, Kaplan B, et al. Oncogenic and sorafenib-sensitive ARAF mutations in lung adenocarcinoma. J Clin Invest. 2014;124(4):1582-6.↩︎

How to use driveR

Overview

Predicting Impact of Coding Variants

Driver Gene Prioritization - Personalized Analysis

Load the package and prepare input data

ANNOVAR CSV output

SCNA table

phenolyzer output

Create the features data frame

Prioritize cancer driver genes

Driver Gene Prioritization - Batch Analysis

Load the package and prepare input data

ANNOVAR CSV output

SCNA table

phenolyzer output

Create the features data frame

Prioritize cancer driver genes