Overview of Data Retrieval Workflow

Get Study Metadata

Now that we are successfully connected, we may want to view all studies available for our chosen database to find the correct study_id corresponding to the data we want to pull. All studies have a unique identifier in the database. You can view all studies available in your database with the following:

all_studies <- available_studies()
all_studies
#> # A tibble: 348 × 13
#>    studyId         name  description publicStudy groups status importDate allSampleCount readPermission cancerTypeId
#>    <chr>           <chr> <chr>       <lgl>       <chr>   <int> <chr>               <int> <lgl>          <chr>       
#>  1 acc_tcga        Adre… "TCGA Adre… TRUE        "PUBL…      0 2022-03-0…             92 TRUE           acc         
#>  2 blca_plasmacyt… Blad… "Whole exo… TRUE        ""          0 2022-03-0…             34 TRUE           blca        
#>  3 bcc_unige_2016  Basa… "Whole-exo… TRUE        "PUBL…      0 2022-03-0…            293 TRUE           bcc         
#>  4 all_stjude_2015 Acut… "Comprehen… TRUE        "PUBL…      0 2022-03-0…             93 TRUE           bll         
#>  5 ampca_bcm_2016  Ampu… "Exome seq… TRUE        "PUBL…      0 2022-03-0…            160 TRUE           ampca       
#>  6 blca_dfarber_m… Blad… "Whole exo… TRUE        "PUBL…      0 2022-03-0…             50 TRUE           blca        
#>  7 blca_mskcc_sol… Blad… "Comprehen… TRUE        "PUBL…      0 2022-03-0…             97 TRUE           blca        
#>  8 blca_bgi        Blad… "Whole-exo… TRUE        "PUBL…      0 2022-03-0…             99 TRUE           blca        
#>  9 blca_mskcc_sol… Blad… "Genomic P… TRUE        "PUBL…      0 2022-03-0…            109 TRUE           blca        
#> 10 all_stjude_2013 Hypo… "Whole gen… TRUE        ""          0 2022-03-0…             44 TRUE           myeloid     
#> # … with 338 more rows, and 3 more variables: referenceGenome <chr>, pmid <chr>, citation <chr>

By inspecting this data frame, we see the unique study_id for the NMIBC data set is "blca_nmibc_2017" and the unique study_id for the prostate cancer data set is "prad_msk_2019". To get more information on our studies we can do the following:

Note: the transpose function t() is just used here to better view results

all_studies %>%
  filter(studyId %in% c("blca_nmibc_2017", "prad_msk_2019"))
#> # A tibble: 2 × 13
#>   studyId         name   description publicStudy groups status importDate allSampleCount readPermission cancerTypeId
#>   <chr>           <chr>  <chr>       <lgl>       <chr>   <int> <chr>               <int> <lgl>          <chr>       
#> 1 blca_nmibc_2017 Nonmu… IMPACT seq… TRUE        PUBLIC      0 2022-03-0…            105 TRUE           blca        
#> 2 prad_msk_2019   Prost… MSK-IMPACT… TRUE        PUBLIC      0 2022-03-0…             18 TRUE           prostate    
#> # … with 3 more variables: referenceGenome <chr>, pmid <chr>, citation <chr>

More in-depth information about the study can be found with get_study_info()

get_study_info("blca_nmibc_2017") %>%
  t()
#>                             [,1]                                                                           
#> name                        "Nonmuscle Invasive Bladder Cancer (MSK Eur Urol 2017)"                        
#> description                 "IMPACT sequencing of 105 High Risk Nonmuscle Invasive Bladder Cancer samples."
#> publicStudy                 "TRUE"                                                                         
#> pmid                        "28583311"                                                                     
#> citation                    "Pietzak et al. Eur Urol 2017"                                                 
#> groups                      "PUBLIC"                                                                       
#> status                      "0"                                                                            
#> importDate                  "2022-03-04 17:49:56"                                                          
#> allSampleCount              "105"                                                                          
#> sequencedSampleCount        "105"                                                                          
#> cnaSampleCount              "105"                                                                          
#> mrnaRnaSeqSampleCount       "0"                                                                            
#> mrnaRnaSeqV2SampleCount     "0"                                                                            
#> mrnaMicroarraySampleCount   "0"                                                                            
#> miRnaSampleCount            "0"                                                                            
#> methylationHm27SampleCount  "0"                                                                            
#> rppaSampleCount             "0"                                                                            
#> massSpectrometrySampleCount "0"                                                                            
#> completeSampleCount         "0"                                                                            
#> readPermission              "TRUE"                                                                         
#> studyId                     "blca_nmibc_2017"                                                              
#> cancerTypeId                "blca"                                                                         
#> cancerType.name             "Bladder Urothelial Carcinoma"                                                 
#> cancerType.dedicatedColor   "Yellow"                                                                       
#> cancerType.shortName        "BLCA"                                                                         
#> cancerType.parent           "bladder"                                                                      
#> cancerType.cancerTypeId     "blca"                                                                         
#> referenceGenome             "hg19"

get_study_info("prad_msk_2019") %>%
  t()
#>                             [,1]                                                             
#> name                        "Prostate Cancer (MSK, Cell Metab 2020)"                         
#> description                 "MSK-IMPACT Sequencing of 18 prostate cancer tumor/normal pairs."
#> publicStudy                 "TRUE"                                                           
#> pmid                        "31564440"                                                       
#> citation                    "Granlund et al. Cell Metab 2020"                                
#> groups                      "PUBLIC"                                                         
#> status                      "0"                                                              
#> importDate                  "2022-03-08 18:48:08"                                            
#> allSampleCount              "18"                                                             
#> sequencedSampleCount        "18"                                                             
#> cnaSampleCount              "18"                                                             
#> mrnaRnaSeqSampleCount       "0"                                                              
#> mrnaRnaSeqV2SampleCount     "0"                                                              
#> mrnaMicroarraySampleCount   "0"                                                              
#> miRnaSampleCount            "0"                                                              
#> methylationHm27SampleCount  "0"                                                              
#> rppaSampleCount             "0"                                                              
#> massSpectrometrySampleCount "0"                                                              
#> completeSampleCount         "0"                                                              
#> readPermission              "TRUE"                                                           
#> studyId                     "prad_msk_2019"                                                  
#> cancerTypeId                "prostate"                                                       
#> cancerType.name             "Prostate"                                                       
#> cancerType.dedicatedColor   "Cyan"                                                           
#> cancerType.shortName        "PROSTATE"                                                       
#> cancerType.parent           "tissue"                                                         
#> cancerType.cancerTypeId     "prostate"                                                       
#> referenceGenome             "hg19"

Lastly, it is important to know what genomic data is available for our studies. Not all studies in your database will have data available on all types of genomic information. For example, it is common for studies not to provide data on fusions.

We can check available genomic data with available_profiles().

available_profiles(study_id = "blca_nmibc_2017")
#> # A tibble: 3 × 8
#>   molecularAlterationType datatype name           description showProfileInAn… patientLevel molecularProfil… studyId
#>   <chr>                   <chr>    <chr>          <chr>       <lgl>            <lgl>        <chr>            <chr>  
#> 1 COPY_NUMBER_ALTERATION  DISCRETE Putative copy… Copy Numbe… TRUE             FALSE        blca_nmibc_2017… blca_n…
#> 2 MUTATION_EXTENDED       MAF      Mutations      Mutation d… TRUE             FALSE        blca_nmibc_2017… blca_n…
#> 3 STRUCTURAL_VARIANT      FUSION   Fusions        Fusions.    TRUE             FALSE        blca_nmibc_2017… blca_n…

available_profiles(study_id = "prad_msk_2019")
#> # A tibble: 3 × 8
#>   molecularAlterationType datatype name           description showProfileInAn… patientLevel molecularProfil… studyId
#>   <chr>                   <chr>    <chr>          <chr>       <lgl>            <lgl>        <chr>            <chr>  
#> 1 COPY_NUMBER_ALTERATION  DISCRETE Putative copy… Putative c… TRUE             FALSE        prad_msk_2019_c… prad_m…
#> 2 MUTATION_EXTENDED       MAF      Mutations      IMPACT468 … TRUE             FALSE        prad_msk_2019_m… prad_m…
#> 3 STRUCTURAL_VARIANT      FUSION   Fusions        Fusion dat… TRUE             FALSE        prad_msk_2019_f… prad_m…

Luckily, in this example our studies have mutation, copy number alteration and fusion data available. Each of these data types has a unique molecular profile ID. The molecular profile ID usually takes the form of <study_id>_mutations, <study_id>_fusion, <study_id>_cna.

available_profiles(study_id = "blca_nmibc_2017") %>%
  pull(molecularProfileId)
#> [1] "blca_nmibc_2017_cna"       "blca_nmibc_2017_mutations" "blca_nmibc_2017_fusion"

Pulling Genomic Data

Now that we have inspected our studies and confirmed the genomic data that is available, we will pull the data into our R environment. We will show two ways to do this:

Using study IDs (get_genetics_by_study())
Using sample ID-study ID pairs (get_genetics_by_sample())

Pulling by study will give us genomic data for all genes/panels included in the study. These functions can only pull data one study ID at a time and will return all genomic data available for that study. Pulling by study ID can be efficient, and a good way to ensure you have all genomic information available in cBioPortal for a particular study.

If you are working across multiple studies, or only need a subset of samples from one or multiple studies, you may chose to pull by sample IDs instead of study ID. When you pull by sample IDs you can pull specific samples across multiple studies, but must also specify the studies they belong to. You may also pass a specific list of genes for which to return information. If you don’t specify a list of genes the function will default to returning all available gene data for each sample.

By Study IDs

To pull by study ID, we can pull each data type individually.


mut_blca <- get_mutations_by_study(study_id = "blca_nmibc_2017")
#> ℹ Returning all data for the "blca_nmibc_2017_mutations" molecular profile in the "blca_nmibc_2017" study
cna_blca<- get_cna_by_study(study_id = "blca_nmibc_2017")
#> ℹ Returning all data for the "blca_nmibc_2017_cna" molecular profile in the "blca_nmibc_2017" study
fus_blca <- get_fusions_by_study(study_id = "blca_nmibc_2017")
#> ℹ Returning all data for the "blca_nmibc_2017_fusion" molecular profile in the "blca_nmibc_2017" study


mut_prad <- get_mutations_by_study(study_id = "prad_msk_2019")
#> ℹ Returning all data for the "prad_msk_2019_mutations" molecular profile in the "prad_msk_2019" study
cna_prad <- get_cna_by_study(study_id = "prad_msk_2019")
#> ℹ Returning all data for the "prad_msk_2019_cna" molecular profile in the "prad_msk_2019" study
fus_prad <- get_fusions_by_study(study_id = "prad_msk_2019")
#> ℹ Returning all data for the "prad_msk_2019_fusion" molecular profile in the "prad_msk_2019" study

Or we can pull all genomic data at the same time with get_genetics_by_study()

all_genomic_blca <- get_genetics_by_study("blca_nmibc_2017")
#> ℹ Returning all data for the "blca_nmibc_2017_mutations" molecular profile in the "blca_nmibc_2017" study
#> ℹ Returning all data for the "blca_nmibc_2017_cna" molecular profile in the "blca_nmibc_2017" study
#> ℹ Returning all data for the "blca_nmibc_2017_fusion" molecular profile in the "blca_nmibc_2017" study
all_genomic_prad<- get_genetics_by_study("prad_msk_2019")
#> ℹ Returning all data for the "prad_msk_2019_mutations" molecular profile in the "prad_msk_2019" study
#> ℹ Returning all data for the "prad_msk_2019_cna" molecular profile in the "prad_msk_2019" study
#> ℹ Returning all data for the "prad_msk_2019_fusion" molecular profile in the "prad_msk_2019" study

all_equal(mut_blca, all_genomic_blca$mutation)
#> [1] TRUE
all_equal(cna_blca, all_genomic_blca$cna)
#> [1] TRUE
all_equal(fus_blca, all_genomic_blca$fusion)
#> [1] TRUE

Finally, we can join the two studies together

mut_study <- bind_rows(mut_blca, mut_prad)
cna_study <- bind_rows(cna_blca, cna_prad)
fus_study <- bind_rows(fus_blca, fus_prad)

By Sample IDs

When we pull by sample IDs, we can pull specific samples across multiple studies. In the above example, we can pull from both studies at the same time for a select set of samples using the sample_study_pairs argument in get_genetics_by_sample().

Let’s pull data for the first 10 samples in each study. We first need to construct our dataframe to pass to the function:

Note: you can also run available_patients() to only pull patient IDs

s1 <- available_samples("blca_nmibc_2017") %>%
  select(sampleId, patientId, studyId) %>%
  head(10)

s2 <- available_samples("prad_msk_2019") %>%
  select(sampleId,  patientId, studyId) %>%
  head(10)

df_pairs <- bind_rows(s1, s2) %>%
  select(-patientId)

We need to rename the columns as per the functions documentation.

df_pairs <- df_pairs %>%
  rename("sample_id" = sampleId,
         "study_id" = studyId)

Now we pass this to get_genetics_by_sample()

all_genomic <- get_genetics_by_sample(sample_study_pairs = df_pairs)
#> Joining, by = "study_id"
#> The following parameters were used in query:
#> Study ID: "blca_nmibc_2017" and "prad_msk_2019"
#> Molecular Profile ID: blca_nmibc_2017_mutations and prad_msk_2019_mutations
#> Genes: "All available genes"
#> Joining, by = "study_id"
#> The following parameters were used in query:
#> Study ID: "blca_nmibc_2017" and "prad_msk_2019"
#> Molecular Profile ID: blca_nmibc_2017_cna and prad_msk_2019_cna
#> Genes: "All available genes"
#> Joining, by = "study_id"
#> The following parameters were used in query:
#> Study ID: "blca_nmibc_2017" and "prad_msk_2019"
#> Molecular Profile ID: blca_nmibc_2017_fusion and prad_msk_2019_fusion
#> Genes: "All available genes"

mut_sample <- all_genomic$mutation

Like with querying by study ID, you can also pull data individually by genomic data type:

mut_only <- get_mutations_by_sample(sample_study_pairs = df_pairs)
#> Joining, by = "study_id"
#> The following parameters were used in query:
#> Study ID: "blca_nmibc_2017" and "prad_msk_2019"
#> Molecular Profile ID: blca_nmibc_2017_mutations and prad_msk_2019_mutations
#> Genes: "All available genes"

identical(mut_only, mut_sample)
#> [1] TRUE

Let’s compare these results with the ones we got from pulling by study:


# filter to our subset used in sample query
mut_study_subset <- mut_study %>%
  filter(sampleId %in%  df_pairs$sample_id)

# arrange to compare
mut_study_subset <- mut_study_subset %>%
  arrange(desc(sampleId))%>%
  arrange(desc(entrezGeneId))

mut_sample <- mut_sample %>%
  arrange(desc(sampleId)) %>%
  arrange(desc(entrezGeneId)) %>%

  # reorder so columns in same order
  select(names(mut_study_subset))

all.equal(mut_study_subset, mut_sample)
#> [1] TRUE

Both results are equal.

Limit Results to Specified Genes or Panels

We can also limit our results to a specific set of genes by passing a vector of Entrez Gene IDs or Hugo Symbols to the gene argument, or a specified panel by passing a panel ID to the panel argument (see available_gene_panels() for supported panels). This can be useful if, for example, we want to pull all IMPACT gene results for two studies but one of the two uses a much larger panel. In that case, we can limit our query to just the genes for which we want results:

by_hugo <- get_mutations_by_sample(sample_study_pairs = df_pairs, genes = "TP53")
#> Joining, by = "study_id"
#> The following parameters were used in query:
#> Study ID: "blca_nmibc_2017" and "prad_msk_2019"
#> Molecular Profile ID: blca_nmibc_2017_mutations and prad_msk_2019_mutations
#> Genes: "TP53"
by_gene_id <- get_mutations_by_sample(sample_study_pairs = df_pairs, genes = 7157)
#> Joining, by = "study_id"
#> The following parameters were used in query:
#> Study ID: "blca_nmibc_2017" and "prad_msk_2019"
#> Molecular Profile ID: blca_nmibc_2017_mutations and prad_msk_2019_mutations
#> Genes: 7157

identical(by_hugo, by_gene_id)
#> [1] TRUE

get_mutations_by_sample(
  sample_study_pairs = df_pairs,
  panel = "IMPACT468") %>%
  head()
#> Joining, by = "study_id"
#> The following parameters were used in query:
#> Study ID: "blca_nmibc_2017" and "prad_msk_2019"
#> Molecular Profile ID: blca_nmibc_2017_mutations and prad_msk_2019_mutations
#> Genes: "IMPACT468"
#> # A tibble: 6 × 33
#>   hugoGeneSymbol entrezGeneId uniqueSampleKey    uniquePatientKey molecularProfil… sampleId patientId studyId center
#>   <chr>                 <int> <chr>              <chr>            <chr>            <chr>    <chr>     <chr>   <chr> 
#> 1 TERT                   7015 UC0wMDAxNDUzLVQwM… UC0wMDAxNDUzOmJ… blca_nmibc_2017… P-00014… P-0001453 blca_n… MSKCC 
#> 2 SMAD4                  4089 UC0wMDAxNDUzLVQwM… UC0wMDAxNDUzOmJ… blca_nmibc_2017… P-00014… P-0001453 blca_n… MSKCC 
#> 3 ERBB4                  2066 UC0wMDAxNDUzLVQwM… UC0wMDAxNDUzOmJ… blca_nmibc_2017… P-00014… P-0001453 blca_n… MSKCC 
#> 4 CUL3                   8452 UC0wMDAxNDUzLVQwM… UC0wMDAxNDUzOmJ… blca_nmibc_2017… P-00014… P-0001453 blca_n… MSKCC 
#> 5 PBRM1                 55193 UC0wMDAxNDUzLVQwM… UC0wMDAxNDUzOmJ… blca_nmibc_2017… P-00014… P-0001453 blca_n… MSKCC 
#> 6 APC                     324 UC0wMDAxNDUzLVQwM… UC0wMDAxNDUzOmJ… blca_nmibc_2017… P-00014… P-0001453 blca_n… MSKCC 
#> # … with 24 more variables: mutationStatus <chr>, validationStatus <chr>, tumorAltCount <int>, tumorRefCount <int>,
#> #   normalAltCount <int>, normalRefCount <int>, startPosition <int>, endPosition <int>, referenceAllele <chr>,
#> #   proteinChange <chr>, mutationType <chr>, functionalImpactScore <chr>, fisValue <dbl>, linkXvar <chr>,
#> #   linkPdb <chr>, linkMsa <chr>, ncbiBuild <chr>, variantType <chr>, chr <chr>, variantAllele <chr>,
#> #   refseqMrnaId <chr>, proteinPosStart <int>, proteinPosEnd <int>, keyword <chr>

Pulling Clinical Data & Sample Metadata

You can also pull clinical data by study ID, sample ID, or patient ID. Pulling by sample ID will pull all sample-level characteristics (e.g. sample site, tumor stage at sampling time and other variables collected at time of sampling that may be available). Pulling by patient ID will pull all patient-level characteristics (e.g. age, sex, etc.). Pulling by study ID will pull all sample and patient-level characteristics at once.

You can explore what clinical data is available a study using:

attr_blca <- available_clinical_attributes("blca_nmibc_2017")
attr_prad <- available_clinical_attributes("prad_msk_2019")

attr_prad
#> # A tibble: 13 × 7
#>    displayName                   description             datatype patientAttribute priority clinicalAttribu… studyId
#>    <chr>                         <chr>                   <chr>    <lgl>            <chr>    <chr>            <chr>  
#>  1 Cancer Type                   Cancer Type             STRING   FALSE            1        CANCER_TYPE      prad_m…
#>  2 Cancer Type Detailed          Cancer Type Detailed    STRING   FALSE            1        CANCER_TYPE_DET… prad_m…
#>  3 Fraction Genome Altered       Fraction Genome Altered NUMBER   FALSE            20       FRACTION_GENOME… prad_m…
#>  4 Gene Panel                    Gene Panel.             STRING   FALSE            1        GENE_PANEL       prad_m…
#>  5 Mutation Count                Mutation Count          NUMBER   FALSE            30       MUTATION_COUNT   prad_m…
#>  6 Oncotree Code                 Oncotree Code           STRING   FALSE            1        ONCOTREE_CODE    prad_m…
#>  7 Sample Class                  The sample classificat… STRING   FALSE            1        SAMPLE_CLASS     prad_m…
#>  8 Number of Samples Per Patient Number of Samples Per … STRING   TRUE             1        SAMPLE_COUNT     prad_m…
#>  9 Sample Type                   The type of sample (i.… STRING   FALSE            1        SAMPLE_TYPE      prad_m…
#> 10 Sex                           Sex                     STRING   TRUE             1        SEX              prad_m…
#> 11 Somatic Status                Somatic Status          STRING   FALSE            1        SOMATIC_STATUS   prad_m…
#> 12 Specimen Preservation Type    The method used for pr… STRING   FALSE            1        SPECIMEN_PRESER… prad_m…
#> 13 TMB (nonsynonymous)           TMB (nonsynonymous)     NUMBER   FALSE            1        TMB_NONSYNONYMO… prad_m…

There are a select set available for both studies:

in_both <- intersect(attr_blca$clinicalAttributeId, attr_prad$clinicalAttributeId)

The below pulls data at the sample level:

clinical_blca <- get_clinical_by_sample(sample_id = s1$sampleId,
                       study_id = "blca_nmibc_2017",
                       clinical_attribute = in_both)

clinical_prad <- get_clinical_by_sample(sample_id = s2$sampleId,
                       study_id = "prad_msk_2019",
                       clinical_attribute = in_both)

all_clinical <- bind_rows(clinical_blca, clinical_prad)

all_clinical %>%
  select(-contains("unique")) %>%
  head()
#> # A tibble: 6 × 5
#>   sampleId          patientId studyId         clinicalAttributeId     value                       
#>   <chr>             <chr>     <chr>           <chr>                   <chr>                       
#> 1 P-0001453-T01-IM3 P-0001453 blca_nmibc_2017 CANCER_TYPE             Bladder Cancer              
#> 2 P-0001453-T01-IM3 P-0001453 blca_nmibc_2017 CANCER_TYPE_DETAILED    Bladder Urothelial Carcinoma
#> 3 P-0001453-T01-IM3 P-0001453 blca_nmibc_2017 FRACTION_GENOME_ALTERED 0.4448                      
#> 4 P-0001453-T01-IM3 P-0001453 blca_nmibc_2017 MUTATION_COUNT          11                          
#> 5 P-0001453-T01-IM3 P-0001453 blca_nmibc_2017 ONCOTREE_CODE           BLCA                        
#> 6 P-0001453-T01-IM3 P-0001453 blca_nmibc_2017 SOMATIC_STATUS          Matched

The below pulls data at the patient level:

p1 <- available_patients("blca_nmibc_2017")

clinical_blca <- get_clinical_by_patient(patient_id = s1$patientId,
                       study_id = "blca_nmibc_2017",
                       clinical_attribute = in_both)

clinical_prad <- get_clinical_by_sample(sample_id = s2$patientId,
                       study_id = "prad_msk_2019",
                       clinical_attribute = in_both)

all_clinical <- bind_rows(clinical_blca, clinical_prad)

all_clinical %>%
  select(-contains("unique")) %>%
  head()
#> # A tibble: 6 × 4
#>   patientId studyId         clinicalAttributeId value
#>   <chr>     <chr>           <chr>               <chr>
#> 1 P-0001453 blca_nmibc_2017 SAMPLE_COUNT        1    
#> 2 P-0001453 blca_nmibc_2017 SEX                 Male 
#> 3 P-0002166 blca_nmibc_2017 SAMPLE_COUNT        1    
#> 4 P-0002166 blca_nmibc_2017 SEX                 Male 
#> 5 P-0003238 blca_nmibc_2017 SAMPLE_COUNT        1    
#> 6 P-0003238 blca_nmibc_2017 SEX                 Male

Like with the genomic data pull functions, you can also pull clinical data by study ID - sample ID, or ID study ID - patient ID pairs:

df_pairs <- bind_rows(s1, s2) %>%
  select(-sampleId)

df_pairs <- df_pairs %>%
  select(patientId, studyId)