How to QA Your IMPACT Data

Introduction

The purpose of this vignette is to outline best practices for downloading, QA-ing, and analyzing data generated from MSK IMPACT, a targeted tumor-sequencing test that can detect mutations and other critical genetic changes in more than 468 genes across common and rare cancers. Using a hepatocellular cancer case study, we demonstrate a data analysis pipeline built on {cbioportalR} functions that can help users generate reproducible analyses of these data.

Setting up

For this vignette, we will be using {cbioportalR}, a package to download data from the cBioPortal website. We will also be using {dplyr} and {tidyr} to clean and manipulate the data:

library(gnomeR)
library(cbioportalR)
library(dplyr)
library(tidyr)

To access cBioPortal data using the {cbioportalR} package, you must set the cBioPortal database using the set_cbioportal_db() function. To access public data, set db = "public". If you are using a private version of cBioPortal, set the db argument to your institution’s cBioPortal URL instead.


set_cbioportal_db(db = "public")
#> ✔ You are successfully connected!
#> ✔ base_url for this R session is now set to "www.cbioportal.org/api"

Case Study

Scenario: You are a data analyst whose collaborator has sent you a clinical file of a cohort of patients with hepatocellular cancer that she is interested in for retrospective data analysis. In particular, she wants to look at IMPACT sequencing data for the cohort and investigate associations between genomic alterations and pathological and clinical characteristics. She asks if you can get the IMPACT data and do the analysis.

She gives you a clinical file with 80 sample IDs: clin_collab_df.

head(clin_collab_df)
#> # A tibble: 6 × 3
#>   cbioportal_id     ctype          primary_mets
#>   <chr>             <chr>          <chr>       
#> 1 P-0066922-T02-IM7 Hepatocellular Primary     
#> 2 P-0009540-T01-IM5 Hepatocellular Primary     
#> 3 P-0000182-T01-IM3 Hepatocellular Metastasis  
#> 4 P-0000037-T02-IM3 Hepatocellular Primary     
#> 5 P-0005357-T01-IM5 Hepatocellular Primary     
#> 6 P-0007773-T01-IM5 Hepatocellular Metastasis

The sample IDs are in the cbioportal_id column.

Before using cBioPortal to access the genomic data, you first want to do some QA on the clinical data and make sure it matches up with the clinical data in cBioPortal.

Check For Multiple Samples Per Patient

One of the first things to check in your data is whether you have multiple sample IDs from the same patient. Sometimes a clinical file will have a patient ID column as well; this one doesn’t, so you can make your own. The patient ID is just the first 9 characters of the cbioportal_id:


clin_collab_df <- clin_collab_df %>%
  mutate(
    patient_id = substr(cbioportal_id, 1, 9)
  )
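
Because substr() relies on every ID having the same fixed-width layout, it may be worth confirming the format before deriving patient IDs. The following is a minimal sketch on toy data; the regular expression is an assumption inferred from the example IDs in this vignette, not an official specification of the ID format, and toy_df is a hypothetical stand-in for clin_collab_df.

```r
library(dplyr)

# Toy IDs for illustration only; the last one is deliberately malformed
toy_df <- tibble(
  cbioportal_id = c("P-0004876-T01-IM5", "P-0012198-T02-IM5", "P-12198-T01-IM5")
)

# Assumed pattern, inferred from the example IDs in this vignette:
# "P-", 7 digits, "-T", 2 digits, "-IM", then one or more digits
id_pattern <- "^P-[0-9]{7}-T[0-9]{2}-IM[0-9]+$"

toy_df <- toy_df %>%
  mutate(
    valid_format = grepl(id_pattern, cbioportal_id),
    # Only derive a patient ID when the sample ID has the expected layout
    patient_id = if_else(valid_format, substr(cbioportal_id, 1, 9), NA_character_)
  )

toy_df
```

Rows with valid_format equal to FALSE would be worth raising with your collaborator before going further.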

If there is only one sample per patient, there should be the same number of samples as patients.

clin_collab_df %>%
  summarize(
    patients = length(unique(patient_id)),
    samples = length(unique(cbioportal_id))
  )
#> # A tibble: 1 × 2
#>   patients samples
#>      <int>   <int>
#> 1       78      80

Since there are 80 samples but only 78 patients, at least one patient must have multiple samples. To find out which patients, count the patient_id values and filter for n > 1.

multiple_samps <- clin_collab_df %>%
  count(patient_id) %>%
  filter(n > 1)
multiple_samps
#> # A tibble: 2 × 2
#>   patient_id     n
#>   <chr>      <int>
#> 1 P-0004876      2
#> 2 P-0012198      2

There are 2 patients who each have 2 samples in the collaborator’s dataset. Filter the dataset to see the cbioportal_id values in question:


clin_collab_df %>%
  filter(patient_id %in% multiple_samps$patient_id)
#> # A tibble: 4 × 4
#>   cbioportal_id     ctype          primary_mets patient_id
#>   <chr>             <chr>          <chr>        <chr>     
#> 1 P-0004876-T01-IM5 Hepatocellular Primary      P-0004876 
#> 2 P-0012198-T02-IM5 Hepatocellular Primary      P-0012198 
#> 3 P-0004876-T02-IM5 Hepatocellular Primary      P-0004876 
#> 4 P-0012198-T01-IM5 Hepatocellular Primary      P-0012198

These are patients and samples to ask your collaborator about: Does using both samples make sense? Often the answer is no. If not, which sample is the most appropriate one to include? (To get more information yourself, you can enter the patient IDs into the cBioPortal website.)
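
When sending these questions to a collaborator, it can help to show one row per patient with all of that patient's sample IDs side by side. A minimal sketch on toy data follows; toy_df and dup_summary are hypothetical names, and the toy rows only mirror the structure of clin_collab_df.

```r
library(dplyr)

# Toy data mirroring the structure of clin_collab_df (illustrative values)
toy_df <- tibble(
  cbioportal_id = c("P-0004876-T01-IM5", "P-0012198-T02-IM5",
                    "P-0004876-T02-IM5", "P-0009540-T01-IM5"),
  patient_id    = substr(cbioportal_id, 1, 9)
)

# One row per patient with more than one sample,
# listing all of that patient's sample IDs in a single string
dup_summary <- toy_df %>%
  group_by(patient_id) %>%
  summarize(
    n_samples  = n(),
    sample_ids = paste(sort(cbioportal_id), collapse = ", "),
    .groups    = "drop"
  ) %>%
  filter(n_samples > 1)

dup_summary
```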

Check That All `cbioportal_ids` Are In cBioPortal Database

To do this, retrieve the clinical data from cBioPortal with the get_clinical_by_sample() function from {cbioportalR}. Set the sample_id argument to the cbioportal_id values from the collaborator’s file, and store the result in an object called clin_cbio.


clin_cbio <- get_clinical_by_sample(sample_id = clin_collab_df$cbioportal_id)
#> ! No `clinical_attribute` passed. Defaulting to returning
#> all clinical attributes in "msk_impact_2017" study

(You can disregard the warning message for now, though you may be interested in specific clinical attributes later.)

Note: If you are using the public version of cBioPortal and do not specify a study, this function queries the msk_impact_2017 study by default.

Notice that you now have 2 clinical data frames: one given to you by the collaborator (clin_collab_df) and one you have retrieved yourself from cBioPortal (clin_cbio).

Here are the first few rows of clin_cbio:

head(clin_cbio) %>% as.data.frame()
#>                                uniqueSampleKey
#> 1 UC0wMDAwMDM3LVQwMi1JTTM6bXNrX2ltcGFjdF8yMDE3
#> 2 UC0wMDAwMDM3LVQwMi1JTTM6bXNrX2ltcGFjdF8yMDE3
#> 3 UC0wMDAwMDM3LVQwMi1JTTM6bXNrX2ltcGFjdF8yMDE3
#> 4 UC0wMDAwMDM3LVQwMi1JTTM6bXNrX2ltcGFjdF8yMDE3
#> 5 UC0wMDAwMDM3LVQwMi1JTTM6bXNrX2ltcGFjdF8yMDE3
#> 6 UC0wMDAwMDM3LVQwMi1JTTM6bXNrX2ltcGFjdF8yMDE3
#>                     uniquePatientKey          sampleId patientId
#> 1 UC0wMDAwMDM3Om1za19pbXBhY3RfMjAxNw P-0000037-T02-IM3 P-0000037
#> 2 UC0wMDAwMDM3Om1za19pbXBhY3RfMjAxNw P-0000037-T02-IM3 P-0000037
#> 3 UC0wMDAwMDM3Om1za19pbXBhY3RfMjAxNw P-0000037-T02-IM3 P-0000037
#> 4 UC0wMDAwMDM3Om1za19pbXBhY3RfMjAxNw P-0000037-T02-IM3 P-0000037
#> 5 UC0wMDAwMDM3Om1za19pbXBhY3RfMjAxNw P-0000037-T02-IM3 P-0000037
#> 6 UC0wMDAwMDM3Om1za19pbXBhY3RfMjAxNw P-0000037-T02-IM3 P-0000037
#>           studyId     clinicalAttributeId                    value
#> 1 msk_impact_2017             CANCER_TYPE     Hepatobiliary Cancer
#> 2 msk_impact_2017    CANCER_TYPE_DETAILED Hepatocellular Carcinoma
#> 3 msk_impact_2017               DNA_INPUT                      250
#> 4 msk_impact_2017 FRACTION_GENOME_ALTERED                   0.2373
#> 5 msk_impact_2017          MATCHED_STATUS                  Matched
#> 6 msk_impact_2017          MUTATION_COUNT                        3

The sample IDs here are in the sampleId column. You may notice that this file is in “long” format and each sample has multiple rows. Later we will convert this file to “wide” format to do QA checking on attributes.

But the first thing you want to know is whether you are able to find all of the cbioportal_ids from your clin_collab_df file in the clin_cbio file.
To do this, use the setdiff() function:


setdiff(clin_collab_df$cbioportal_id, clin_cbio$sampleId)
#> [1] "P-0066922-T02-IM7" "P-0070148-T01-IM5"

So there are 2 sample IDs from your clinical file (clin_collab_df) that are currently not found in cBioPortal (in your clin_cbio data). Include these in the list of cBioPortal questions to ask your collaborator.
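
setdiff() returns only the IDs. If you also want the rest of each unmatched row (for example, to paste into an email), dplyr's anti_join() is one option, and running setdiff() in the reverse direction confirms that cBioPortal returned nothing unexpected. This is a minimal sketch on toy data; collab and cbio_long are hypothetical stand-ins for clin_collab_df and clin_cbio.

```r
library(dplyr)

# Toy stand-in for the collaborator's file
collab <- tibble(
  cbioportal_id = c("P-0000037-T02-IM3", "P-0066922-T02-IM7", "P-0009540-T01-IM5"),
  primary_mets  = c("Primary", "Primary", "Primary")
)

# Toy stand-in for the long-format cBioPortal clinical data
cbio_long <- tibble(
  sampleId            = c("P-0000037-T02-IM3", "P-0000037-T02-IM3", "P-0009540-T01-IM5"),
  clinicalAttributeId = c("CANCER_TYPE", "SAMPLE_TYPE", "SAMPLE_TYPE"),
  value               = c("Hepatobiliary Cancer", "Primary", "Primary")
)

# Full rows from the collaborator's file with no match in cBioPortal
missing_from_cbio <- collab %>%
  anti_join(cbio_long, by = c("cbioportal_id" = "sampleId"))

# Reverse check: IDs returned by cBioPortal that were never requested
unexpected_in_cbio <- setdiff(cbio_long$sampleId, collab$cbioportal_id)

missing_from_cbio
unexpected_in_cbio
```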

(Again, if you want to investigate a bit further, you could enter the patient cBioPortal IDs as queries into the cBioPortal website.)

Check Clinical Data Matches cBioPortal Database

Now we need to check whether the clinical information in the collaborator’s file (clin_collab_df) matches the clinical information in cBioPortal (in your clin_cbio data).

Look at the clin_collab_df again:

head(clin_collab_df)
#> # A tibble: 6 × 4
#>   cbioportal_id     ctype          primary_mets patient_id
#>   <chr>             <chr>          <chr>        <chr>     
#> 1 P-0066922-T02-IM7 Hepatocellular Primary      P-0066922 
#> 2 P-0009540-T01-IM5 Hepatocellular Primary      P-0009540 
#> 3 P-0000182-T01-IM3 Hepatocellular Metastasis   P-0000182 
#> 4 P-0000037-T02-IM3 Hepatocellular Primary      P-0000037 
#> 5 P-0005357-T01-IM5 Hepatocellular Primary      P-0005357 
#> 6 P-0007773-T01-IM5 Hepatocellular Metastasis   P-0007773

Aside from cbioportal_id, you have cancer type (ctype) and sample type (primary_mets) variables. Because it’s a hepatocellular cancer study, all of the ctype values should be the same. To double-check that, count ctype:

clin_collab_df %>% count(ctype)
#> # A tibble: 1 × 2
#>   ctype              n
#>   <chr>          <int>
#> 1 Hepatocellular    80

So the only variable you can check in this example is primary_mets. To see whether clin_cbio has an analogous variable to check it against, first list the attributes that are available in it.

clin_cbio %>% count(clinicalAttributeId)
#> # A tibble: 18 × 2
#>    clinicalAttributeId            n
#>    <chr>                      <int>
#>  1 CANCER_TYPE                   78
#>  2 CANCER_TYPE_DETAILED          78
#>  3 DNA_INPUT                     78
#>  4 FRACTION_GENOME_ALTERED       78
#>  5 MATCHED_STATUS                78
#>  6 METASTATIC_SITE                7
#>  7 MUTATION_COUNT                78
#>  8 ONCOTREE_CODE                 78
#>  9 PRIMARY_SITE                  78
#> 10 SAMPLE_CLASS                  78
#> 11 SAMPLE_COLLECTION_SOURCE      78
#> 12 SAMPLE_COVERAGE               78
#> 13 SAMPLE_TYPE                   78
#> 14 SOMATIC_STATUS                78
#> 15 SPECIMEN_PRESERVATION_TYPE    78
#> 16 SPECIMEN_TYPE                 78
#> 17 TMB_NONSYNONYMOUS             78
#> 18 TUMOR_PURITY                  77

To quickly see values associated with a particular attribute, filter by the attribute and count the values. For example:

clin_cbio %>%
  filter(clinicalAttributeId == "SAMPLE_TYPE") %>%
  count(value)
#> # A tibble: 2 × 2
#>   value          n
#>   <chr>      <int>
#> 1 Metastasis     7
#> 2 Primary       71

The attribute SAMPLE_TYPE looks like the appropriate variable to check primary_mets against. To do this, we will convert clin_cbio to “wide” form (only for the SAMPLE_TYPE variable for now), merge it with clin_collab_df and then cross-tabulate the 2 variables.

To convert clin_cbio to “wide” form:


clin_cbio_wide <- clin_cbio %>%
  select(sampleId, clinicalAttributeId, value) %>%
  filter(clinicalAttributeId == "SAMPLE_TYPE") %>%
  pivot_wider(names_from = clinicalAttributeId, values_from = value)

Take a look at the “wide” file:

head(clin_cbio_wide) %>% as.data.frame()
#>            sampleId SAMPLE_TYPE
#> 1 P-0000037-T02-IM3     Primary
#> 2 P-0000182-T01-IM3  Metastasis
#> 3 P-0000218-T01-IM3  Metastasis
#> 4 P-0000228-T03-IM5     Primary
#> 5 P-0000587-T01-IM3  Metastasis
#> 6 P-0000829-T01-IM3     Primary
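
The same pivot can be applied to every attribute at once by dropping the filter() step. The sketch below uses toy data (cbio_long is a hypothetical stand-in for clin_cbio) and illustrates two things to expect: attributes that are missing for a sample become NA, and every value starts out as character, so numeric attributes need converting.

```r
library(dplyr)
library(tidyr)

# Toy long-format data; METASTATIC_SITE is present for only one sample
cbio_long <- tibble(
  sampleId            = c("P-0000037-T02-IM3", "P-0000037-T02-IM3",
                          "P-0000182-T01-IM3", "P-0000182-T01-IM3",
                          "P-0000182-T01-IM3"),
  clinicalAttributeId = c("SAMPLE_TYPE", "MUTATION_COUNT",
                          "SAMPLE_TYPE", "MUTATION_COUNT", "METASTATIC_SITE"),
  value               = c("Primary", "3", "Metastasis", "5", "Lung")
)

cbio_wide_all <- cbio_long %>%
  select(sampleId, clinicalAttributeId, value) %>%
  # One column per attribute; absent attributes are filled with NA
  pivot_wider(names_from = clinicalAttributeId, values_from = value) %>%
  # The long-format value column is character, so convert numeric attributes
  mutate(MUTATION_COUNT = as.numeric(MUTATION_COUNT))

cbio_wide_all
```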

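Now, to check the primary_mets variable from clin_collab_df against the SAMPLE_TYPE variable from clin_cbio_wide, merge the two data frames and tabulate the variables. Because the join starts from clin_cbio_wide, the 2 samples that were not found in cBioPortal are not part of this comparison.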

clin_merged <- clin_cbio_wide %>%
  left_join(clin_collab_df, by = c("sampleId" = "cbioportal_id"))
clin_merged %>%
  select(primary_mets, SAMPLE_TYPE) %>%
  table()
#>             SAMPLE_TYPE
#> primary_mets Metastasis Primary
#>   Metastasis          7       1
#>   Primary             0      70

There is 1 sample that has a value of “Metastasis” for the primary_mets variable but “Primary” for the SAMPLE_TYPE variable. To find the sample ID, filter:

clin_merged %>% filter(primary_mets == "Metastasis" & SAMPLE_TYPE == "Primary")
#> # A tibble: 1 × 5
#>   sampleId          SAMPLE_TYPE ctype          primary_mets patient_id
#>   <chr>             <chr>       <chr>          <chr>        <chr>     
#> 1 P-0001324-T01-IM3 Primary     Hepatocellular Metastasis   P-0001324
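
The filter above targets the one discordant cell seen in the cross-tabulation. A more general check, sketched here on toy data, flags any row where the two variables disagree or where either one is missing. The toy sample IDs and the case and whitespace normalization are assumptions for illustration; merged is a hypothetical stand-in for clin_merged.

```r
library(dplyr)

# Toy stand-in for clin_merged (illustrative values only)
merged <- tibble(
  sampleId     = c("A", "B", "C", "D"),
  SAMPLE_TYPE  = c("Primary", "Metastasis", "Primary", "Primary"),
  primary_mets = c("Primary", "Metastasis", "Metastasis", NA)
)

mismatches <- merged %>%
  filter(
    # Missing values on either side are flagged explicitly, because a
    # plain != comparison yields NA and filter() would silently drop the row
    is.na(primary_mets) | is.na(SAMPLE_TYPE) |
      tolower(trimws(primary_mets)) != tolower(trimws(SAMPLE_TYPE))
  )

mismatches
```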

Include this sample in the list of questions for your collaborator. Either she will need to update her clinical file with the correct value or you/she will have to notify cBioPortal to update their database.
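
To keep the questions for your collaborator in one place, the findings from each check can be stacked into a single table. This is a minimal sketch that hard-codes the sample IDs flagged in this vignette; the object names and issue labels are hypothetical.

```r
library(dplyr)

# Sample IDs flagged by each QA check above
multi_sample  <- tibble(cbioportal_id = c("P-0004876-T01-IM5", "P-0004876-T02-IM5",
                                          "P-0012198-T01-IM5", "P-0012198-T02-IM5"))
not_found     <- tibble(cbioportal_id = c("P-0066922-T02-IM7", "P-0070148-T01-IM5"))
type_mismatch <- tibble(cbioportal_id = "P-0001324-T01-IM3")

# bind_rows() uses the argument names as labels in the .id column
qa_issues <- bind_rows(
  "Multiple samples for one patient"      = multi_sample,
  "Sample ID not found in cBioPortal"     = not_found,
  "Sample type disagrees with cBioPortal" = type_mismatch,
  .id = "issue"
)

qa_issues
```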