Introduction
This vignette outlines best practices for downloading, quality-checking (QA) and analyzing data generated from MSK-IMPACT, a targeted tumor-sequencing test that can detect more than 468 gene mutations and other critical genetic changes in common and rare cancers. Using a hepatocellular cancer case study, we demonstrate a data analysis pipeline built on {cbioportalR} functions that can help users generate reproducible analyses from these data.
Setting up
For this vignette, we will use {cbioportalR}, a package for downloading data from the cBioPortal website. We will also use {dplyr} and {tidyr} to clean and manipulate the data.
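The packages named above can be attached at the top of the session. A minimal sketch, assuming all three are already installed:

```r
# Attach the packages used throughout this vignette
# (assumption: cbioportalR, dplyr and tidyr are already installed)
library(cbioportalR)
library(dplyr)
library(tidyr)
```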
To access cBioPortal data using the {cbioportalR} package, you must first set the cBioPortal database with the set_cbioportal_db() function. To access public data, set db = "public". If you are using a private version of cBioPortal, set the db argument to your institution’s cBioPortal URL.
set_cbioportal_db(db = "public")
#> ✔ You are successfully connected!
#> ✔ base_url for this R session is now set to "www.cbioportal.org/api"
Case Study
Scenario: You are a data analyst whose collaborator has sent you a clinical file of a cohort of patients with hepatocellular cancer that she is interested in for retrospective data analysis. In particular, she wants to look at IMPACT sequencing data for the cohort and investigate associations between genomic alterations and pathological and clinical characteristics. She asks if you can get the IMPACT data and do the analysis.
She gives you a clinical file, clin_collab_df, with 80 sample IDs:
head(clin_collab_df)
#> # A tibble: 6 × 3
#> cbioportal_id ctype primary_mets
#> <chr> <chr> <chr>
#> 1 P-0066922-T02-IM7 Hepatocellular Primary
#> 2 P-0009540-T01-IM5 Hepatocellular Primary
#> 3 P-0000182-T01-IM3 Hepatocellular Metastasis
#> 4 P-0000037-T02-IM3 Hepatocellular Primary
#> 5 P-0005357-T01-IM5 Hepatocellular Primary
#> 6 P-0007773-T01-IM5 Hepatocellular Metastasis
The sample IDs are in the cbioportal_id column.
Before using cBioPortal to access the genomic data, you first want to do some QA on the clinical data and make sure it matches up with the clinical data in cBioPortal.
Check For Multiple Samples Per Patient
One of the first things to check in your data is whether you have multiple sample IDs from the same patient. Sometimes a clinical file will have a patient ID column as well; this one doesn’t, so you can make your own. The patient ID is just the first 9 characters of the cbioportal_id (for example, P-0004876 for sample P-0004876-T01-IM5).
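One way to derive the column, sketched here on a small toy tibble (toy_clin is a hypothetical stand-in for the collaborator's file; the IDs are taken from examples in this vignette):

```r
library(dplyr)

# Toy stand-in for the collaborator's file (hypothetical object name)
toy_clin <- tibble(
  cbioportal_id = c("P-0004876-T01-IM5", "P-0004876-T02-IM5", "P-0009540-T01-IM5")
)

# The patient ID is the first 9 characters of the sample ID, e.g. "P-0004876"
toy_clin <- toy_clin %>%
  mutate(patient_id = substr(cbioportal_id, 1, 9))

toy_clin
```

Applying the same mutate() call to clin_collab_df adds the patient_id column that the checks in this section rely on.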
If there is only one sample per patient, there should be the same number of samples as patients.
clin_collab_df %>%
  summarize(
    patients = length(unique(patient_id)),
    samples = length(unique(cbioportal_id))
  )
#> # A tibble: 1 × 2
#> patients samples
#> <int> <int>
#> 1 78 80
So it’s clear that at least one patient has multiple samples. To find out which, count the patient_id values and filter for counts greater than 1.
multiple_samps <- clin_collab_df %>%
  count(patient_id) %>%
  filter(n > 1)
multiple_samps
#> # A tibble: 2 × 2
#> patient_id n
#> <chr> <int>
#> 1 P-0004876 2
#> 2 P-0012198 2
There are 2 patients who each have 2 samples in the collaborator’s dataset. Filter the dataset to see the cbioportal_ids in question:
clin_collab_df %>%
  filter(patient_id %in% multiple_samps$patient_id)
#> # A tibble: 4 × 4
#> cbioportal_id ctype primary_mets patient_id
#> <chr> <chr> <chr> <chr>
#> 1 P-0004876-T01-IM5 Hepatocellular Primary P-0004876
#> 2 P-0012198-T02-IM5 Hepatocellular Primary P-0012198
#> 3 P-0004876-T02-IM5 Hepatocellular Primary P-0004876
#> 4 P-0012198-T01-IM5 Hepatocellular Primary P-0012198
These are patients and samples to ask your collaborator about: Does using both samples make sense? Often the answer is no. If not, which sample is the most appropriate one to include? (To get more information yourself, you can enter the patient IDs into the cBioPortal website.)
Check That All cbioportal_ids Are In cBioPortal Database
To do this, you need to retrieve the clinical data from cBioPortal using {cbioportalR}.
You can use the get_clinical_by_sample() function from {cbioportalR} to do this. Set the sample_id parameter to the cbioportal_ids from the collaborator’s clinical file.
Store the result in an object called clin_cbio.
clin_cbio = get_clinical_by_sample(sample_id = clin_collab_df$cbioportal_id)
#> ! No `clinical_attribute` passed. Defaulting to returning
#> all clinical attributes in "msk_impact_2017" study
(You can disregard the warning message for now, though you may be interested in specific clinical attributes later.)
Note: If you are using the public version of cBioPortal, this function will only query the msk_impact_2017 study.
Notice that you now have 2 clinical files: one given to you by the collaborator (clin_collab_df) and one you have retrieved yourself from cBioPortal (clin_cbio).
Here are the first rows of clin_cbio:
head(clin_cbio) %>% as.data.frame()
#> uniqueSampleKey
#> 1 UC0wMDAwMDM3LVQwMi1JTTM6bXNrX2ltcGFjdF8yMDE3
#> 2 UC0wMDAwMDM3LVQwMi1JTTM6bXNrX2ltcGFjdF8yMDE3
#> 3 UC0wMDAwMDM3LVQwMi1JTTM6bXNrX2ltcGFjdF8yMDE3
#> 4 UC0wMDAwMDM3LVQwMi1JTTM6bXNrX2ltcGFjdF8yMDE3
#> 5 UC0wMDAwMDM3LVQwMi1JTTM6bXNrX2ltcGFjdF8yMDE3
#> 6 UC0wMDAwMDM3LVQwMi1JTTM6bXNrX2ltcGFjdF8yMDE3
#> uniquePatientKey sampleId patientId
#> 1 UC0wMDAwMDM3Om1za19pbXBhY3RfMjAxNw P-0000037-T02-IM3 P-0000037
#> 2 UC0wMDAwMDM3Om1za19pbXBhY3RfMjAxNw P-0000037-T02-IM3 P-0000037
#> 3 UC0wMDAwMDM3Om1za19pbXBhY3RfMjAxNw P-0000037-T02-IM3 P-0000037
#> 4 UC0wMDAwMDM3Om1za19pbXBhY3RfMjAxNw P-0000037-T02-IM3 P-0000037
#> 5 UC0wMDAwMDM3Om1za19pbXBhY3RfMjAxNw P-0000037-T02-IM3 P-0000037
#> 6 UC0wMDAwMDM3Om1za19pbXBhY3RfMjAxNw P-0000037-T02-IM3 P-0000037
#> studyId clinicalAttributeId value
#> 1 msk_impact_2017 CANCER_TYPE Hepatobiliary Cancer
#> 2 msk_impact_2017 CANCER_TYPE_DETAILED Hepatocellular Carcinoma
#> 3 msk_impact_2017 DNA_INPUT 250
#> 4 msk_impact_2017 FRACTION_GENOME_ALTERED 0.2373
#> 5 msk_impact_2017 MATCHED_STATUS Matched
#> 6 msk_impact_2017 MUTATION_COUNT 3
The sample IDs here are in the sampleId column. You may notice that this file is in “long” format: each sample has one row per clinical attribute. Later we will convert this file to “wide” format to do QA checking on attributes.
But the first thing you want to know is whether you can find all of the cbioportal_ids from your clin_collab_df file in the clin_cbio file.
To do this, use the setdiff() function:
setdiff(clin_collab_df$cbioportal_id, clin_cbio$sampleId)
#> [1] "P-0066922-T02-IM7" "P-0070148-T01-IM5"
So there are two sample IDs from your clinical file (clin_collab_df) that are currently not found in cBioPortal (in your clin_cbio file). Include these in the list of cBioPortal questions to ask your collaborator.
(Again, if you want to investigate a bit further, you could enter the patient cBioPortal IDs as queries into the cBioPortal website.)
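Note that setdiff() is not symmetric: it returns only the values in its first argument that are absent from the second. A small sketch with toy vectors (the toy_* names are hypothetical stand-ins for clin_collab_df$cbioportal_id and clin_cbio$sampleId) runs the check in both directions:

```r
# Toy ID vectors (hypothetical names); the repeated ID mimics the
# long format of clin_cbio, where each sample appears on several rows
toy_collab_ids <- c("P-0000037-T02-IM3", "P-0066922-T02-IM7", "P-0000182-T01-IM3")
toy_cbio_ids   <- c("P-0000037-T02-IM3", "P-0000182-T01-IM3", "P-0000182-T01-IM3")

# IDs in the collaborator's file that cBioPortal did not return
missing_from_cbio <- setdiff(toy_collab_ids, toy_cbio_ids)

# Reverse check: IDs returned by cBioPortal that were never requested
unexpected_in_cbio <- setdiff(toy_cbio_ids, toy_collab_ids)

missing_from_cbio
unexpected_in_cbio
```

Because setdiff() compares unique values, the repeated rows of a long-format file do not affect the result.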
Check Clinical Data Matches cBioPortal Database
Now we need to check whether the clinical information in the collaborator’s file (clin_collab_df) matches the clinical information in cBioPortal (in your clin_cbio file).
Look at clin_collab_df again:
head(clin_collab_df)
#> # A tibble: 6 × 4
#> cbioportal_id ctype primary_mets patient_id
#> <chr> <chr> <chr> <chr>
#> 1 P-0066922-T02-IM7 Hepatocellular Primary P-0066922
#> 2 P-0009540-T01-IM5 Hepatocellular Primary P-0009540
#> 3 P-0000182-T01-IM3 Hepatocellular Metastasis P-0000182
#> 4 P-0000037-T02-IM3 Hepatocellular Primary P-0000037
#> 5 P-0005357-T01-IM5 Hepatocellular Primary P-0005357
#> 6 P-0007773-T01-IM5 Hepatocellular Metastasis P-0007773
Aside from cbioportal_id, you have cancer type (ctype) and sample type (primary_mets) variables. Because it’s a hepatocellular cancer study, all of the ctype values should be the same. To double check that, count ctype:
clin_collab_df %>% count(ctype)
#> # A tibble: 1 × 2
#> ctype n
#> <chr> <int>
#> 1 Hepatocellular 80
So the only variable you can check in this example is primary_mets. To see whether the clin_cbio file has an analogous variable to check it against, first list the attributes that are available in it.
clin_cbio %>% count(clinicalAttributeId)
#> # A tibble: 18 × 2
#> clinicalAttributeId n
#> <chr> <int>
#> 1 CANCER_TYPE 78
#> 2 CANCER_TYPE_DETAILED 78
#> 3 DNA_INPUT 78
#> 4 FRACTION_GENOME_ALTERED 78
#> 5 MATCHED_STATUS 78
#> 6 METASTATIC_SITE 7
#> 7 MUTATION_COUNT 78
#> 8 ONCOTREE_CODE 78
#> 9 PRIMARY_SITE 78
#> 10 SAMPLE_CLASS 78
#> 11 SAMPLE_COLLECTION_SOURCE 78
#> 12 SAMPLE_COVERAGE 78
#> 13 SAMPLE_TYPE 78
#> 14 SOMATIC_STATUS 78
#> 15 SPECIMEN_PRESERVATION_TYPE 78
#> 16 SPECIMEN_TYPE 78
#> 17 TMB_NONSYNONYMOUS 78
#> 18 TUMOR_PURITY 77
To quickly see values associated with a particular attribute, filter by the attribute and count the values. For example:
clin_cbio %>%
  filter(clinicalAttributeId == "SAMPLE_TYPE") %>%
  count(value)
#> # A tibble: 2 × 2
#> value n
#> <chr> <int>
#> 1 Metastasis 7
#> 2 Primary 71
The attribute SAMPLE_TYPE looks like the appropriate variable to check primary_mets against. To do this, we will convert clin_cbio to “wide” form (only for the SAMPLE_TYPE variable for now), merge it with clin_collab_df and then cross-tabulate the 2 variables.
To convert clin_cbio to “wide” form:
clin_cbio_wide = clin_cbio %>%
  select(sampleId, clinicalAttributeId, value) %>%
  filter(clinicalAttributeId == "SAMPLE_TYPE") %>%
  pivot_wider(names_from = clinicalAttributeId, values_from = value)
Take a look at the “wide” file:
head(clin_cbio_wide) %>% as.data.frame()
#> sampleId SAMPLE_TYPE
#> 1 P-0000037-T02-IM3 Primary
#> 2 P-0000182-T01-IM3 Metastasis
#> 3 P-0000218-T01-IM3 Metastasis
#> 4 P-0000228-T03-IM5 Primary
#> 5 P-0000587-T01-IM3 Metastasis
#> 6 P-0000829-T01-IM3 Primary
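The same pivot works for every attribute at once if the filter() step is dropped. A sketch on toy long-format data (toy_long and its values are invented for illustration; only the column names follow clin_cbio):

```r
library(dplyr)
library(tidyr)

# Toy long-format data mimicking the columns of clin_cbio (values are made up)
toy_long <- tibble(
  sampleId = rep(c("P-0000037-T02-IM3", "P-0000182-T01-IM3"), each = 3),
  clinicalAttributeId = rep(c("CANCER_TYPE", "MUTATION_COUNT", "SAMPLE_TYPE"), times = 2),
  value = c("Hepatobiliary Cancer", "3", "Primary",
            "Hepatobiliary Cancer", "5", "Metastasis")
)

# One row per sample, one column per clinical attribute
toy_wide <- toy_long %>%
  select(sampleId, clinicalAttributeId, value) %>%
  pivot_wider(names_from = clinicalAttributeId, values_from = value)

# Every value arrives as character, so convert numeric attributes explicitly
toy_wide <- toy_wide %>%
  mutate(MUTATION_COUNT = as.numeric(MUTATION_COUNT))

toy_wide
```

Attributes that a sample lacks (such as METASTATIC_SITE for most samples here) would appear as NA in the wide table.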
Now, to check the primary_mets variable from clin_collab_df against the SAMPLE_TYPE variable from clin_cbio_wide, merge the files and tabulate the variables.
# Joining from clin_cbio_wide keeps only the samples that were found in cBioPortal
clin_merged <- clin_cbio_wide %>%
  left_join(clin_collab_df, by = c("sampleId" = "cbioportal_id"))
clin_merged %>%
  select(primary_mets, SAMPLE_TYPE) %>%
  table()
#> SAMPLE_TYPE
#> primary_mets Metastasis Primary
#> Metastasis 7 1
#> Primary 0 70
There is 1 sample that has a value of “Metastasis” for the primary_mets variable but “Primary” for the SAMPLE_TYPE variable. To find the sample ID, filter:
clin_merged %>% filter(primary_mets == "Metastasis" & SAMPLE_TYPE == "Primary")
#> # A tibble: 1 × 5
#> sampleId SAMPLE_TYPE ctype primary_mets patient_id
#> <chr> <chr> <chr> <chr> <chr>
#> 1 P-0001324-T01-IM3 Primary Hepatocellular Metastasis P-0001324
Include this sample in the list of questions for your collaborator. Either she will need to update her clinical file with the correct value or you/she will have to notify cBioPortal to update their database.
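When a cross-tabulation has several off-diagonal cells, filtering each combination by hand gets tedious. The sketch below, on toy data with hypothetical object names and sample IDs, flags every disagreement in one step. It joins from the collaborator's side, so samples absent from cBioPortal stay visible as NA instead of being dropped:

```r
library(dplyr)

# Toy versions of the two clinical files (hypothetical names and IDs)
toy_collab <- tibble(
  cbioportal_id = c("S1", "S2", "S3"),
  primary_mets  = c("Primary", "Metastasis", "Primary")
)
toy_cbio_wide <- tibble(
  sampleId    = c("S1", "S2"),
  SAMPLE_TYPE = c("Primary", "Primary")
)

# Keep every collaborator sample, matched or not
toy_merged <- toy_collab %>%
  left_join(toy_cbio_wide, by = c("cbioportal_id" = "sampleId"))

# Rows where the two sources disagree, or where cBioPortal has no record
to_query <- toy_merged %>%
  filter(is.na(SAMPLE_TYPE) | primary_mets != SAMPLE_TYPE)

to_query
```

The is.na() check comes first because a comparison against NA yields NA, which filter() would otherwise treat as a non-match and drop.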