In patients with metastatic melanoma, the identification and validation of accurate prognostic biomarkers will assist rational treatment planning. Studies based on “-omics” technologies have focussed on a single high-throughput data type such as gene or microRNA transcripts. Occasionally, these features were evaluated in conjunction with limited clinico-pathologic data. With the increased availability of multiple data types, there is a pressing need to tease apart which of these sources contain the most valuable prognostic information. Professor Graham Mann and his group evaluated and integrated several data types derived from the same tumor specimens in AJCC stage III melanoma patients - gene, protein, and microRNA expression as well as clinical, pathologic and mutation information - to determine their relative impact on prognosis. They used classification frameworks based on pre-validation and bootstrap multiple imputation classification to compare the prognostic power of each data source, both individually as well as integratively. They found that the prognostic utility of clinico-pathologic information was not out-performed by various “-omics” platforms. Rather, a combination of clinico-pathologic variables and mRNA expression data performed best. Furthermore, a patient-based classification analysis revealed that the prognostic accuracy of various data types was not the same for different patients, providing useful insights for ongoing developments in the individualized treatment of melanomas patients.
The stage III melanoma cohort microarray data is stored on the Gene Expression Omnibus website with accession number GSE54467.
Explore what is on GEO website for GSE54467, what type of data and information is here?
library(limma)
library(ggfortify)
library(tidyverse)
library(GEOquery)
library(BiocManager)
# run this line if you get a VROOM error
Sys.setenv(VROOM_CONNECTION_SIZE = 131072 * 2)
gset <- getGEO("GSE54467",GSEMatrix =TRUE, getGPL=FALSE, destdir = '.')[[1]]
gset
## Setting destdir will ensure this file is only downloaded once.
## Get phenotype data
pheno <- pData(gset)
colnames(pheno) <- gsub(" ", "_", colnames(pheno))
colnames(pheno) <- sub(":ch1", "", colnames(pheno))
## Get gene expression data
expVal <- exprs(gset)
boxplot(expVal[,1:5])
## Get gene annotations
geneAnno <- fData(gset)
#this is empty (due to getGPL=TRUE resulting in a download error so changing to getGPL=FALSE)
#alternative way to get some annotation
BiocManager::install("illuminaHumanv3.db")
library(illuminaHumanv3.db)
geneAnno <- data.frame(rownames(expVal))
colnames(geneAnno) <- gsub("rownames.expVal.", "PROBEID", colnames(geneAnno))
ENTREZID <- data.frame(mapIds(illuminaHumanv3.db, keys = as.character(geneAnno$PROBEID), column = c("ENTREZID"), keytype = "PROBEID", multiVals = "first"))
colnames(ENTREZID) <- c("ENTREZID")
geneSymb <- data.frame(mapIds(illuminaHumanv3.db, keys = as.character(geneAnno$PROBEID), column = c("SYMBOL"),keytype = "PROBEID", multiVals = "first"))
colnames(geneSymb) <- c("SYMBOL")
geneAnno2 <- cbind(geneAnno,ENTREZID,geneSymb)
fData(gset) <- geneAnno2
pca <- prcomp(t(expVal),scale = TRUE)
autoplot(pca, pheno, colour = "stage_at_primary_diagnosis_5th_edition")
## There is no obvious structure in the PCA plot. However, we don't have the sample batch information.
We would like to see if there are any genes associated with patient survival. Lets start by looking at the survival data.
pheno$status <- pheno$patient_last_status
pheno$survival <- as.numeric(pheno$`survival_from_stage_iii_tumour_banked_(months)`)
pheno %>%
filter(status %in% c("Alive NSR", "Dead Melanoma")) %>%
ggplot(aes(x = survival/12, colour = status)) + geom_density() + theme_classic() +
geom_vline(xintercept = c(1,4), linetype = "dashed")
Graham and his team decided for a few reasons that it would be advantages to compare patients that did not survive one year and survived more than four years.
pheno$survClass <- NA
pheno$survClass[pheno$survival/12>=4& pheno$status == "Alive NSR"] <- "Good"
pheno$survClass[pheno$survival/12<=1 & pheno$status == "Dead Melanoma"] <- "Poor"
table(pheno$survClass)
Now lets find differentially expressed genes
## Make a design matrix
design = model.matrix(~survClass, pheno)
dim(design)
dim(pheno)
## Fit linear models
fit = lmFit(gset[,rownames(design)], design, method = "robust") # I wonder why I used a robust fit?
efit = eBayes(fit,trend = TRUE)
## Top genes
dt_fdr <- decideTests(efit, adjust.method = "fdr", p.value = 0.05)
summary(dt_fdr)
volcanoplot(efit, coef = "survClassPoor", highlight = 10, names = efit$genes$SYMBOL)
## Pull out top 10 genes
top10 <- topTable(efit, n=10)
## Pull out top 10 genes
top100 <- topTable(efit, n=100)
These genes might be hard to interpret as there are a lot of them. Lets perform some form of pathway analysis.
load(url("http://bioinf.wehi.edu.au/software/MSigDB/human_c2_v5p2.rdata"))
idx <- ids2indices(Hs.c2,id=efit$genes$ENTREZID)
cam<- camera(expVal[,rownames(design)],idx,design,contrast="survClassPoor",trend.var = TRUE)
head(cam,10)
library(gplots)
top100DE <- rownames(dt_fdr)[1:100]
columnColour = pheno$survClass
columnColour[pheno$survClass== 'Poor'] = 'red'
columnColour[pheno$survClass== 'Good'] = 'white'
useColumns <- !is.na(columnColour)
coolmap(expVal[top100DE,useColumns ], ColSideColors = columnColour[useColumns])
## Compare to top 100 variable genes
top100Var = names(sort(-apply(expVal,1,var)))[1:100]
coolmap(expVal[top100Var,useColumns ], ColSideColors = columnColour[useColumns])
##You may need to reset the graphics plot pane using dev.off() if you get an error plotting these heatmaps.
top10Var = names(sort(-apply(expVal,1,var)))[1:10]
We will start with some feature selection. Determine a vector of the row numbers of the 10 largest (by absolute value) t-statistics and subset the table of gene expression measurements by it.
# Choose top one hundred genes and transpose.
X <- t(expVal[rownames(top10), rownames(design)])
# Select outcome variable.
Y <- as.factor(pheno[rownames(design), "survClass"])
# create new dataset.
data <- data.frame(Y,X)
Use SVM to model our prognosis class with the gene expression
library(e1071)
fit <- svm(Y~., data)
Calculate a resubstitution error rate by resubstituting our original data into the model.
predSVM <- predict(fit, data) #Here we have predicted the classes of our original data.
mean(predSVM != data$Y)
By today, our group should be ready to update myself or one of the demonstrators on your progress so far with the reproducible report assessment. Questions we will be keen to hear from you about will include:
Select a classifier that is not SVM and write a report demonstrating that you can build and evaluate a classifier using 5-fold cross-validation with the Stage III Melanoma case study (which you will work on in weeks 11 and 12). It is optional whether you include feature selection outside or within your cross-validation framework. Present your results and code in a reproducible report (e.g. Rmarkdown report). Be complete in writing up your findings. This report contributes to 4% of your final assessment.
The marking is as follows:
The report is due 11:59pm Friday 19th May.