1 Case study - Stage III melanoma cohort.

1.1 Background

In patients with metastatic melanoma, the identification and validation of accurate prognostic biomarkers will assist rational treatment planning. Studies based on “-omics” technologies have focussed on a single high-throughput data type such as gene or microRNA transcripts. Occasionally, these features were evaluated in conjunction with limited clinico-pathologic data. With the increased availability of multiple data types, there is a pressing need to tease apart which of these sources contain the most valuable prognostic information. Professor Graham Mann and his group evaluated and integrated several data types derived from the same tumor specimens in AJCC stage III melanoma patients - gene, protein, and microRNA expression as well as clinical, pathologic and mutation information - to determine their relative impact on prognosis. They used classification frameworks based on pre-validation and bootstrap multiple imputation classification to compare the prognostic power of each data source, both individually as well as integratively. They found that the prognostic utility of clinico-pathologic information was not out-performed by various “-omics” platforms. Rather, a combination of clinico-pathologic variables and mRNA expression data performed best. Furthermore, a patient-based classification analysis revealed that the prognostic accuracy of various data types was not the same for different patients, providing useful insights for ongoing developments in the individualized treatment of melanomas patients.

1.2 Finding data

The stage III melanoma cohort microarray data is stored on the Gene Expression Omnibus website with accession number GSE54467.

Explore what is on GEO website for GSE54467, what type of data and information is here?

1.3 Lets use a bioconductor package GEOquery

library(limma)
library(ggfortify)
library(tidyverse)
library(GEOquery)
library(BiocManager)

# run this line if you get a VROOM error
Sys.setenv(VROOM_CONNECTION_SIZE = 131072 * 2)

gset <- getGEO("GSE54467", GSEMatrix = TRUE, getGPL = FALSE, destdir = ".")[[1]]
gset

## Setting destdir will ensure this file is only downloaded once.




## Get phenotype data
pheno <- pData(gset)
colnames(pheno) <- gsub(" ", "_", colnames(pheno))
colnames(pheno) <- sub(":ch1", "", colnames(pheno))


## Get gene expression data
expVal <- exprs(gset)

boxplot(expVal[, 1:5])

## Get gene annotations
geneAnno <- fData(gset)
# this is empty (due to getGPL=TRUE resulting in a download error so changing
# to getGPL=FALSE)


# alternative way to get some annotation
BiocManager::install("illuminaHumanv3.db")
library(illuminaHumanv3.db)
geneAnno <- data.frame(rownames(expVal))
colnames(geneAnno) <- gsub("rownames.expVal.", "PROBEID", colnames(geneAnno))
ENTREZID <- data.frame(mapIds(illuminaHumanv3.db, keys = as.character(geneAnno$PROBEID),
    column = c("ENTREZID"), keytype = "PROBEID", multiVals = "first"))
colnames(ENTREZID) <- c("ENTREZID")
geneSymb <- data.frame(mapIds(illuminaHumanv3.db, keys = as.character(geneAnno$PROBEID),
    column = c("SYMBOL"), keytype = "PROBEID", multiVals = "first"))
colnames(geneSymb) <- c("SYMBOL")
geneAnno2 <- cbind(geneAnno, ENTREZID, geneSymb)
fData(gset) <- geneAnno2

1.4 Data exploration.

pca <- prcomp(t(expVal), scale = TRUE)
autoplot(pca, pheno, colour = "stage_at_primary_diagnosis_5th_edition")
## There is no obvious structure in the PCA plot. However, we don't have the
## sample batch information.

1.5 Association with survival

We would like to see if there are any genes associated with patient survival. Lets start by looking at the survival data.

pheno$status <- pheno$patient_last_status
pheno$survival <- as.numeric(pheno$`survival_from_stage_iii_tumour_banked_(months)`)

pheno %>%
    filter(status %in% c("Alive NSR", "Dead Melanoma")) %>%
    ggplot(aes(x = survival/12, colour = status)) + geom_density() + theme_classic() +
    geom_vline(xintercept = c(1, 4), linetype = "dashed")

Graham and his team decided for a few reasons that it would be advantages to compare patients that did not survive one year and survived more than four years.

pheno$survClass <- NA
pheno$survClass[pheno$survival/12 >= 4 & pheno$status == "Alive NSR"] <- "Good"
pheno$survClass[pheno$survival/12 <= 1 & pheno$status == "Dead Melanoma"] <- "Poor"

table(pheno$survClass)

Now lets find differentially expressed genes

## Make a design matrix
design = model.matrix(~survClass, pheno)
dim(design)
dim(pheno)

## Fit linear models
fit = lmFit(gset[, rownames(design)], design, method = "robust")  # I wonder why I used a robust fit?
efit = eBayes(fit, trend = TRUE)

## Top genes
dt_fdr <- decideTests(efit, adjust.method = "fdr", p.value = 0.05)
summary(dt_fdr)
volcanoplot(efit, coef = "survClassPoor", highlight = 10, names = efit$genes$SYMBOL)

## Pull out top 10 genes
top10 <- topTable(efit, n = 10)

## Pull out top 10 genes
top100 <- topTable(efit, n = 100)

1.6 Pathway analysis

These genes might be hard to interpret as there are a lot of them. Lets perform some form of pathway analysis.

load(url("http://bioinf.wehi.edu.au/software/MSigDB/human_c2_v5p2.rdata"))
idx <- ids2indices(Hs.c2, id = efit$genes$ENTREZID)
cam <- camera(expVal[, rownames(design)], idx, design, contrast = "survClassPoor",
    trend.var = TRUE)
head(cam, 10)

1.7 Heatmap

library(gplots)

top100DE <- rownames(dt_fdr)[1:100]

columnColour = pheno$survClass
columnColour[pheno$survClass == "Poor"] = "red"
columnColour[pheno$survClass == "Good"] = "white"

useColumns <- !is.na(columnColour)

coolmap(expVal[top100DE, useColumns], ColSideColors = columnColour[useColumns])


## Compare to top 100 variable genes
top100Var = names(sort(-apply(expVal, 1, var)))[1:100]
coolmap(expVal[top100Var, useColumns], ColSideColors = columnColour[useColumns])
top10Var = names(sort(-apply(expVal, 1, var)))[1:10]

1.8 Implement SVM classifier

We will start with some feature selection. Determine a vector of the row numbers of the 10 largest (by absolute value) t-statistics and subset the table of gene expression measurements by it.

# Choose top one hundred genes and transpose.
X <- t(expVal[rownames(top10), rownames(design)])
# Select outcome variable.
Y <- as.factor(pheno[rownames(design), "survClass"])

# create new dataset.
data <- data.frame(Y, X)

Use SVM to model our prognosis class with the gene expression

library(e1071)
fit <- svm(Y ~ ., data)

Calculate a resubstitution error rate by resubstituting our original data into the model.

predSVM <- predict(fit, data)  #Here we have predicted the classes of our original data.
mean(predSVM != data$Y)

1.9 Research Project 2 group work.

At the start of next week’s workshop, each group will be asked to present 2-4 slides informally (which were prepared before class) for 2-3 minutes to one of our demonstrators quickly explaining:

  • What is the broad research question of your chosen manuscript.
  • What does the data look like.
  • A draft flow diagram of their analysis.
  • Some interesting figures/results you hope to reproduce.

You may wish to start getting organised for this in today’s workshop.

1.10 Module 4 lab report

Select a classifier that is not SVM and write a report demonstrating that you can build and evaluate a classifier using 5-fold cross-validation with the Stage III Melanoma case study. It is optional whether you include feature selection outside or within your cross-validation framework. Present your results and code in a reproducible report (e.g. Rmarkdown report). Be complete in writing up your findings. This report contributes to 4% of your final assessment.

The marking is as follows:

  • 1 mark is given for the DE analysis.
  • 1 mark is given for building the classifier.
  • 1 marks are given for generating a cross-validation code.
  • 1 mark is given for writing an explanation of what you have done.

The report is due 11:59pm Friday 27th May.