Aims:
Construct a classification rule for an omics case study.
Write a for-loop.
To complete your lab report you may need to install the class, randomForest and/or e1071 packages.
Please attempt these coding exercises. Ask your demonstrators for help if you find them difficult, or ask for further help on the discussion board or in the workshop.
for()
A for-loop evaluates an expression once for each element of a sequence of numbers or characters. The syntax looks like this…
aa <- 10:6
for (bb in aa) {
    print(bb)
}
This is effectively the same as…
bb <- aa[1]
print(bb)
bb <- aa[2]
print(bb)
bb <- aa[3]
print(bb)
bb <- aa[4]
print(bb)
bb <- aa[5]
print(bb)
Please spend one minute attempting each of the following exercises before looking at the code. In each case, write a for-loop that produces the output shown.
## [1] 5 6 7 8 9 10
x = rep(NA, 6)
x
for (i in 1:6) {
    x[i] <- i + 4
}
x
## [1] "b" "c" "d" "e" "f" "g" "h" "i"
x = rep(NA, 8)
for (i in 1:8) {
    x[i] <- letters[i + 1]
}
x
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 2 3 4 5 6
## [2,] 2 4 6 8 10 12
## [3,] 3 6 9 12 15 18
## [4,] 4 8 12 16 20 24
## [5,] 5 10 15 20 25 30
mymat <- matrix(NA, nrow = 5, ncol = 6)
for (i in 1:nrow(mymat)) {
    for (j in 1:ncol(mymat)) {
        mymat[i, j] <- i * j
    }
}
mymat
In patients with metastatic melanoma, the identification and validation of accurate prognostic biomarkers will assist rational treatment planning. Studies based on “-omics” technologies have focussed on a single high-throughput data type such as gene or microRNA transcripts. Occasionally, these features were evaluated in conjunction with limited clinico-pathologic data. With the increased availability of multiple data types, there is a pressing need to tease apart which of these sources contain the most valuable prognostic information. Professor Graham Mann and his group evaluated and integrated several data types derived from the same tumor specimens in AJCC stage III melanoma patients - gene, protein, and microRNA expression as well as clinical, pathologic and mutation information - to determine their relative impact on prognosis. They used classification frameworks based on pre-validation and bootstrap multiple imputation classification to compare the prognostic power of each data source, both individually and integratively. They found that the prognostic utility of clinico-pathologic information was not out-performed by various “-omics” platforms. Rather, a combination of clinico-pathologic variables and mRNA expression data performed best. Furthermore, a patient-based classification analysis revealed that the prognostic accuracy of various data types was not the same for different patients, providing useful insights for ongoing developments in the individualized treatment of melanoma patients.
The stage III melanoma cohort microarray data is stored on the Gene Expression Omnibus website with accession number GSE54467.
Explore what is on the GEO website for GSE54467. What types of data and information are available there?
library(limma)
library(ggfortify)
library(tidyverse)
library(GEOquery)
library(BiocManager)
# run this line if you get a VROOM error
Sys.setenv(VROOM_CONNECTION_SIZE = 131072 * 2)
gset <- getGEO("GSE54467", GSEMatrix = TRUE, getGPL = FALSE, destdir = ".")[[1]]
gset
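Before pulling out the individual components, it can help to confirm what the downloaded object contains (a minimal sketch; exprs(), pData() and annotation() come from Biobase, which GEOquery loads for you).
dim(exprs(gset))  # probes x samples
dim(pData(gset))  # samples x phenotype variables
annotation(gset)  # the microarray platform identifier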
## Setting destdir will ensure this file is only downloaded once.
## Get phenotype data
pheno <- pData(gset)
colnames(pheno) <- gsub(" ", "_", colnames(pheno))
colnames(pheno) <- sub(":ch1", "", colnames(pheno))
## Get gene expression data
expVal <- exprs(gset)
boxplot(expVal[, 1:5])
## Get gene annotations
geneAnno <- fData(gset)
# geneAnno is empty here because we set getGPL = FALSE
# (getGPL = TRUE caused a download error), so we build the annotation another way
# alternative way to get some annotation
# BiocManager::install('illuminaHumanv3.db')
library(illuminaHumanv3.db)
geneAnno <- data.frame(PROBEID = rownames(expVal))
ENTREZID <- data.frame(mapIds(illuminaHumanv3.db, keys = as.character(geneAnno$PROBEID),
column = c("ENTREZID"), keytype = "PROBEID", multiVals = "first"))
colnames(ENTREZID) <- c("ENTREZID")
geneSymb <- data.frame(mapIds(illuminaHumanv3.db, keys = as.character(geneAnno$PROBEID),
column = c("SYMBOL"), keytype = "PROBEID", multiVals = "first"))
colnames(geneSymb) <- c("SYMBOL")
geneAnno2 <- cbind(geneAnno, ENTREZID, geneSymb)
fData(gset) <- geneAnno2
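Before moving on, it is worth checking how many probes were actually assigned a gene symbol (a minimal sketch).
## Proportion of probes that mapped to a gene symbol
mean(!is.na(geneAnno2$SYMBOL))
head(fData(gset))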
We would like to see if there are any genes associated with patient survival. Let's start by looking at the survival data.
pheno$status <- pheno$patient_last_status
pheno$survival <- as.numeric(pheno$`survival_from_stage_iii_tumour_banked_(months)`)/12
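Before defining classes, a quick look at the follow-up times and patient status gives a feel for the data (a minimal sketch; the status labels come from the GEO phenotype table).
table(pheno$status)
summary(pheno$survival)
hist(pheno$survival, xlab = "Survival from stage III tumour banked (years)", main = "")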
Graham and his team decided, for a few reasons, that it would be advantageous to compare patients who did not survive one year with those who survived more than four years.
pheno$survClass <- NA
pheno$survClass[pheno$survival >= 4 & pheno$status == "Alive NSR"] <- "Good"
pheno$survClass[pheno$survival <= 1 & pheno$status == "Dead Melanoma"] <- "Poor"
table(pheno$survClass)
Now let's find differentially expressed genes.
## Make a design matrix
design = model.matrix(~survClass, pheno)
dim(design)
dim(pheno)
## Fit linear models
fit = lmFit(gset[, rownames(design)], design, method = "robust")  # Why did I use a robust fit?
efit = eBayes(fit, trend = TRUE)
## How many genes are significant at FDR < 0.05
dt_fdr <- decideTests(efit)
summary(dt_fdr)
## Pull out top 10 genes
top10 <- topTable(efit, n = 10)
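Because we stored the probe annotation in fData(gset) before fitting, the table returned by topTable should carry the gene symbols through. A quick look (a minimal sketch; the exact column names are assumed):
## Inspect the top-ranked probes with their gene symbols
top10[, c("SYMBOL", "logFC", "t", "P.Value", "adj.P.Val")]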
We will start with some feature selection. Determine a vector of the row numbers of the 10 largest (by absolute value) t-statistics and subset the table of gene expression measurements by it.
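One way to get such a vector of row numbers is to order the absolute moderated t-statistics from the fitted model (a minimal sketch; the coefficient name "survClassPoor" is assumed from the design matrix above). The code below instead subsets by the probes in top10, which serves the same purpose.
## Row numbers of the 10 largest absolute t-statistics for the survClass coefficient
top10idx <- order(abs(efit$t[, "survClassPoor"]), decreasing = TRUE)[1:10]
## Subset the expression matrix by these rows
expTop <- expVal[top10idx, rownames(design)]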
# Choose the top ten genes and transpose so that samples are rows.
X <- t(expVal[rownames(top10), rownames(design)])
# Select outcome variable.
Y <- as.factor(pheno[rownames(design), "survClass"])
# create new dataset.
data <- data.frame(Y, X)
n <- nrow(data) # The number of observations
# Create somewhere to store our predictions
test <- factor(rep(NA, n), levels = c("Good", "Poor"))
Use an SVM to model our prognosis class with the gene expression values.
library(e1071)
fit <- svm(Y ~ ., data)
Calculate a resubstitution error rate by predicting the classes of the same data used to fit the model.
predSVM <- predict(fit, data) #Here we have predicted the classes of our original data.
mean(predSVM != data$Y)
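A confusion matrix gives a more detailed picture than a single error rate (a minimal sketch).
## Cross-tabulate predicted against true classes
table(Predicted = predSVM, Truth = data$Y)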
rep() and sample()
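If you have not met these functions before, here is a minimal sketch of what rep() and sample() do on their own:
rep(c("train", "test"), 3)   # repeat a vector three times
sample(c("a", "b", "c"))     # a random permutation of the input
sample(1:10, 3)              # three values drawn without replacement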
set.seed(51773)
n <- nrow(data) # The number of observations
# Create a vector with roughly equal numbers of train and test labels. There are
# better ways to do this; this is my way.
trainTest <- rep(c("train", "test"), 100)
trainTest <- trainTest[1:n]
# Reorder these to avoid systematic ordering bias
trainTestSample <- sample(trainTest, n, replace = FALSE)
# Split data into a training set and a test set.
dataTrain <- data[trainTestSample == "train", ]
dataTest <- data[trainTestSample == "test", ]
Train a support vector machine on our training set and then evaluate it on the test set.
fit <- svm(Y ~ ., dataTrain)
predSVM <- predict(fit, dataTest)
mean(predSVM != dataTest$Y)
Start by creating two folds.
set.seed(51773)
n <- nrow(data) # The number of observations
# Create a vector with roughly equal numbers of each fold label
fold <- rep(c("1", "2"), 100)
fold <- fold[1:n]
# Reorder these to avoid systematic ordering bias
foldSample <- sample(fold, n, replace = FALSE)
Create somewhere to store our predictions
# If your response isn't a factor you might just want rep(NA,n)
predSVM <- factor(rep(NA, n), levels = c("Good", "Poor"))
Test on fold 1. Train on fold 2
dataTest <- data[foldSample == 1, ]
dataTrain <- data[foldSample != 1, ]
fit <- svm(Y ~ ., dataTrain)
predSVM[foldSample == 1] <- predict(fit, dataTest)
Test on fold 2. Train on fold 1
dataTest <- data[foldSample == 2, ]
dataTrain <- data[foldSample != 2, ]
fit <- svm(Y ~ ., dataTrain)
predSVM[foldSample == 2] <- predict(fit, dataTest)
Calculate error
mean(predSVM != data$Y)
Modify the previous code to implement 3-fold cross validation with a for-loop.
Start by creating three folds.
set.seed(51773)
n <- nrow(data) # The number of observations
nfolds <- 3
# Create a vector with roughly equal numbers of each fold label
fold <- rep(1:nfolds, 100)
fold <- fold[1:n]
# Reorder these to avoid systematic ordering bias
foldSample <- sample(fold, n, replace = FALSE)
Create somewhere to store our predictions
# If your response isn't a factor you might just want rep(NA,n)
predSVM <- factor(rep(NA, n), levels = c("Good", "Poor"))
Test on fold i. Train on the rest
for (i in 1:nfolds) {
    dataTest <- data[foldSample == i, ]
    dataTrain <- data[foldSample != i, ]
    fit <- svm(Y ~ ., dataTrain)
    predSVM[foldSample == i] <- predict(fit, dataTest)
}
Calculate error
mean(predSVM != data$Y)
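Since the Good and Poor groups may differ in size, it can also be useful to look at the cross-validated error rate within each class (a minimal sketch).
## Error rate within each true class
tapply(predSVM != data$Y, data$Y, mean)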
Select a classifier that is not SVM and write a report demonstrating that you can build and evaluate a classifier using 5-fold cross-validation on the Stage III Melanoma case study. It is optional whether you perform feature selection outside or within your cross-validation framework. Present your results and code in a reproducible report (e.g. an R Markdown report). Be complete in writing up your findings. This report contributes 4% of your final assessment. A minimal starting sketch is given below.
The marking is as follows:
The report is due 11:59pm Friday 17th May.
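As a starting point only (a minimal sketch, not a model answer): the 3-fold loop above generalises directly to 5 folds, and the svm() call can be swapped for another classifier, for example knn() from the class package mentioned in the aims. The choice of k = 5 neighbours here is arbitrary.
library(class)
set.seed(51773)
n <- nrow(data)
nfolds <- 5
fold <- rep(1:nfolds, length.out = n)
foldSample <- sample(fold, n, replace = FALSE)
predKNN <- factor(rep(NA, n), levels = c("Good", "Poor"))
for (i in 1:nfolds) {
    dataTest <- data[foldSample == i, ]
    dataTrain <- data[foldSample != i, ]
    ## knn() takes the predictor columns and the class labels separately
    predKNN[foldSample == i] <- knn(dataTrain[, -1], dataTest[, -1], cl = dataTrain$Y, k = 5)
}
mean(predKNN != data$Y)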