Aims: 
 Construct a classification rule for an omics case study.
 Write a for-loop.

1 Group work.

As a group present 2-4 slides (which were prepared before class) for 2-3 minutes quickly explaining:

What is the broad research question of your chosen manuscript.
What does the data look like.
A draft flow diagram of their analysis.
Some interesting figures/results you hope to reproduce.

2 Packages

To complete your lab report you may need to install the class package, randomForest package and/or the e1071 package.

3 Coding exercises

Please attempt these coding exercises. Please ask your demonstrators for help if they are tough. Ask for further help on the dsicussion board or in the lecture.

3.1 The function `for()`

A for-loop can be used to evaluate a certain expression for a sequence of numbers or characters. The syntax looks like this…

aa <- 10:6

for(bb in aa){

print(bb)

}

This is effectively the same as…

bb <- aa[1]

print(bb)

bb <- aa[2]

print(bb)

bb <- aa[3]

print(bb)

bb <- aa[4]

print(bb)

bb <- aa[5]

print(bb)

Please spend one minute attempting the following before looking at the code.

Lets start by writing some code using a for loop to produce the following output.

## [1]  5  6  7  8  9 10

x = rep(NA, 6)
x

for (i in 1:6) {
    x[i] <- i + 4
}
x

Please write some code using a for loop to produce the following output. Hint: have a look at the vector letters in R.

## [1] "b" "c" "d" "e" "f" "g" "h" "i"

x = rep(NA, 8)
for (i in 1:8) {
    x[i] <- letters[i + 1]
}
x

Create a 5 x 6 matrix (of 5 rows and 6 columns). For each row and for each column, assign the value of an element in the matrix based on it’s position: the product of the row number and column number.

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    2    3    4    5    6
## [2,]    2    4    6    8   10   12
## [3,]    3    6    9   12   15   18
## [4,]    4    8   12   16   20   24
## [5,]    5   10   15   20   25   30

mymat <- matrix(NA, nrow = 5, ncol = 6)
for (i in 1:nrow(mymat)) {
    for (j in 1:ncol(mymat)) {
        mymat[i, j] = i * j
    }
}
mymat

4 Case study - Stage III melanoma cohort.

4.1 Background

In patients with metastatic melanoma, the identification and validation of accurate prognostic biomarkers will assist rational treatment planning. Studies based on “-omics” technologies have focussed on a single high-throughput data type such as gene or microRNA transcripts. Occasionally, these features were evaluated in conjunction with limited clinico-pathologic data. With the increased availability of multiple data types, there is a pressing need to tease apart which of these sources contain the most valuable prognostic information. Professor Graham Mann and his group evaluated and integrated several data types derived from the same tumor specimens in AJCC stage III melanoma patients - gene, protein, and microRNA expression as well as clinical, pathologic and mutation information - to determine their relative impact on prognosis. They used classification frameworks based on pre-validation and bootstrap multiple imputation classification to compare the prognostic power of each data source, both individually as well as integratively. They found that the prognostic utility of clinico-pathologic information was not out-performed by various “-omics” platforms. Rather, a combination of clinico-pathologic variables and mRNA expression data performed best. Furthermore, a patient-based classification analysis revealed that the prognostic accuracy of various data types was not the same for different patients, providing useful insights for ongoing developments in the individualized treatment of melanomas patients.

4.2 Read in data with GEOquery

The stage III melanoma cohort microarray data is stored on the Gene Expression Omnibus website with accession number GSE54467.

Explore what is on GEO website for GSE54467, what type of data and information is here?

library(limma)
library(ggfortify)
library(tidyverse)
library(GEOquery)
library(BiocManager)

# run this line if you get a VROOM error
Sys.setenv(VROOM_CONNECTION_SIZE = 131072 * 2)

gset <- getGEO("GSE54467", GSEMatrix = TRUE, getGPL = FALSE, destdir = ".")[[1]]
gset

## Setting destdir will ensure this file is only downloaded once.




## Get phenotype data
pheno <- pData(gset)
colnames(pheno) <- gsub(" ", "_", colnames(pheno))
colnames(pheno) <- sub(":ch1", "", colnames(pheno))


## Get gene expression data
expVal <- exprs(gset)

boxplot(expVal[, 1:5])

## Get gene annotations
geneAnno <- fData(gset)
# this is empty (due to getGPL=TRUE resulting in a download error so changing
# to getGPL=FALSE)


# alternative way to get some annotation
# BiocManager::install('illuminaHumanv3.db')
library(illuminaHumanv3.db)
geneAnno <- data.frame(rownames(expVal))
colnames(geneAnno) <- gsub("rownames.expVal.", "PROBEID", colnames(geneAnno))
ENTREZID <- data.frame(mapIds(illuminaHumanv3.db, keys = as.character(geneAnno$PROBEID),
    column = c("ENTREZID"), keytype = "PROBEID", multiVals = "first"))
colnames(ENTREZID) <- c("ENTREZID")
geneSymb <- data.frame(mapIds(illuminaHumanv3.db, keys = as.character(geneAnno$PROBEID),
    column = c("SYMBOL"), keytype = "PROBEID", multiVals = "first"))
colnames(geneSymb) <- c("SYMBOL")
geneAnno2 <- cbind(geneAnno, ENTREZID, geneSymb)
fData(gset) <- geneAnno2

4.3 Association with survival

We would like to see if there are any genes associated with patient survival. Lets start by looking at the survival data.

pheno$status <- pheno$patient_last_status
pheno$survival <- as.numeric(pheno$`survival_from_stage_iii_tumour_banked_(months)`)/12

Graham and his team decided for a few reasons that it would be advantages to compare patients that did not survive one year and survived more than four years.

pheno$survClass <- NA
pheno$survClass[pheno$survival >= 4 & pheno$status == "Alive NSR"] <- "Good"
pheno$survClass[pheno$survival <= 1 & pheno$status == "Dead Melanoma"] <- "Poor"

table(pheno$survClass)

Now lets find differentially expressed genes

## Make a design matrix
design = model.matrix(~survClass, pheno)
dim(design)
dim(pheno)

## Fit linear models
fit = lmFit(gset[, rownames(design)], design, method = "robust")  # Why did I used a robust fit?
efit = eBayes(fit, trend = TRUE)

## How many genes are significant at FDR < 0.05
dt_fdr <- decideTests(efit)
summary(dt_fdr)

## Pull out top 10 genes
top10 <- topTable(efit, n = 10)

4.4 Implement SVM classifier

We will start with some feature selection. Determine a vector of the row numbers of the 10 largest (by absolute value) t-statistics and subset the table of gene expression measurements by it.

# Choose top one hundred genes and transpose.
X <- t(expVal[rownames(top10), rownames(design)])
# Select outcome variable.
Y <- as.factor(pheno[rownames(design), "survClass"])

# create new dataset.
data <- data.frame(Y, X)
n <- nrow(data)  # The number of observations
# Create somewhere to store our predictions
test <- factor(rep(NA, n), levels = c("Good", "Poor"))

Use SVM to model our prognosis class with the gene expression

library(e1071)
fit <- svm(Y ~ ., data)

Calculate a resubstitution error rate by resubstituting our original data into the model.

predSVM <- predict(fit, data)  #Here we have predicted the classes of our original data.
mean(predSVM != data$Y)

4.5 Test-set error

Test error should feel very intuitive. A dataset is used to train a model.
A separate dataset is then used to assess the performance of that model.
We don’t have two data sets. Instead, lets split what we have into a training dataset and a test dataset.
Note our use of rep() and sample()

set.seed(51773)
n <- nrow(data)  # The number of observations

# Create a vector with the same amount of train and test labels.  There are
# better ways to do this. This is my way.
trainTest <- rep(c("train", "test"), 100)
trainTest <- trainTest[1:n]

# Reorder these to avoid systematic ordering bias
trainTestSample <- sample(trainTest, n, replace = FALSE)

# Split data into a training set and a test set.
dataTrain <- data[trainTestSample == "train", ]
dataTest <- data[trainTestSample == "test", ]

Train a support vector machine on our training set and then evaluate it on the test set.

fit <- svm(Y ~ ., dataTrain)
predSVM <- predict(fit, dataTest)
mean(predSVM != dataTest$Y)

4.6 2-fold cross validation for SVM

Start by creating two folds.

set.seed(51773)
n <- nrow(data)  # The number of observations

# Create a vector with the same amount of fold labels
fold <- rep(c("1", "2"), 100)
fold <- fold[1:n]

# Reorder these to avoid systematic ordering bias
foldSample <- sample(fold, n, replace = FALSE)

Create somewhere to store our predictions

# If your response isn't a factor you might just want rep(NA,n)
predSVM <- factor(rep(NA, n), levels = c("Good", "Poor"))

Test on fold 1. Train on fold 2

dataTest <- data[foldSample == 1, ]
dataTrain <- data[foldSample != 1, ]

fit <- svm(Y ~ ., dataTrain)
predSVM[foldSample == 1] <- predict(fit, dataTest)

Test on fold 2. Train on fold 1

dataTest <- data[foldSample == 2, ]
dataTrain <- data[foldSample != 2, ]

fit <- svm(Y ~ ., dataTrain)
predSVM[foldSample == 2] <- predict(fit, dataTest)

Calculate error

mean(predSVM != data$Y)

4.7 For-loop: 3-fold cross validation

Modify the previous code to implement 3-fold cross validation with a for-loop.

Start by creating three folds.

set.seed(51773)
n <- nrow(data)  # The number of observations
nfolds <- 3

# Create a vector with the same amount of fold labels
fold <- rep(1:nfolds, 100)
fold <- fold[1:n]

# Reorder these to avoid systematic ordering bias
foldSample <- sample(fold, n, replace = FALSE)

Create somewhere to store our predictions

# If your response isn't a factor you might just want rep(NA,n)
predSVM <- factor(rep(NA, n), levels = c("Good", "Poor"))

Test on fold i. Train on the rest

for (i in 1:nfolds) {

    dataTest <- data[foldSample == i, ]
    dataTrain <- data[foldSample != i, ]

    fit <- svm(Y ~ ., dataTrain)
    predSVM[foldSample == i] <- predict(fit, dataTest)

}

Calculate error

mean(predSVM != data$Y)

4.8 Module 4 lab report

Select a classifier that is not SVM and write a report demonstrating that you can build and evaluate a classifier using 5-fold cross-validation with the Stage III Melanoma case study. It is optional whether you include feature selection outside or within your cross-validation framework. Present your results and code in a reproducible report (e.g. Rmarkdown report). Be complete in writing up your findings. This report contributes to 2.5% of your final assessment.

The marking is as follows:

1 mark is given for the DE analysis.
1 mark is given for building the classifier.
1 marks are given for generating a cross-validation code.
1 mark is given for writing an explanation of what you have done.

The report is due 11:59pm Friday 27th May.

AMED3002 Lab 11

Supervised learning / class prediction

1 Group work.

2 Packages

3 Coding exercises

3.1 The function `for()`

4 Case study - Stage III melanoma cohort.

4.1 Background

4.2 Read in data with GEOquery

4.3 Association with survival

4.4 Implement SVM classifier

4.5 Test-set error

4.6 2-fold cross validation for SVM

4.7 For-loop: 3-fold cross validation

4.8 Module 4 lab report

AMED3002 Lab 11

Supervised learning / class prediction

1 Group work.

2 Packages

3 Coding exercises

3.1 The function for()

4 Case study - Stage III melanoma cohort.

4.1 Background

4.2 Read in data with GEOquery

4.3 Association with survival

4.4 Implement SVM classifier

4.5 Test-set error

4.6 2-fold cross validation for SVM

4.7 For-loop: 3-fold cross validation

4.8 Module 4 lab report

3.1 The function `for()`