Aims: 
 Perform a differential expression analysis. 
 Correct for multiple hypothesis testing. 
 Visualise results.

1 Preparation for lab

Please make an effort to complete your own report by typing your own code and (most importantly) expressing your own conclusions and interpretations. Try to use your expanding knowledge of R to modify this example code in places where you think it can be altered/improved. Ultimately, this workflow will go from obtaining data from GEO through to pathway analysis, however we will complete the workflow in 3 stages. In week 7 we will focus on obtaining the data and getting through to PCA analysis. In week 8 we will perfom differential gene expression analysis and some visualisations, and in week 9 we will get to the pathway analysis. The workflow uses popular bioconductor packages and is adapted from the published RNAseq 1-2-3 workflow (https://f1000research.com/articles/5-1408).

2 Data exercises

2.1 B cells and EBV infected B cells (LCLs)

The following data comes from a manuscript “Evidence from genome wide association studies implicates reduced control of Epstein-Barr virus infection in multiple sclerosis susceptibility” (https://www.ncbi.nlm.nih.gov/pubmed/31039804). There were multiple goals for the RNAseq component of project:

Identify the gene expression changes that occur when B cells are infected with EBV and transformed into LCLs.
Determine which pathways are over-represented in the differentially expressed genes.
Assess if there are more MS risk genes then expected by chance in the differentially expressed genes.

Use this dataset as an opportunity to explore the behaviour of a small RNA-Seq experiment. In this lab we will start by just processing and exploring the data.

IMPORTANT: The code from week 8 continues on from the code generated in week 7. The new steps introduced this week start from step 5.

1A. Install / load required R packages

library(tidyverse)
library(ggfortify)
library(RColorBrewer)



# If any packages of these required are not installed, they can be installed by
# first installing bioconductor: if (!requireNamespace('BiocManager', quietly =
# TRUE)) install.packages('BiocManager') then using
# BiocManager::install('PACKAGENAME')

# load packages to be used in the analysis
library(GEOquery)
library(org.Hs.eg.db)
library(limma)
library(edgeR)
library(Glimma)
library(Homo.sapiens)
library(gplots)
library(dplyr)

1B. Read in the data.

# The count data for this GEO submission is available as a supplementary file on
# GEO. You only want to run this once...

sfiles <- getGEOSuppFiles("GSE126379")
fnames <- rownames(sfiles)

# There is only one supplemental file
Data = read.delim(fnames[1], header = TRUE)
head(Data)

##   Chromosome Gene.Name Gene.ID Sum.of.Exon.Length.of.Gene
## 1      chr10      A1CF    A1CF                       9529
## 2      chr10     ABCC2   ABCC2                       5051
## 3      chr10      ABI1    ABI1                       3776
## 4      chr10    ABLIM1  ABLIM1                       8303
## 5      chr10    ACADSB  ACADSB                       5941
## 6      chr10     ACBD5   ACBD5                       4066
##   Reads.Counts..BLCL_A1. RPKM..BLCL_A1. Reads.Counts..BLCL_A2. RPKM..BLCL_A2.
## 1                      2      0.0113048                      1    0.009967274
## 2                     21      0.2239352                     42    0.789760929
## 3                   1975     28.1718559                   1436   36.119869850
## 4                    786      5.0987985                   2619   29.958784090
## 5                    756      6.8539761                    543    8.680884936
## 6                    535      7.0870705                    421    9.834186588
##   Reads.Counts..BLCL_B1. RPKM..BLCL_B1. Reads.Counts..BLCL_B2. RPKM..BLCL_B2.
## 1                      1     0.01277064                      4     0.02185651
## 2                     10     0.24092549                     89     0.91744656
## 3                    563    18.14414625                   2307    31.81147083
## 4                    220     3.22439153                   4464    27.99350563
## 5                    210     4.30149944                    877     7.68613807
## 6                    200     5.98580756                    674     8.63099098
##   Reads.Counts..BLCL_C1. RPKM..BLCL_C1. Reads.Counts..BLCL_C2. RPKM..BLCL_C2.
## 1                      0      0.0000000                     11       0.122567
## 2                     13      0.2041644                     63       1.324315
## 3                    695     14.6004734                   1235      34.726663
## 4                    400      3.8215458                   2600      33.248066
## 5                    264      3.5249949                    568      10.151188
## 6                    304      5.9308963                    492      12.847712
##   Reads.Counts..BLCL_D1. RPKM..BLCL_D1. Reads.Counts..BLCL_D2. RPKM..BLCL_D2.
## 1                      2     0.02147553                     10     0.04834247
## 2                     25     0.50643522                    163     1.48657342
## 3                    847    22.95158503                   2806    34.23196323
## 4                    626     7.71437158                   5804    32.20093570
## 5                    272     4.68457946                   1065     8.25783462
## 6                    250     6.29120581                   1139    12.90424131
##   Reads.Counts..BLCL_E1. RPKM..BLCL_E1. Reads.Counts..BLCL_E2. RPKM..BLCL_E2.
## 1                      1     0.01387007                      7     0.04338079
## 2                     54     1.41300065                    130     1.51989185
## 3                    986    34.51206142                   2471    38.64447949
## 4                    177     2.81750179                   4484    31.89167049
## 5                    302     6.71851610                   1039    10.32767820
## 6                    313    10.17426241                    932    13.53615256

# Next time you can just run... fnames <-
# 'GSE126379/GSE126379_BLCL_processed.txt.gz' Data =
# read.delim(fnames[1],header=TRUE) head(Data)

Tidy up the dataset and set it up for downstream analysis

# MANIPULATE GENE ANNOTATIONS

# Genes have multiple 'identifiers' in different databases. Later on we will need
# identifiers from the ENTREZ database. To save us some pain later, lets extract
# these now using the select() function from the AnnotationDbi package. The
# ENTREZ identifiers can be found in the Homo.sapiens package.
ENTREZID <- mapIds(Homo.sapiens, keys = as.character(Data$Gene.Name), column = c("ENTREZID"), 
    keytype = "SYMBOL", multiVals = "first")

## 'select()' returned 1:many mapping between keys and columns

genes <- data.frame(ENTREZID, SYMBOL = Data$Gene.Name)

# In this file of counts, there are multiple instances where the same gene symbol
# or entrez id appears in multiple rows. This may be due to how it was initially
# generated but as we are not able to distinguish between them (and in most cases
# the count information is very similar for the duplicate rows, we will just ask
# for just one row to be kept for any given gene name)
chooseGenes <- which(!duplicated(genes$SYMBOL) & !duplicated(genes$ENTREZID))

genes <- genes[chooseGenes, ]
Data <- Data[chooseGenes, ]


# MANIPULATE GENE COUNTS

# We do not need the RPKMs just the read counts, so lets just pull out these
# columns
colnames(Data)

##  [1] "Chromosome"                 "Gene.Name"                 
##  [3] "Gene.ID"                    "Sum.of.Exon.Length.of.Gene"
##  [5] "Reads.Counts..BLCL_A1."     "RPKM..BLCL_A1."            
##  [7] "Reads.Counts..BLCL_A2."     "RPKM..BLCL_A2."            
##  [9] "Reads.Counts..BLCL_B1."     "RPKM..BLCL_B1."            
## [11] "Reads.Counts..BLCL_B2."     "RPKM..BLCL_B2."            
## [13] "Reads.Counts..BLCL_C1."     "RPKM..BLCL_C1."            
## [15] "Reads.Counts..BLCL_C2."     "RPKM..BLCL_C2."            
## [17] "Reads.Counts..BLCL_D1."     "RPKM..BLCL_D1."            
## [19] "Reads.Counts..BLCL_D2."     "RPKM..BLCL_D2."            
## [21] "Reads.Counts..BLCL_E1."     "RPKM..BLCL_E1."            
## [23] "Reads.Counts..BLCL_E2."     "RPKM..BLCL_E2."

counts <- Data[, grep("Counts", colnames(Data))]

# LCL are labelled 1, B cell labelled 2. Lets rename the columns to reflect this
colnames(counts) <- sub("Reads.Counts..BLCL_", "", colnames(counts))
colnames(counts) <- sub("1", "LCL", colnames(counts))
colnames(counts) <- sub("2", "CD19B", colnames(counts))
colnames(counts) <- sub("CD19B.", "_CD19B", colnames(counts))
colnames(counts) <- sub("LCL.", "_LCL", colnames(counts))
colnames(counts)

##  [1] "A_LCL"   "A_CD19B" "B_LCL"   "B_CD19B" "C_LCL"   "C_CD19B" "D_LCL"  
##  [8] "D_CD19B" "E_LCL"   "E_CD19B"

# make the Gene IDs the name of the rows
rownames(counts) <- genes$SYMBOL


# MANIPULATE SAMPLE ANNOTATIONS

# This is a function that will be in the next release of tidyverse.  We can use
# it to split a string on a certain pattern and then select things that are
# before or after than split
str_split_n <- function(string, pattern, n) {
    out <- str_split(string, pattern, simplify = TRUE)
    apply(out, 1, `[`, i = n)
}

# use sample names to make factors for group and individual
group <- str_split_n(colnames(counts), "_", 2) %>% factor()
individual <- str_split_n(colnames(counts), "_", 1) %>% factor()

# put this factor information into a data frame
samples <- data.frame(group, individual)
samples

##    group individual
## 1    LCL          A
## 2  CD19B          A
## 3    LCL          B
## 4  CD19B          B
## 5    LCL          C
## 6  CD19B          C
## 7    LCL          D
## 8  CD19B          D
## 9    LCL          E
## 10 CD19B          E

# MAKE DGEList

# make DGEList object to connect gene counts with sample information
BLCL <- DGEList(counts = counts, samples = samples, genes = genes)
BLCL

## An object of class "DGEList"
## $counts
##        A_LCL A_CD19B B_LCL B_CD19B C_LCL C_CD19B D_LCL D_CD19B E_LCL E_CD19B
## A1CF       2       1     1       4     0      11     2      10     1       7
## ABCC2     21      42    10      89    13      63    25     163    54     130
## ABI1    1975    1436   563    2307   695    1235   847    2806   986    2471
## ABLIM1   786    2619   220    4464   400    2600   626    5804   177    4484
## ACADSB   756     543   210     877   264     568   272    1065   302    1039
## 19920 more rows ...
## 
## $samples
##         group lib.size norm.factors individual
## A_LCL     LCL 17028144            1          A
## A_CD19B CD19B  9577596            1          A
## B_LCL     LCL  7531842            1          B
## B_CD19B CD19B 17510653            1          B
## C_LCL     LCL 11548555            1          C
## C_CD19B CD19B  8503027            1          C
## D_LCL     LCL  8927495            1          D
## D_CD19B CD19B 19627791            1          D
## E_LCL     LCL  6724158            1          E
## E_CD19B CD19B 15276274            1          E
## 
## $genes
##        ENTREZID SYMBOL
## A1CF      29974   A1CF
## ABCC2      1244  ABCC2
## ABI1      10006   ABI1
## ABLIM1     3983 ABLIM1
## ACADSB       36 ACADSB
## 19920 more rows ...

Further processing and normalisation of the data

# COUNTS NORMALIZATION

# First we calculate normalization factors
BLCL <- calcNormFactors(BLCL, method = "TMM")
# make a copy of BLCLDGEList to use in extension tab.
BLCL2 <- BLCL

# show the nomalisation factors. For this dataset the effect of TMM-normalisation
# is mild, as evident in the magnitude of the scaling factors, which are all
# relatively close to 1.
BLCL$samples$norm.factors

##  [1] 1.0140834 1.0134316 0.9123651 0.9568436 0.9213837 1.1034365 0.8930159
##  [8] 1.0526869 1.1040319 1.0563174

# Next we will keep genes with large enough counts to be considered useful.
keep <- filterByExpr(BLCL)
BLCL <- BLCL[keep, ]

# Calculate the counts per million reads for each gene for all samples
cpm <- cpm(BLCL)
# now calculate the log of these CPM values
lcpm <- cpm(BLCL, log = TRUE)

Explore data with PCA. Can you see any structure?

# Perform PCA
pca <- prcomp(t(lcpm), scale = TRUE)
# Plot results
autoplot(pca, data = samples, colour = "group", shape = "individual")

# Another method for dimension reduction is called MDS.
plotMDS(lcpm, labels = group)

plotMDS(lcpm, labels = individual)

# I also like this cool interactive version of a MDS plot which can be generated
# using Glimma
glMDSPlot(lcpm, labels = paste(group, individual, sep = "_"), groups = BLCL$samples[, 
    c("group", "individual")], launch = TRUE)

Construct a design matrix and perform variance stabilisation.

limma is a versatile package for DE analysis of microarray data. However, the count based data from RNAseq has certain stastical properties that are different to microarray based data, namely that in RNAseq count data, the variance is not independent of the mean. The authors of limma delevoped the voom function which accounts for this and modifies the RNAseq count data for use with the limma package.

# We first need to set up our design matrix, where we tell limma that we are
# interested in fitting the gene expression data against group, but also to
# include the 'individual' term, as this will account for the samples being
# paired.
design <- model.matrix(~individual + group)
design

##    (Intercept) individualB individualC individualD individualE groupLCL
## 1            1           0           0           0           0        1
## 2            1           0           0           0           0        0
## 3            1           1           0           0           0        1
## 4            1           1           0           0           0        0
## 5            1           0           1           0           0        1
## 6            1           0           1           0           0        0
## 7            1           0           0           1           0        1
## 8            1           0           0           1           0        0
## 9            1           0           0           0           1        1
## 10           1           0           0           0           1        0
## attr(,"assign")
## [1] 0 1 1 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$individual
## [1] "contr.treatment"
## 
## attr(,"contrasts")$group
## [1] "contr.treatment"

# Voom starts with the count data, transforms it into logCPM, estimates the
# mean-variance relationship, and uses this to compute appropriate
# observation-level weights. The data can then be used for linear modelling in
# the same way as microarray data. voom generates an EList object, which is
# similar to the DGEList-object
v <- voom(BLCL, design, plot = TRUE)

# The voom plot shows that whilst not strong, there is some relationship between
# variance and expression level in this data.
v

## An object of class "EList"
## $genes
##        ENTREZID SYMBOL
## ABCC2      1244  ABCC2
## ABI1      10006   ABI1
## ABLIM1     3983 ABLIM1
## ACADSB       36 ACADSB
## ACBD5     91452  ACBD5
## 12170 more rows ...
## 
## $targets
##         group lib.size norm.factors individual
## A_LCL     LCL 17267958    1.0140834          A
## A_CD19B CD19B  9706239    1.0134316          A
## B_LCL     LCL  6871789    0.9123651          B
## B_CD19B CD19B 16754956    0.9568436          B
## C_LCL     LCL 10640650    0.9213837          C
## C_CD19B CD19B  9382550    1.1034365          C
## D_LCL     LCL  7972395    0.8930159          D
## D_CD19B CD19B 20661918    1.0526869          D
## E_LCL     LCL  7423685    1.1040319          E
## E_CD19B CD19B 16136594    1.0563174          E
## 
## $E
##            A_LCL  A_CD19B     B_LCL  B_CD19B    C_LCL  C_CD19B    D_LCL
## ABCC2  0.3162391 2.130478 0.6116314 2.417300 0.343373 2.758704 1.677412
## ABI1   6.8379765 7.209430 6.3575858 7.105599 6.030392 7.040899 6.732056
## ABLIM1 5.5092773 8.076163 5.0039488 8.057767 5.234144 8.114593 6.296157
## ACADSB 5.4531706 5.807224 4.9369904 5.710739 4.635609 5.921036 5.095099
## ACBD5  4.9547171 5.440476 4.8667724 5.331159 4.838784 5.714000 4.973653
##         D_CD19B    E_LCL  E_CD19B
## ABCC2  2.984244 2.876049 3.015642
## ABI1   7.085654 7.054040 7.258907
## ABLIM1 8.134054 4.579540 8.118467
## ACADSB 5.688413 5.348656 6.009410
## ACBD5  5.785283 5.400186 5.852696
## 12170 more rows ...
## 
## $weights
##          [,1]     [,2]     [,3]      [,4]      [,5]     [,6]      [,7]     [,8]
## [1,] 2.331025 2.979675 1.652737  4.488719  2.053346 3.313986  2.324869 6.708812
## [2,] 9.358623 9.575208 9.816939  9.169044 10.018719 9.898932 10.011455 8.613684
## [3,] 9.983437 8.720541 7.722519  8.095354  9.196385 8.921615  9.370692 7.169613
## [4,] 9.908408 9.840943 7.327472 10.007666  8.542400 9.442735  7.924945 9.842946
## [5,] 9.590230 9.318834 6.969832  9.963088  8.654301 9.360818  8.021923 9.885127
##           [,9]    [,10]
## [1,]  2.802607 7.258436
## [2,] 10.004491 8.747525
## [3,]  7.572132 8.368332
## [4,]  8.307608 9.884701
## [5,]  8.318522 9.944736
## 12170 more rows ...
## 
## $design
##    (Intercept) individualB individualC individualD individualE groupLCL
## 1            1           0           0           0           0        1
## 2            1           0           0           0           0        0
## 3            1           1           0           0           0        1
## 4            1           1           0           0           0        0
## 5            1           0           1           0           0        1
## 6            1           0           1           0           0        0
## 7            1           0           0           1           0        1
## 8            1           0           0           1           0        0
## 9            1           0           0           0           1        1
## 10           1           0           0           0           1        0
## attr(,"assign")
## [1] 0 1 1 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$individual
## [1] "contr.treatment"
## 
## attr(,"contrasts")$group
## [1] "contr.treatment"

Fit linear model

# Linear modelling in limma is carried out using the 'lmFit' and 'contrasts.fit'
# functions.  These functions fit a separate model to the expression values for
# each gene. We now tell limma to fit the model specified in our design matrix
# for the voom transformed data
fit <- lmFit(v, design)

Generate moderated test statistics

# Next, empirical Bayes moderation is carried out by borrowing information across
# all genes to obtain more precise estimates of gene-wise variability.
efit <- eBayes(fit)

Explore the results

# Quick look at how many genes are differentially expressed
dt_fdr <- decideTests(efit, adjust.method = "fdr", p.value = 0.05)
summary(dt_fdr)

##        (Intercept) individualB individualC individualD individualE groupLCL
## Down           354           1           0           1          10     4220
## NotSig         947       12173       12169       12172       12156     3613
## Up           10874           1           6           2           9     4342

# The coefficient we are interested in is 'groupLCL'.  Does the number of DE
# genes change if we instead use the more stringent bonferroni correction?
dt_bonf <- decideTests(efit, adjust.method = "bonferroni", p.value = 0.05)
summary(dt_bonf)

##        (Intercept) individualB individualC individualD individualE groupLCL
## Down            34           1           0           1           5      799
## NotSig        3114       12174       12173       12173       12168    10140
## Up            9027           0           2           1           2     1236

# histogram of p value distribution for groupLCL colnames(efit)
hist(efit$p.value[, "groupLCL"])

# volcano plot of results for groupLCL
volcanoplot(efit, coef = "groupLCL", highlight = 10, names = efit$genes$SYMBOL)

# Generate a table of the genes significantly differentially expressed in LCLs v
# B cells (groupLCL)
DELCLvB <- topTable(efit, coef = "groupLCL", adjust.method = "fdr", p.value = 0.05, 
    number = Inf)

# we can write this table to a file
write.csv(DELCLvB, file = "DELCLvB.csv")

# Generate a mean-difference plot (MD-plot) for the group comparison (groupLCL,
# which is column 6 of the efit object)
useCoef <- which(colnames(efit) == "groupLCL")

plotMD(efit, column = useCoef, status = dt_fdr[, useCoef], main = colnames(efit)[useCoef], 
    )

# we can also generate an interactive MDplot using Glimma
glMDPlot(efit, coef = useCoef, status = dt_fdr[, useCoef], main = colnames(efit)[useCoef], 
    side.main = "ENTREZID", counts = lcpm, groups = group, launch = TRUE)

Create a gene expression heatmap

# Get a list of the top 40 most DE genes. We will use this to draw a heatmap
LCLvB.topgenes <- DELCLvB$SYMBOL[1:40]

# pull out the voom expression data for these 40 genes
i <- which(v$genes$SYMBOL %in% LCLvB.topgenes)
coolmap(lcpm[i, ], margin = c(8, 6), cexRow = 0.7, cexCol = 0.7)

# You could save this with the pdf() function.
pdf("heatmap.pdf", width = 12, height = 16)
coolmap(lcpm[i, ], margin = c(8, 6))
dev.off()

## png 
##   2

# You might like to look at the library pheatmap

Draw a plot of a gene of interest from the data

CD40 <- lcpm["CD40", ]
df <- data.frame(CD40, group, individual)
ggplot(df, aes(x = group, y = CD40, colour = individual)) + geom_point() + geom_line(aes(group = individual)) + 
    theme_classic()

2.2 Extension: Influenza pneumonia dataset

The following data comes from a manuscript “A distinct influenza infection signature in the blood transcriptome of patients with severe community-acquired pneumonia”.https://www.ncbi.nlm.nih.gov/pubmed/22898401 .

Diagnosis of severe influenza pneumonia remains challenging because of a lack of correlation between the presence of influenza virus and clinical status. Gene-expression profiling was conducted on the whole blood of critically ill patients to identify a gene signature that would allow clinicians to distinguish influenza infection from other causes of severe respiratory failure, such as bacterial pneumonia, and noninfective systemic inflammatory response syndrome. Samples were assayed on Illumina HT-12 gene-expression beadarrays.

Read in the data.

# Retrieve dataset from GEO. getGEO gives you a list of expressionsets, as we
# only have one, we might as well just pull out that one (if (length(gset) > 1)
# idx <- grep('GPL6947', attr(gset, 'names')) else idx <- 1 | gset <-
# gset[[idx]])

dir.create("GSE40012")

## Warning in dir.create("GSE40012"): 'GSE40012' already exists

gset <- getGEO("GSE40012", GSEMatrix = TRUE, AnnotGPL = TRUE, destdir = "GSE40012")[[1]]

# Have a look at the summary info for the ExpressionSet
gset

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 48754 features, 190 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM983403 GSM983404 ... GSM983592 (190 total)
##   varLabels: title geo_accession ... tissue:ch1 (41 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: ILMN_1343291 ILMN_1343295 ... ILMN_2416019 (48754
##     total)
##   fvarLabels: ID Gene title ... Platform_SEQUENCE (22 total)
##   fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
##   pubMedIds: 22898401 
## Annotation: GPL6947

# This dataset has many variable labels (including phenotype information). The
# phenotype data in an expressionset is stored in pData. Let's use the phenotype
# data to pull out just the day 1 samples
pheno <- pData(gset)
colnames(pData(gset))

##  [1] "title"                   "geo_accession"          
##  [3] "status"                  "submission_date"        
##  [5] "last_update_date"        "type"                   
##  [7] "channel_count"           "source_name_ch1"        
##  [9] "organism_ch1"            "characteristics_ch1"    
## [11] "characteristics_ch1.1"   "characteristics_ch1.2"  
## [13] "characteristics_ch1.3"   "characteristics_ch1.4"  
## [15] "molecule_ch1"            "extract_protocol_ch1"   
## [17] "label_ch1"               "label_protocol_ch1"     
## [19] "taxid_ch1"               "hyb_protocol"           
## [21] "scan_protocol"           "description"            
## [23] "data_processing"         "platform_id"            
## [25] "contact_name"            "contact_email"          
## [27] "contact_laboratory"      "contact_department"     
## [29] "contact_institute"       "contact_address"        
## [31] "contact_city"            "contact_state"          
## [33] "contact_zip/postal_code" "contact_country"        
## [35] "supplementary_file"      "data_row_count"         
## [37] "day:ch1"                 "gender:ch1"             
## [39] "ID:ch1"                  "sample type:ch1"        
## [41] "tissue:ch1"

D1 <- gset[, pheno$`day:ch1` == "1"]

# Phenotype data
pheno <- pData(D1)

# The expression values are stored within 'exprs'. For this dataset the
# expression values have already been normalised using Quantile normalisation,
# but they have not been log transformed. So we will log transform the expression
# values in D1.
exprs(D1) <- log2(exprs(D1))
expVal <- exprs(D1)

# Each probe of the microarray has detailed annotation. To see the specific
# annotation levels we can type fData. Currently the rownames of our expression
# data is the Illumina probe id. This is not convenient for pulling down a gene
# symbol of interest. More gene ID types can be accessed with fData
geneAnno <- fData(D1)

# I wish to shorten the sample types to say bact, flu, mixed, SIRS, HC, As they
# are they are too long to easily display on heatmaps etc.
sampleType <- pheno$`sample type:ch1`
shortNames <- sub("healthy control", "HC", sampleType)
shortNames <- sub("systemic inflammatory response", "SIRS", shortNames)
shortNames <- sub("mixed bacterial and influenza A pneumonia", "mixed", shortNames)
shortNames <- sub("influenza A pneumonia", "flu", shortNames)
shortNames <- sub("bacterial pneumonia", "bact", shortNames)
shortNames

##  [1] "SIRS"  "bact"  "bact"  "SIRS"  "bact"  "bact"  "flu"   "flu"   "bact" 
## [10] "flu"   "flu"   "SIRS"  "SIRS"  "bact"  "HC"    "HC"    "HC"    "HC"   
## [19] "HC"    "HC"    "HC"    "HC"    "HC"    "HC"    "HC"    "HC"    "HC"   
## [28] "HC"    "HC"    "HC"    "HC"    "HC"    "mixed" "SIRS"  "SIRS"  "SIRS" 
## [37] "bact"  "bact"  "bact"  "SIRS"  "flu"   "flu"   "SIRS"  "bact"  "flu"  
## [46] "mixed" "bact"  "bact"  "bact"  "SIRS"  "SIRS"  "flu"   "bact"  "bact" 
## [55] "SIRS"  "mixed" "bact"

# store this phenotype information as a data frame
pheno$group <- shortNames

Explore the data with PCA

# Lets take a first look at the dataset. We can perform a PCA analysis.
useGenes <- rowMeans(expVal) > 7
pca <- prcomp(t(expVal[useGenes, ]), scale = TRUE)
autoplot(pca, pheno, colour = "group", size = 2)

autoplot(pca, x = 3, y = 4, pheno, colour = "group", size = 2)

# PCA extension!!! Start again and perform a PCA on the dataset that has up to 5
# days for each patient

Construct a design matrix

# As we have simplified our dataset our analysis model is fairly simple, just
# comparing between the 2 groups
design <- model.matrix(~0 + group, pheno)
design

##           groupbact groupflu groupHC groupmixed groupSIRS
## GSM983405         0        0       0          0         1
## GSM983407         1        0       0          0         0
## GSM983416         1        0       0          0         0
## GSM983418         0        0       0          0         1
## GSM983423         1        0       0          0         0
## GSM983428         1        0       0          0         0
## GSM983429         0        1       0          0         0
## GSM983434         0        1       0          0         0
## GSM983439         1        0       0          0         0
## GSM983441         0        1       0          0         0
## GSM983445         0        1       0          0         0
## GSM983452         0        0       0          0         1
## GSM983455         0        0       0          0         1
## GSM983456         1        0       0          0         0
## GSM983459         0        0       1          0         0
## GSM983461         0        0       1          0         0
## GSM983463         0        0       1          0         0
## GSM983465         0        0       1          0         0
## GSM983467         0        0       1          0         0
## GSM983469         0        0       1          0         0
## GSM983471         0        0       1          0         0
## GSM983473         0        0       1          0         0
## GSM983475         0        0       1          0         0
## GSM983477         0        0       1          0         0
## GSM983479         0        0       1          0         0
## GSM983481         0        0       1          0         0
## GSM983483         0        0       1          0         0
## GSM983485         0        0       1          0         0
## GSM983487         0        0       1          0         0
## GSM983489         0        0       1          0         0
## GSM983491         0        0       1          0         0
## GSM983493         0        0       1          0         0
## GSM983497         0        0       0          1         0
## GSM983501         0        0       0          0         1
## GSM983505         0        0       0          0         1
## GSM983508         0        0       0          0         1
## GSM983510         1        0       0          0         0
## GSM983514         1        0       0          0         0
## GSM983515         1        0       0          0         0
## GSM983518         0        0       0          0         1
## GSM983524         0        1       0          0         0
## GSM983529         0        1       0          0         0
## GSM983534         0        0       0          0         1
## GSM983539         1        0       0          0         0
## GSM983544         0        1       0          0         0
## GSM983547         0        0       0          1         0
## GSM983551         1        0       0          0         0
## GSM983555         1        0       0          0         0
## GSM983557         1        0       0          0         0
## GSM983562         0        0       0          0         1
## GSM983570         0        0       0          0         1
## GSM983574         0        1       0          0         0
## GSM983577         1        0       0          0         0
## GSM983578         1        0       0          0         0
## GSM983583         0        0       0          0         1
## GSM983585         0        0       0          1         0
## GSM983590         1        0       0          0         0
## attr(,"assign")
## [1] 1 1 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

colnames(design) <- sub("group", "", colnames(design))

cont.matrix <- makeContrasts(fluvsbact = flu - bact, levels = design)
cont.matrix

##        Contrasts
## Levels  fluvsbact
##   bact         -1
##   flu           1
##   HC            0
##   mixed         0
##   SIRS          0

Fit linear model using limma

# fit the model
fit <- lmFit(D1, design)

# fit the model for the specified contrasts. For us this is only an option of 1,
# but if we had more than 2 groups here we could define what groups to compare

fit2 <- contrasts.fit(fit, cont.matrix)

# Create moderated test statistics using eBayes
efit <- eBayes(fit2)

Explore results

# Quick look at how many genes are differentially expressed
dt_fdr <- decideTests(efit, adjust.method = "fdr", p.value = 0.05)
summary(dt_fdr)

##        fluvsbact
## Down         612
## NotSig     46881
## Up          1261

# How does the number of DE change if we instead use the more stringent
# bonferroni correction?
dt_bonf <- decideTests(efit, adjust.method = "bonferroni", p.value = 0.05)
summary(dt_bonf)

##        fluvsbact
## Down          18
## NotSig     48599
## Up           137

# histogram of p value distribution for groupLCL
hist(efit$p.value[, "fluvsbact"])

# volcano plot
volcanoplot(efit, coef = "fluvsbact", highlight = 10, names = geneAnno$`Gene symbol`)

# Generate a table of the genes significantly differentially expressed in
# fluvsbact
DE_fluvsbact <- topTable(efit, coef = "fluvsbact", adjust.method = "fdr", p.value = 0.05, 
    number = Inf)
# we can write this table to a file
write.csv(DE_fluvsbact, file = "fluvsbact.csv")

# Generate a mean-difference plot (MD-plot)
colnames(efit)

## [1] "fluvsbact"

glMDPlot(efit, coef = 1, status = dt_bonf[, 1], main = colnames(efit)[1], side.main = "UNIQUEID", 
    counts = expVal, groups = pheno$group, launch = TRUE)

Create a gene expression heatmap

# Get a list of the top 40 most DE genes. We will use this to draw a heatmap
topGenes <- rownames(DE_fluvsbact)[1:40]

# pull out the expression data for these 40 genes
i <- which(rownames(D1) %in% topGenes)
coolmap(expVal[i, ], labCol = pheno$group, labRow = DE_fluvsbact$Gene.symbol, margin = c(8, 
    6), cexRow = 0.7, cexCol = 0.7)

Draw a plot of a gene of interest from the data

# Now we can pull down our gene of interest by its gene symbol.
useGene <- rownames(expVal)[which(geneAnno$`Gene symbol` == "IFI27")]
df <- data.frame(pheno, IFI27 = expVal[useGene, ])
ggplot(df, aes(x = group, y = IFI27, colour = group)) + geom_boxplot() + geom_point() + 
    theme_classic()

2.3 Extension 2 : RNA-seq 1-2-3 protocol

You may wish to browse the RNA-seq 1-2-3 protocol published in F1000Research which performs an RNAseq analysis on different cell populations harvested from mice mammary gland tissue. You could work through the code in RStudio step by step for extra practice. Note that their workflow will proceed all the way to differential gene expression testing and pathway analysis which we will not visit until weeks 8 and 9. See https://f1000research.com/articles/5-1408 .

3 Lab report

Only this section needs to be included in your report. This report will be due at the end of Week 9, Friday 11:59pm 7th May.

While I expect that you should explore the TCGA melanoma data, I am not opposed to you submiting a report on a dataset that you are incredibly engaged with.

Report purpose

The purpose of this report is to:

Provide you with an opportunity to apply the methods and skills you have been exposed to in class.
Create a document that can aid you when asking for feedback from tutors.
Practice creating reports for your mid-semester exam.

Because of this, do not feel restricted to using the dataset I have provided. Feel free to download any dataset that you find interesting and apply what you have learnt in class.

Report guidelines

There are no hard and fast guidelines to the final content of your submitted lab reports. For this lab you will be assessed on your ability to generate statistical questions, explore these with graphical summaries and interpret your findings. Your report will also need to be well-presented:

Think about how your report might be structured (eg. Summary, Introduction, Results, Conclusion).
Do your figures have captions?
Are your figures legible?
Are you using sub-headings effectively?

It is expected that your report will construct and communicate an interesting story in 4 - 6 paragraphs (ish). To do this, you should be a ‘bad’ scientist and explore the data until you find something that you think is interesting, or, can use to address the marking criteria. When preparing your report always think “is your report something that you would be proud to show your friends?”, “would your family be interested in the conclusions you made?” and “would they find it easy to read?”

Marking criteria

0.5 mark - Exploratory data analysis, try a PCA plot.
0.5 mark - Perform differential expression analysis.
0.5 mark - Perform pathway analysis, what are the most significant pathways?
0.5 mark - Comment – Make a reasonable effort to comment and interpret your findings.
0.5 mark - Presentation – Rmarkdown, no extra R output, use of headings.

Lab instructions

3.1 TCGA Melanoma

There are at least 200 forms of cancer, and many more subtypes. Each of these is caused by errors in DNA that cause cells to grow uncontrolled. Identifying the changes in each cancer’s complete set of DNA - its genome - and understanding how such changes interact to drive the disease will lay the foundation for improving cancer prevention, early detection and treatment.

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), has generated comprehensive, multi-dimensional maps of the key genomic changes in 33 types of cancer. The TCGA dataset, 2.5 petabytes of data describing tumor tissue and matched normal tissues from more than 11,000 patients, is publically available and has been used widely by the research community. The data have contributed to more than a thousand studies of cancer by independent researchers and to the TCGA research network publications.

This data can be downloaded from the data portal in TCGA (https://portal.gdc.cancer.gov/) but is in an aweful format. Thankfully, the team at Recount2 https://jhubiostatistics.shinyapps.io/recount/ have reprocessed all of the TCGA data. In the following, I have code to download the TCGA melanoma data. Many of these code lines only need to be run once.

3.2 Questions?

Read in the data and experiment with normalization.
Can you see any structure in a PCA?
Is this structure explained by any variables in the clinical data?

3.3 Read files that I processed

Let’s start by reading in the data.

### Install recount from Bioconductor, do this once only!!!!
### install.packages('BiocManager') BiocManager::install('recount')

library(recount)
library(dplyr)
### Browse the vignetets for a quick description of how to use the package
### browseVignettes('recount')

### Download the RangedSummarizedExperiment object at the gene level for TCGA
### melanoma data.  url <-
### 'http://duffel.rail.bio/recount/v2/TCGA/rse_gene_skin.Rdata' View the url for
### the file by printing the object url url Download the file download.file(url =
### url, destfile = 'rse_gene_skin.Rdata', method = 'auto', mode = 'wb')

### Load the data
load("rse_gene_skin.Rdata")

### rse_gene is a RangesSummarizedExperiment
rse_gene

### Gene counts
counts <- assay(rse_gene)

### Gene identifiers
genes <- rowData(rse_gene)

### Phenotype data for the samples
samples <- as.data.frame(colData(rse_gene))

### There are 864 variables in samples, 593 are completely missing
samples %>% is.na() %>% colSums() %>% table()

AMED3002 Lab 8

Differential gene expression