Aims: 
 Perform an over-representation test 
 Perform a pathway analysis


1 Preparation for lab

Please make an effort to complete your own report by typing your own code and (most importantly) expressing your own conclusions and interpretations. Try to use your expanding knowledge of R to modify this example code in places where you think it can be altered/improved. Ultimately, this workflow will go from obtaining data from GEO through to pathway analysis, however we will complete the workflow in 3 stages. In week 7 we will focus on obtaining the data and getting through to PCA analysis. In week 8 we will perform differential gene expression analysis and some visualisations, and in week 9 we will get to the pathway analysis. The workflow uses popular bioconductor packages and is adapted from the published RNAseq 1-2-3 workflow (https://f1000research.com/articles/5-1408).

2 Data exercises

2.1 B cells and EBV infected B cells (LCLs)

The following data comes from a manuscript “Evidence from genome wide association studies implicates reduced control of Epstein-Barr virus infection in multiple sclerosis susceptibility” (https://www.ncbi.nlm.nih.gov/pubmed/31039804). There were multiple goals for the RNAseq component of project:

  • Identify the gene expression changes that occur when B cells are infected with EBV and transformed into LCLs.

  • Determine which pathways are over-represented in the differentially expressed genes.

  • Assess if there are more MS risk genes then expected by chance in the differentially expressed genes.

Use this dataset as an opportunity to explore the behaviour of a small RNA-Seq experiment. In this lab we will start by just processing and exploring the data.

IMPORTANT: The code for week 9 continues on from the code generated in weeks 7 and 8. The new steps introduced this week start from step 11.

1A. Install / load required R packages

library(tidyverse)
library(ggfortify)
library(RColorBrewer)



# If any packages of these required are not installed, they can be installed by
# first installing bioconductor: if (!requireNamespace('BiocManager', quietly =
# TRUE)) install.packages('BiocManager') then using
# BiocManager::install('PACKAGENAME')

# load packages to be used in the analysis
library(GEOquery)
library(org.Hs.eg.db)
library(limma)
library(edgeR)
library(Glimma)
library(Homo.sapiens)
library(gplots)
library(dplyr)

1B. Read in the data.

# The count data for this GEO submission is available as a supplementary file
# on GEO. You only want to run this once...

# sfiles <- getGEOSuppFiles('GSE126379') fnames <- rownames(sfiles)

# There is only one supplemental file Data = read.delim(fnames[1],header=TRUE)
# head(Data)

# run this line if you get a VROOM error
Sys.setenv(VROOM_CONNECTION_SIZE = 131072 * 2)

# Next time you can just run...
fnames <- "GSE126379/GSE126379_BLCL_processed.txt.gz"
Data = read.delim(fnames[1], header = TRUE)
head(Data)
##   Chromosome Gene.Name Gene.ID Sum.of.Exon.Length.of.Gene
## 1      chr10      A1CF    A1CF                       9529
## 2      chr10     ABCC2   ABCC2                       5051
## 3      chr10      ABI1    ABI1                       3776
## 4      chr10    ABLIM1  ABLIM1                       8303
## 5      chr10    ACADSB  ACADSB                       5941
## 6      chr10     ACBD5   ACBD5                       4066
##   Reads.Counts..BLCL_A1. RPKM..BLCL_A1. Reads.Counts..BLCL_A2. RPKM..BLCL_A2.
## 1                      2      0.0113048                      1    0.009967274
## 2                     21      0.2239352                     42    0.789760929
## 3                   1975     28.1718559                   1436   36.119869850
## 4                    786      5.0987985                   2619   29.958784090
## 5                    756      6.8539761                    543    8.680884936
## 6                    535      7.0870705                    421    9.834186588
##   Reads.Counts..BLCL_B1. RPKM..BLCL_B1. Reads.Counts..BLCL_B2. RPKM..BLCL_B2.
## 1                      1     0.01277064                      4     0.02185651
## 2                     10     0.24092549                     89     0.91744656
## 3                    563    18.14414625                   2307    31.81147083
## 4                    220     3.22439153                   4464    27.99350563
## 5                    210     4.30149944                    877     7.68613807
## 6                    200     5.98580756                    674     8.63099098
##   Reads.Counts..BLCL_C1. RPKM..BLCL_C1. Reads.Counts..BLCL_C2. RPKM..BLCL_C2.
## 1                      0      0.0000000                     11       0.122567
## 2                     13      0.2041644                     63       1.324315
## 3                    695     14.6004734                   1235      34.726663
## 4                    400      3.8215458                   2600      33.248066
## 5                    264      3.5249949                    568      10.151188
## 6                    304      5.9308963                    492      12.847712
##   Reads.Counts..BLCL_D1. RPKM..BLCL_D1. Reads.Counts..BLCL_D2. RPKM..BLCL_D2.
## 1                      2     0.02147553                     10     0.04834247
## 2                     25     0.50643522                    163     1.48657342
## 3                    847    22.95158503                   2806    34.23196323
## 4                    626     7.71437158                   5804    32.20093570
## 5                    272     4.68457946                   1065     8.25783462
## 6                    250     6.29120581                   1139    12.90424131
##   Reads.Counts..BLCL_E1. RPKM..BLCL_E1. Reads.Counts..BLCL_E2. RPKM..BLCL_E2.
## 1                      1     0.01387007                      7     0.04338079
## 2                     54     1.41300065                    130     1.51989185
## 3                    986    34.51206142                   2471    38.64447949
## 4                    177     2.81750179                   4484    31.89167049
## 5                    302     6.71851610                   1039    10.32767820
## 6                    313    10.17426241                    932    13.53615256
  1. Tidy up the dataset and set it up for downstream analysis
# MANIPULATE GENE ANNOTATIONS

# Genes have multiple 'identifiers' in different databases. Later on we will
# need identifiers from the ENTREZ database. To save us some pain later, lets
# extract these now using the select() function from the AnnotationDbi package.
# The ENTREZ identifiers can be found in the Homo.sapiens package.
ENTREZID <- mapIds(Homo.sapiens, keys = as.character(Data$Gene.Name), column = c("ENTREZID"),
    keytype = "SYMBOL", multiVals = "first")
## 'select()' returned 1:many mapping between keys and columns
genes <- data.frame(ENTREZID, SYMBOL = Data$Gene.Name)

# In this file of counts, there are multiple instances where the same gene
# symbol or entrez id appears in multiple rows. This may be due to how it was
# initially generated but as we are not able to distinguish between them (and
# in most cases the count information is very similar for the duplicate rows,
# we will just ask for just one row to be kept for any given gene name)
chooseGenes <- which(!duplicated(genes$SYMBOL) & !duplicated(genes$ENTREZID))

genes <- genes[chooseGenes, ]
Data <- Data[chooseGenes, ]


# MANIPULATE GENE COUNTS

# We do not need the RPKMs just the read counts, so lets just pull out these
# columns colnames(Data)
counts <- Data[, grep("Counts", colnames(Data))]

# LCL are labelled 1, B cell labelled 2. Lets rename the columns to reflect
# this
colnames(counts) <- sub("Reads.Counts..BLCL_", "", colnames(counts))
colnames(counts) <- sub("1", "LCL", colnames(counts))
colnames(counts) <- sub("2", "CD19B", colnames(counts))
colnames(counts) <- sub("CD19B.", "_CD19B", colnames(counts))
colnames(counts) <- sub("LCL.", "_LCL", colnames(counts))
colnames(counts)
##  [1] "A_LCL"   "A_CD19B" "B_LCL"   "B_CD19B" "C_LCL"   "C_CD19B" "D_LCL"  
##  [8] "D_CD19B" "E_LCL"   "E_CD19B"
# make the Gene IDs the name of the rows
rownames(counts) <- genes$SYMBOL


# MANIPULATE SAMPLE ANNOTATIONS

# This is a function that will be in the next release of tidyverse.  We can use
# it to split a string on a certain pattern and then select things that are
# before or after than split
str_split_n <- function(string, pattern, n) {
    out <- str_split(string, pattern, simplify = TRUE)
    apply(out, 1, `[`, i = n)
}

# use sample names to make factors for group and individual
group <- str_split_n(colnames(counts), "_", 2) %>%
    factor()
individual <- str_split_n(colnames(counts), "_", 1) %>%
    factor()

# put this factor information into a data frame
samples <- data.frame(group, individual)
samples
##    group individual
## 1    LCL          A
## 2  CD19B          A
## 3    LCL          B
## 4  CD19B          B
## 5    LCL          C
## 6  CD19B          C
## 7    LCL          D
## 8  CD19B          D
## 9    LCL          E
## 10 CD19B          E
# MAKE DGEList

# make DGEList object to connect gene counts with sample information
BLCL <- DGEList(counts = counts, samples = samples, genes = genes)
# BLCL
  1. Further processing and normalisation of the data
# COUNTS NORMALIZATION

# First we calculate normalization factors
BLCL <- calcNormFactors(BLCL, method = "TMM")
# make a copy of BLCLDGEList to use in extension tab.
BLCL2 <- BLCL

# show the nomalisation factors. For this dataset the effect of
# TMM-normalisation is mild, as evident in the magnitude of the scaling
# factors, which are all relatively close to 1.
BLCL$samples$norm.factors
##  [1] 1.0160957 1.0116719 0.9145425 0.9550419 0.9229546 1.0997006 0.8932697
##  [8] 1.0528178 1.1069495 1.0540979
# Next we will keep genes with large enough counts to be considered useful.
keep <- filterByExpr(BLCL)
BLCL <- BLCL[keep, ]

# Calculate the counts per million reads for each gene for all samples
cpm <- cpm(BLCL)
# now calculate the log of these CPM values
lcpm <- cpm(BLCL, log = TRUE)
  1. Explore data with PCA. Can you see any structure?
# Perform PCA
pca <- prcomp(t(lcpm), scale = TRUE)
# Plot results
autoplot(pca, data = samples, colour = "group", shape = "individual")

# Another method for dimension reduction is called MDS.
plotMDS(lcpm, labels = group)

plotMDS(lcpm, labels = individual)

# I also like this cool interactive version of a MDS plot which can be
# generated using Glimma
glMDSPlot(lcpm, labels = paste(group, individual, sep = "_"), groups = BLCL$samples[,
    c("group", "individual")], launch = TRUE)
  1. Construct a design matrix and perform variance stabilisation

limma is a versatile package for DE analysis of microarray data. However, the count based data from RNAseq has certain stastical properties that are different to microarray based data, namely that in RNAseq count data, the variance is not independent of the mean. The authors of limma delevoped the voom function which accounts for this and modifies the RNAseq count data for use with the limma package.

# We first need to set up our design matrix, where we tell limma that we are
# interested in fitting the gene expression data against group, but also to
# include the 'individual' term, as this will account for the samples being
# paired.
design <- model.matrix(~individual + group)
design
##    (Intercept) individualB individualC individualD individualE groupLCL
## 1            1           0           0           0           0        1
## 2            1           0           0           0           0        0
## 3            1           1           0           0           0        1
## 4            1           1           0           0           0        0
## 5            1           0           1           0           0        1
## 6            1           0           1           0           0        0
## 7            1           0           0           1           0        1
## 8            1           0           0           1           0        0
## 9            1           0           0           0           1        1
## 10           1           0           0           0           1        0
## attr(,"assign")
## [1] 0 1 1 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$individual
## [1] "contr.treatment"
## 
## attr(,"contrasts")$group
## [1] "contr.treatment"
# Voom starts with the count data, transforms it into logCPM, estimates the
# mean-variance relationship, and uses this to compute appropriate
# observation-level weights. The data can then be used for linear modelling in
# the same way as microarray data. voom generates an EList object, which is
# similar to the DGEList-object
v <- voom(BLCL, design, plot = TRUE)

# The voom plot shows that whilst not strong, there is some relationship
# between variance and expression level in this data. v
  1. Fit linear model
# Linear modelling in limma is carried out using the 'lmFit' and
# 'contrasts.fit' functions.  These functions fit a separate model to the
# expression values for each gene. We now tell limma to fit the model specified
# in our design matrix for the voom transformed data
fit <- lmFit(v, design)
  1. Generate moderated test statistics
# Next, empirical Bayes moderation is carried out by borrowing information
# across all genes to obtain more precise estimates of gene-wise variability.
efit <- eBayes(fit)
  1. Explore the results
# Quick look at how many genes are differentially expressed
dt_fdr <- decideTests(efit, adjust.method = "fdr", p.value = 0.05)
summary(dt_fdr)
##        (Intercept) individualB individualC individualD individualE groupLCL
## Down           346           1           0           1          10     4140
## NotSig         920       11945       11941       11944       11928     3551
## Up           10681           1           6           2           9     4256
# The coefficient we are interested in is 'groupLCL'.  Does the number of DE
# genes change if we instead use the more stringent bonferroni correction?
dt_bonf <- decideTests(efit, adjust.method = "bonferroni", p.value = 0.05)
summary(dt_bonf)
##        (Intercept) individualB individualC individualD individualE groupLCL
## Down            31           1           0           1           5      787
## NotSig        3026       11946       11945       11945       11940     9950
## Up            8890           0           2           1           2     1210
# histogram of p value distribution for groupLCL colnames(efit)
hist(efit$p.value[, "groupLCL"])

# volcano plot of results for groupLCL
volcanoplot(efit, coef = "groupLCL", highlight = 10, names = efit$genes$SYMBOL)

# Generate a table of the genes significantly differentially expressed in LCLs
# v B cells (groupLCL)
DELCLvB <- topTable(efit, coef = "groupLCL", adjust.method = "fdr", p.value = 0.05,
    number = Inf)

# we can write this table to a file
write.csv(DELCLvB, file = "DELCLvB.csv")

# Generate a mean-difference plot (MD-plot) for the group comparison (groupLCL,
# which is column 6 of the efit object)
useCoef <- which(colnames(efit) == "groupLCL")

plotMD(efit, column = useCoef, status = dt_fdr[, useCoef], main = colnames(efit)[useCoef],
    )

# we can also generate an interactive MDplot using Glimma
glMDPlot(efit, coef = useCoef, status = dt_fdr[, useCoef], main = colnames(efit)[useCoef],
    side.main = "ENTREZID", counts = lcpm, groups = group, launch = TRUE)
  1. Create a gene expression heatmap
# Get a list of the top 40 most DE genes. We will use this to draw a heatmap
LCLvB.topgenes <- DELCLvB$SYMBOL[1:40]

# pull out the voom expression data for these 40 genes
i <- which(v$genes$SYMBOL %in% LCLvB.topgenes)
coolmap(lcpm[i, ], margin = c(8, 6), cexRow = 0.7, cexCol = 0.7)

# You could save this with the pdf() function.
pdf("heatmap.pdf", width = 12, height = 16)
coolmap(lcpm[i, ], margin = c(8, 6))
dev.off()
## png 
##   2
# You might like to look at the library pheatmap
  1. Draw a plot of a gene of interest from the data
CD40 <- lcpm["CD40", ]
df <- data.frame(CD40, group, individual)
ggplot(df, aes(x = group, y = CD40, colour = individual)) + geom_point() + geom_line(aes(group = individual)) +
    theme_classic()

  1. Get a more conservative list of DE genes

Instead of using a bonferroni corrected threshold p value of 0.05, let’s reduce this to 0.01. We can also restrict the list of genes to those that are at least 4 fold different between groups (lfc = 2).

strictDELCLvB <- topTable(efit, coef = "groupLCL", adjust.method = "bonferroni",
    lfc = 2, p.value = 0.01, number = Inf)

nrow(strictDELCLvB)
## [1] 1218
# Sort this conservative list of DE genes by Fold Change
sortedDELCLvB <- strictDELCLvB[order(-strictDELCLvB$logFC), ]
head(strictDELCLvB)
##         ENTREZID  SYMBOL     logFC  AveExpr         t      P.Value    adj.P.Val
## SRGN        5552    SRGN  5.462839 7.354266  25.54384 2.665280e-11 3.184210e-07
## SLAMF7     57823  SLAMF7  4.943118 6.946977  24.83017 3.644755e-11 4.354388e-07
## MAP3K1      4214  MAP3K1 -5.173016 6.666877 -24.37871 4.462914e-11 5.331843e-07
## BACH2      60468   BACH2 -6.830356 5.060456 -24.91973 3.502721e-11 4.184701e-07
## RASGRP2    10235 RASGRP2 -7.621070 5.353357 -24.36306 4.494649e-11 5.369757e-07
## MTHFD2     10797  MTHFD2  4.737028 5.693648  23.01590 8.413477e-11 1.005158e-06
##                B
## SRGN    16.36065
## SLAMF7  16.09516
## MAP3K1  15.83045
## BACH2   15.54908
## RASGRP2 15.30227
## MTHFD2  15.28731
# How could we sort this list by Expression level?

# You may wish to interrogate 1 or 2 of these genes on www.genecards.org or
# www.proteinatlas.org.
  1. Test if MS risk genes are over-represented in the differentially expressed genes using a Chi-square test
# Get our list of MS risk genes. This has been collated from the latest MS GWAS
# paper: https://www.ncbi.nlm.nih.gov/pubmed/31604244
MSRG <- read.csv("https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/MSRG.txt",
    sep = "")


# make our 2x2 contingency table.

# In the LCL dataset the ENTREZ IDs are labelled Gene.ID
isDE_LCLvB <- efit$genes$ENTREZID %in% strictDELCLvB$ENTREZID
isInPathway_MSRG = efit$genes$ENTREZID %in% MSRG$EntrezID
MSRGtab <- table(isDE_LCLvB, isInPathway_MSRG)
MSRGtab
##           isInPathway_MSRG
## isDE_LCLvB FALSE  TRUE
##      FALSE 10581   148
##      TRUE   1178    40
# Perform a Chi-square test for this table
chisq.test(MSRGtab)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  MSRGtab
## X-squared = 24.404, df = 1, p-value = 7.811e-07
# Check if we meet the assumptions for a Chi-square test
test <- chisq.test(MSRGtab)
test$expected >= 5
##           isInPathway_MSRG
## isDE_LCLvB FALSE TRUE
##      FALSE  TRUE TRUE
##      TRUE   TRUE TRUE
# Extension. There may be instances where we the expected values for some of
# the cells in our 2x2 table will be less than 5. In these cases it is more
# appropriate to a a Fishers Exact Test/hypergeometric test. We can run a
# Fishers Exact Test using our contingency table:

# fisher.test(MSRGtab)

## Extension. Consruct a heatmap showing the expression of the Ms risk genes in
## this dataset.
  1. Perform Gene Set using camera

The camera function is part of the limma package and can be used to test whether a set of genes is highly ranked relative to other genes in terms of differential expression, accounting for inter-gene correlation

# We will perform our camera test against a list of gene sets from the Broad
# Institute’s MSigDB c2 collection. The C2 gene sets have been curated from
# online databases, publications and domain experts.

# You will only have to download this once.
load(url("http://bioinf.wehi.edu.au/software/MSigDB/human_c2_v5p2.rdata"))

# We need to make indicies to match these gene sets to the genes in our
# expression data (v)
idx <- ids2indices(Hs.c2, id = v$genes$ENTREZID)
cam.LCLB <- camera(v, idx, design, contrast = "groupLCL")
head(cam.LCLB, 5)
##                                               NGenes Direction       PValue
## SOTIRIOU_BREAST_CANCER_GRADE_1_VS_3_UP           136        Up 1.138367e-25
## ROSTY_CERVICAL_CANCER_PROLIFERATION_CLUSTER      128        Up 1.225991e-21
## KOBAYASHI_EGFR_SIGNALING_24HR_DN                 224        Up 1.561198e-17
## WHITEFORD_PEDIATRIC_CANCER_MARKERS               103        Up 2.531442e-17
## GRAHAM_NORMAL_QUIESCENT_VS_NORMAL_DIVIDING_DN     80        Up 3.053613e-17
##                                                        FDR
## SOTIRIOU_BREAST_CANCER_GRADE_1_VS_3_UP        5.378783e-22
## ROSTY_CERVICAL_CANCER_PROLIFERATION_CLUSTER   2.896405e-18
## KOBAYASHI_EGFR_SIGNALING_24HR_DN              2.458887e-14
## WHITEFORD_PEDIATRIC_CANCER_MARKERS            2.885664e-14
## GRAHAM_NORMAL_QUIESCENT_VS_NORMAL_DIVIDING_DN 2.885664e-14
# The MSigBD c2 list contains some quite specific gene sets that may not be
# relevant to our data. We can filter our camera result to just include
# pathways coming from KEGG/REACTOME/BIOCARTA. These databases contain a
# smaller set of defined/canonical pathways.

cam.LCLB$pathway <- row.names(cam.LCLB)
trimmedCam_LCLvB <- dplyr::filter(cam.LCLB, grepl("KEGG|REACTOME|BIOCARTA", pathway))
head(trimmedCam_LCLvB)
##                                                                                                                            NGenes
## REACTOME_RESPIRATORY_ELECTRON_TRANSPORT                                                                                        64
## REACTOME_RESPIRATORY_ELECTRON_TRANSPORT_ATP_SYNTHESIS_BY_CHEMIOSMOTIC_COUPLING_AND_HEAT_PRODUCTION_BY_UNCOUPLING_PROTEINS_     66
## REACTOME_SYNTHESIS_OF_DNA                                                                                                      88
## REACTOME_M_G1_TRANSITION                                                                                                       76
## REACTOME_G1_S_TRANSITION                                                                                                      104
## KEGG_PROTEASOME                                                                                                                41
##                                                                                                                            Direction
## REACTOME_RESPIRATORY_ELECTRON_TRANSPORT                                                                                           Up
## REACTOME_RESPIRATORY_ELECTRON_TRANSPORT_ATP_SYNTHESIS_BY_CHEMIOSMOTIC_COUPLING_AND_HEAT_PRODUCTION_BY_UNCOUPLING_PROTEINS_        Up
## REACTOME_SYNTHESIS_OF_DNA                                                                                                         Up
## REACTOME_M_G1_TRANSITION                                                                                                          Up
## REACTOME_G1_S_TRANSITION                                                                                                          Up
## KEGG_PROTEASOME                                                                                                                   Up
##                                                                                                                                  PValue
## REACTOME_RESPIRATORY_ELECTRON_TRANSPORT                                                                                    1.867375e-15
## REACTOME_RESPIRATORY_ELECTRON_TRANSPORT_ATP_SYNTHESIS_BY_CHEMIOSMOTIC_COUPLING_AND_HEAT_PRODUCTION_BY_UNCOUPLING_PROTEINS_ 1.757448e-14
## REACTOME_SYNTHESIS_OF_DNA                                                                                                  2.529374e-13
## REACTOME_M_G1_TRANSITION                                                                                                   3.530032e-13
## REACTOME_G1_S_TRANSITION                                                                                                   3.814443e-13
## KEGG_PROTEASOME                                                                                                            5.118317e-13
##                                                                                                                                     FDR
## REACTOME_RESPIRATORY_ELECTRON_TRANSPORT                                                                                    6.852908e-13
## REACTOME_RESPIRATORY_ELECTRON_TRANSPORT_ATP_SYNTHESIS_BY_CHEMIOSMOTIC_COUPLING_AND_HEAT_PRODUCTION_BY_UNCOUPLING_PROTEINS_ 4.151970e-12
## REACTOME_SYNTHESIS_OF_DNA                                                                                                  4.167768e-11
## REACTOME_M_G1_TRANSITION                                                                                                   5.559801e-11
## REACTOME_G1_S_TRANSITION                                                                                                   5.639484e-11
## KEGG_PROTEASOME                                                                                                            7.328499e-11
##                                                                                                                                                                                                                                               pathway
## REACTOME_RESPIRATORY_ELECTRON_TRANSPORT                                                                                                                                                                       REACTOME_RESPIRATORY_ELECTRON_TRANSPORT
## REACTOME_RESPIRATORY_ELECTRON_TRANSPORT_ATP_SYNTHESIS_BY_CHEMIOSMOTIC_COUPLING_AND_HEAT_PRODUCTION_BY_UNCOUPLING_PROTEINS_ REACTOME_RESPIRATORY_ELECTRON_TRANSPORT_ATP_SYNTHESIS_BY_CHEMIOSMOTIC_COUPLING_AND_HEAT_PRODUCTION_BY_UNCOUPLING_PROTEINS_
## REACTOME_SYNTHESIS_OF_DNA                                                                                                                                                                                                   REACTOME_SYNTHESIS_OF_DNA
## REACTOME_M_G1_TRANSITION                                                                                                                                                                                                     REACTOME_M_G1_TRANSITION
## REACTOME_G1_S_TRANSITION                                                                                                                                                                                                     REACTOME_G1_S_TRANSITION
## KEGG_PROTEASOME                                                                                                                                                                                                                       KEGG_PROTEASOME
# We can also use this filtering to pull down ownly the pathways that are
# enriched for downregulated genes (same can be done for the pathways enriched
# in upregulated genes).

downCam_LCLvB <- dplyr::filter(trimmedCam_LCLvB, grepl("Down", Direction))
head(downCam_LCLvB)
##                                              NGenes Direction       PValue
## REACTOME_GENERIC_TRANSCRIPTION_PATHWAY          300      Down 1.024071e-05
## KEGG_NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION     77      Down 5.445581e-04
## BIOCARTA_MYOSIN_PATHWAY                          23      Down 8.260046e-04
## REACTOME_VOLTAGE_GATED_POTASSIUM_CHANNELS        16      Down 2.748302e-03
## KEGG_LINOLEIC_ACID_METABOLISM                    14      Down 2.764361e-03
## REACTOME_INTERACTION_BETWEEN_L1_AND_ANKYRINS     12      Down 4.503994e-03
##                                                       FDR
## REACTOME_GENERIC_TRANSCRIPTION_PATHWAY       0.0001819074
## KEGG_NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION 0.0052726168
## BIOCARTA_MYOSIN_PATHWAY                      0.0073500413
## REACTOME_VOLTAGE_GATED_POTASSIUM_CHANNELS    0.0194106552
## KEGG_LINOLEIC_ACID_METABOLISM                0.0194949319
## REACTOME_INTERACTION_BETWEEN_L1_AND_ANKYRINS 0.0285656008
##                                                                                   pathway
## REACTOME_GENERIC_TRANSCRIPTION_PATHWAY             REACTOME_GENERIC_TRANSCRIPTION_PATHWAY
## KEGG_NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION KEGG_NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION
## BIOCARTA_MYOSIN_PATHWAY                                           BIOCARTA_MYOSIN_PATHWAY
## REACTOME_VOLTAGE_GATED_POTASSIUM_CHANNELS       REACTOME_VOLTAGE_GATED_POTASSIUM_CHANNELS
## KEGG_LINOLEIC_ACID_METABOLISM                               KEGG_LINOLEIC_ACID_METABOLISM
## REACTOME_INTERACTION_BETWEEN_L1_AND_ANKYRINS REACTOME_INTERACTION_BETWEEN_L1_AND_ANKYRINS
# We can make a barcode plot to show where genes in a pathway of interest sit
# in our gene expression data sorted by differential expression.

barcodeplot(efit$t[, "groupLCL"], index = idx$KOBAYASHI_EGFR_SIGNALING_24HR_DN, main = "EGFR signalling genes in LCLvB")

# WEHI also hosts the hallmark MSigDB gene sets which are selected to have
# well-defined biological states or processes. The hallmark lists can be
# downloaded by
# load(url('http://bioinf.wehi.edu.au/software/MSigDB/human_H_v5p2.rdata')).
# They also the host Gene Ontology (GO) gene sets:
# load(url('http://bioinf.wehi.edu.au/software/MSigDB/human_c5_v5p2.rdata'))

## Extension. See if any of the hallmark/GO gene sets are enriched in our
## dataset.

2.2 Extension: Influenza pneumonia dataset

The following data comes from a manuscript “A distinct influenza infection signature in the blood transcriptome of patients with severe community-acquired pneumonia”.https://www.ncbi.nlm.nih.gov/pubmed/22898401 .

Diagnosis of severe influenza pneumonia remains challenging because of a lack of correlation between the presence of influenza virus and clinical status. Gene-expression profiling was conducted on the whole blood of critically ill patients to identify a gene signature that would allow clinicians to distinguish influenza infection from other causes of severe respiratory failure, such as bacterial pneumonia, and noninfective systemic inflammatory response syndrome. Samples were assayed on Illumina HT-12 gene-expression beadarrays.

  1. Read in the data.
# Retrieve dataset from GEO. getGEO gives you a list of expressionsets, as we
# only have one, we might as well just pull out that one (if (length(gset) > 1)
# idx <- grep('GPL6947', attr(gset, 'names')) else idx <- 1 | gset <-
# gset[[idx]])

dir.create("GSE40012")
## Warning in dir.create("GSE40012"): 'GSE40012' already exists
gset <- getGEO("GSE40012", GSEMatrix = TRUE, AnnotGPL = TRUE, destdir = "GSE40012")[[1]]

# Have a look at the summary info for the ExpressionSet gset

# This dataset has many variable labels (including phenotype information). The
# phenotype data in an expressionset is stored in pData. Let's use the
# phenotype data to pull out just the day 1 samples
pheno <- pData(gset)
# colnames(pData(gset))
D1 <- gset[, pheno$`day:ch1` == "1"]


# Phenotype data
pheno <- pData(D1)

# The expression values are stored within 'exprs'. For this dataset the
# expression values have already been normalised using Quantile normalisation,
# but they have not been log transformed. So we will log transform the
# expression values in D1.
exprs(D1) <- log2(exprs(D1))
expVal <- exprs(D1)



# Each probe of the microarray has detailed annotation. To see the specific
# annotation levels we can type fData. Currently the rownames of our expression
# data is the Illumina probe id. This is not convenient for pulling down a gene
# symbol of interest. More gene ID types can be accessed with fData
geneAnno <- fData(D1)
# I wish to shorten the sample types to say bact, flu, mixed, SIRS, HC, As they
# are they are too long to easily display on heatmaps etc.
sampleType <- pheno$`sample type:ch1`
shortNames <- sub("healthy control", "HC", sampleType)
shortNames <- sub("systemic inflammatory response", "SIRS", shortNames)
shortNames <- sub("mixed bacterial and influenza A pneumonia", "mixed", shortNames)
shortNames <- sub("influenza A pneumonia", "flu", shortNames)
shortNames <- sub("bacterial pneumonia", "bact", shortNames)
shortNames
##  [1] "SIRS"  "bact"  "bact"  "SIRS"  "bact"  "bact"  "flu"   "flu"   "bact" 
## [10] "flu"   "flu"   "SIRS"  "SIRS"  "bact"  "HC"    "HC"    "HC"    "HC"   
## [19] "HC"    "HC"    "HC"    "HC"    "HC"    "HC"    "HC"    "HC"    "HC"   
## [28] "HC"    "HC"    "HC"    "HC"    "HC"    "mixed" "SIRS"  "SIRS"  "SIRS" 
## [37] "bact"  "bact"  "bact"  "SIRS"  "flu"   "flu"   "SIRS"  "bact"  "flu"  
## [46] "mixed" "bact"  "bact"  "bact"  "SIRS"  "SIRS"  "flu"   "bact"  "bact" 
## [55] "SIRS"  "mixed" "bact"
# store this phenotype information as a data frame
pheno$group <- shortNames
  1. Explore the data with PCA
# Lets take a first look at the dataset. We can perform a PCA analysis.
useGenes <- rowMeans(expVal) > 7
pca <- prcomp(t(expVal[useGenes, ]), scale = TRUE)
autoplot(pca, pheno, colour = "group", size = 2)

autoplot(pca, x = 3, y = 4, pheno, colour = "group", size = 2)

# PCA extension!!! Start again and perform a PCA on the dataset that has up to
# 5 days for each patient
  1. Construct a design matrix
# As we have simplified our dataset our analysis model is fairly simple, just
# comparing between the 2 groups
design <- model.matrix(~0 + group, pheno)
design
##           groupbact groupflu groupHC groupmixed groupSIRS
## GSM983405         0        0       0          0         1
## GSM983407         1        0       0          0         0
## GSM983416         1        0       0          0         0
## GSM983418         0        0       0          0         1
## GSM983423         1        0       0          0         0
## GSM983428         1        0       0          0         0
## GSM983429         0        1       0          0         0
## GSM983434         0        1       0          0         0
## GSM983439         1        0       0          0         0
## GSM983441         0        1       0          0         0
## GSM983445         0        1       0          0         0
## GSM983452         0        0       0          0         1
## GSM983455         0        0       0          0         1
## GSM983456         1        0       0          0         0
## GSM983459         0        0       1          0         0
## GSM983461         0        0       1          0         0
## GSM983463         0        0       1          0         0
## GSM983465         0        0       1          0         0
## GSM983467         0        0       1          0         0
## GSM983469         0        0       1          0         0
## GSM983471         0        0       1          0         0
## GSM983473         0        0       1          0         0
## GSM983475         0        0       1          0         0
## GSM983477         0        0       1          0         0
## GSM983479         0        0       1          0         0
## GSM983481         0        0       1          0         0
## GSM983483         0        0       1          0         0
## GSM983485         0        0       1          0         0
## GSM983487         0        0       1          0         0
## GSM983489         0        0       1          0         0
## GSM983491         0        0       1          0         0
## GSM983493         0        0       1          0         0
## GSM983497         0        0       0          1         0
## GSM983501         0        0       0          0         1
## GSM983505         0        0       0          0         1
## GSM983508         0        0       0          0         1
## GSM983510         1        0       0          0         0
## GSM983514         1        0       0          0         0
## GSM983515         1        0       0          0         0
## GSM983518         0        0       0          0         1
## GSM983524         0        1       0          0         0
## GSM983529         0        1       0          0         0
## GSM983534         0        0       0          0         1
## GSM983539         1        0       0          0         0
## GSM983544         0        1       0          0         0
## GSM983547         0        0       0          1         0
## GSM983551         1        0       0          0         0
## GSM983555         1        0       0          0         0
## GSM983557         1        0       0          0         0
## GSM983562         0        0       0          0         1
## GSM983570         0        0       0          0         1
## GSM983574         0        1       0          0         0
## GSM983577         1        0       0          0         0
## GSM983578         1        0       0          0         0
## GSM983583         0        0       0          0         1
## GSM983585         0        0       0          1         0
## GSM983590         1        0       0          0         0
## attr(,"assign")
## [1] 1 1 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
colnames(design) <- sub("group", "", colnames(design))

cont.matrix <- makeContrasts(fluvsbact = flu - bact, levels = design)
cont.matrix
##        Contrasts
## Levels  fluvsbact
##   bact         -1
##   flu           1
##   HC            0
##   mixed         0
##   SIRS          0
  1. Fit linear model using limma
# fit the model
fit <- lmFit(D1, design)

# fit the model for the specified contrasts. For us this is only an option of
# 1, but if we had more than 2 groups here we could define what groups to
# compare

fit2 <- contrasts.fit(fit, cont.matrix)

# Create moderated test statistics using eBayes
efit <- eBayes(fit2)
  1. Explore results
# Quick look at how many genes are differentially expressed
dt_fdr <- decideTests(efit, adjust.method = "fdr", p.value = 0.05)
summary(dt_fdr)
##        fluvsbact
## Down         612
## NotSig     46881
## Up          1261
# How does the number of DE change if we instead use the more stringent
# bonferroni correction?
dt_bonf <- decideTests(efit, adjust.method = "bonferroni", p.value = 0.05)
summary(dt_bonf)
##        fluvsbact
## Down          18
## NotSig     48599
## Up           137
# histogram of p value distribution for groupLCL
hist(efit$p.value[, "fluvsbact"])

# volcano plot
volcanoplot(efit, coef = "fluvsbact", highlight = 10, names = geneAnno$`Gene symbol`)

# Generate a table of the genes significantly differentially expressed in
# fluvsbact
DE_fluvsbact <- topTable(efit, coef = "fluvsbact", adjust.method = "fdr", p.value = 0.05,
    number = Inf)
# we can write this table to a file
write.csv(DE_fluvsbact, file = "fluvsbact.csv")

# Generate a mean-difference plot (MD-plot)
colnames(efit)
## [1] "fluvsbact"
glMDPlot(efit, coef = 1, status = dt_bonf[, 1], main = colnames(efit)[1], side.main = "UNIQUEID",
    counts = expVal, groups = pheno$group, launch = TRUE)
  1. Create a gene expression heatmap
# Get a list of the top 40 most DE genes. We will use this to draw a heatmap
topGenes <- rownames(DE_fluvsbact)[1:40]

# pull out the expression data for these 40 genes
i <- which(rownames(D1) %in% topGenes)
coolmap(expVal[i, ], labCol = pheno$group, labRow = DE_fluvsbact$Gene.symbol, margin = c(8,
    6), cexRow = 0.7, cexCol = 0.7)

  1. Draw a plot of a gene of interest from the data
# Now we can pull down our gene of interest by its gene symbol.
useGene <- rownames(expVal)[which(geneAnno$`Gene symbol` == "IFI27")]
df <- data.frame(pheno, IFI27 = expVal[useGene, ])
ggplot(df, aes(x = group, y = IFI27, colour = group)) + geom_boxplot() + geom_point() +
    theme_classic()

  1. Test if a publised set of inteferon stimulated genes (ISGs) are over-represented in the differentially expressed genes using a chi-square test.

This list of ISGs was obtained from the following publication. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3409588/ . This is an example of how we can test for over-representation of a curated gene list/list of DE genes from a related study in our data.

ISG <- read.csv("https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/ISG.txt",
    sep = "\t")
# This gene list only has gene symbols. Lets add ENTREZIDs to this list using
# mapIds.
ISG$ENTREZID <- mapIds(Homo.sapiens, keys = as.character(ISG$Gene.symbol), column = ("ENTREZID"),
    keytype = "SYMBOL", multiVals = "first")
## 'select()' returned 1:1 mapping between keys and columns
# make our 2x2 contingency table

# In the flu dataset the ENTREZ IDs are labelled Gene.ID
isDE <- efit$genes$ID %in% DE_fluvsbact$ID
isInPathway_ISG = efit$genes$`Gene ID` %in% ISG$ENTREZID
ISGtab <- table(isDE, isInPathway_ISG)
ISGtab
##        isInPathway_ISG
## isDE    FALSE  TRUE
##   FALSE 46454   427
##   TRUE   1799    74
# Perform a Chi-square test for this table
chisq.test(ISGtab)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  ISGtab
## X-squared = 160.69, df = 1, p-value < 2.2e-16
# Check we meet the assumptions for a Chi-square test
test <- chisq.test(ISGtab)
test$expected >= 5
##        isInPathway_ISG
## isDE    FALSE TRUE
##   FALSE  TRUE TRUE
##   TRUE   TRUE TRUE
# Extension. There may be instances where we the expected values for some of
# the cells in our 2x2 table will be less than 5. In these cases it is more
# appropriate to a a Fishers Exact Test/hypergeometric test. We can run a
# Fishers Exact Test using our contingency table:

# fisher.test(ISGtab)
  1. Perform Gene Set using camera

The camera function is part of the limma package and can be used to test whether a set of genes is highly ranked relative to ther genes in terms of differential expression, accounting for inter-gene correlation

# We will perform our camera test against a list of gene sets from the Broad
# Institute’s MSigDB c2 collection. The C2 gene sets have been curated from
# online databases, publications and domain experts.

# You will only have to download this once.
load(url("http://bioinf.wehi.edu.au/software/MSigDB/human_c2_v5p2.rdata"))

# We need to make indicies to match these gene sets to the genes in our
# expression data (v)
idx <- ids2indices(Hs.c2, id = efit$genes$`Gene ID`)
cam.fluVsBact <- camera(exprs(D1), idx, design, contrast = cont.matrix[, 1])
head(cam.fluVsBact, 5)
##                                               NGenes Direction       PValue
## GRAHAM_NORMAL_QUIESCENT_VS_NORMAL_DIVIDING_DN    129        Up 1.884114e-58
## SOTIRIOU_BREAST_CANCER_GRADE_1_VS_3_UP           211        Up 4.763989e-53
## KANG_DOXORUBICIN_RESISTANCE_UP                    78        Up 1.613472e-52
## ROSTY_CERVICAL_CANCER_PROLIFERATION_CLUSTER      206        Up 1.043450e-49
## CROONQUIST_IL6_DEPRIVATION_DN                    154        Up 1.414259e-45
##                                                        FDR
## GRAHAM_NORMAL_QUIESCENT_VS_NORMAL_DIVIDING_DN 8.909977e-55
## SOTIRIOU_BREAST_CANCER_GRADE_1_VS_3_UP        1.126445e-49
## KANG_DOXORUBICIN_RESISTANCE_UP                2.543370e-49
## ROSTY_CERVICAL_CANCER_PROLIFERATION_CLUSTER   1.233619e-46
## CROONQUIST_IL6_DEPRIVATION_DN                 1.337607e-42
# The MSigBD c2 list contains some quite specific gene sets that may not be
# relevant to our data. We can filter our camera result to just include
# pathways coming from KEGG/REACTOME/BIOCARTA. These databases contain a smaler
# set of defined/canonical pathways.

cam.fluVsBact$pathway <- row.names(cam.fluVsBact)
trimmedCam_fluVsBact <- dplyr::filter(cam.fluVsBact, grepl("KEGG|REACTOME|BIOCARTA",
    pathway))
head(trimmedCam_fluVsBact)
##                                                              NGenes Direction
## REACTOME_G2_M_CHECKPOINTS                                        72        Up
## KEGG_DNA_REPLICATION                                             53        Up
## REACTOME_DNA_STRAND_ELONGATION                                   48        Up
## REACTOME_ACTIVATION_OF_ATR_IN_RESPONSE_TO_REPLICATION_STRESS     61        Up
## REACTOME_ACTIVATION_OF_THE_PRE_REPLICATIVE_COMPLEX               44        Up
## REACTOME_DNA_REPLICATION                                        288        Up
##                                                                    PValue
## REACTOME_G2_M_CHECKPOINTS                                    1.027524e-25
## KEGG_DNA_REPLICATION                                         2.768964e-25
## REACTOME_DNA_STRAND_ELONGATION                               6.905666e-24
## REACTOME_ACTIVATION_OF_ATR_IN_RESPONSE_TO_REPLICATION_STRESS 2.180679e-23
## REACTOME_ACTIVATION_OF_THE_PRE_REPLICATIVE_COMPLEX           5.456161e-23
## REACTOME_DNA_REPLICATION                                     3.531181e-21
##                                                                       FDR
## REACTOME_G2_M_CHECKPOINTS                                    1.104355e-23
## KEGG_DNA_REPLICATION                                         2.909874e-23
## REACTOME_DNA_STRAND_ELONGATION                               6.047573e-22
## REACTOME_ACTIVATION_OF_ATR_IN_RESPONSE_TO_REPLICATION_STRESS 1.747870e-21
## REACTOME_ACTIVATION_OF_THE_PRE_REPLICATIVE_COMPLEX           4.229867e-21
## REACTOME_DNA_REPLICATION                                     2.420138e-19
##                                                                                                                   pathway
## REACTOME_G2_M_CHECKPOINTS                                                                       REACTOME_G2_M_CHECKPOINTS
## KEGG_DNA_REPLICATION                                                                                 KEGG_DNA_REPLICATION
## REACTOME_DNA_STRAND_ELONGATION                                                             REACTOME_DNA_STRAND_ELONGATION
## REACTOME_ACTIVATION_OF_ATR_IN_RESPONSE_TO_REPLICATION_STRESS REACTOME_ACTIVATION_OF_ATR_IN_RESPONSE_TO_REPLICATION_STRESS
## REACTOME_ACTIVATION_OF_THE_PRE_REPLICATIVE_COMPLEX                     REACTOME_ACTIVATION_OF_THE_PRE_REPLICATIVE_COMPLEX
## REACTOME_DNA_REPLICATION                                                                         REACTOME_DNA_REPLICATION
# We can also use this filtering to pull down ownly the pathways that are
# enriched for downregulated genes (same can be done for the pathways enriched
# in upregulated genes).

downCam_fluVsBact <- dplyr::filter(trimmedCam_fluVsBact, grepl("Down", Direction))
head(downCam_fluVsBact)
##                                                                           NGenes
## BIOCARTA_HCMV_PATHWAY                                                         39
## REACTOME_IL_3_5_AND_GM_CSF_SIGNALING                                          76
## REACTOME_REGULATION_OF_SIGNALING_BY_CBL                                       32
## REACTOME_RECYCLING_PATHWAY_OF_L1                                              50
## REACTOME_IL_2_SIGNALING                                                       72
## REACTOME_LATENT_INFECTION_OF_HOMO_SAPIENS_WITH_MYCOBACTERIUM_TUBERCULOSIS     46
##                                                                           Direction
## BIOCARTA_HCMV_PATHWAY                                                          Down
## REACTOME_IL_3_5_AND_GM_CSF_SIGNALING                                           Down
## REACTOME_REGULATION_OF_SIGNALING_BY_CBL                                        Down
## REACTOME_RECYCLING_PATHWAY_OF_L1                                               Down
## REACTOME_IL_2_SIGNALING                                                        Down
## REACTOME_LATENT_INFECTION_OF_HOMO_SAPIENS_WITH_MYCOBACTERIUM_TUBERCULOSIS      Down
##                                                                                 PValue
## BIOCARTA_HCMV_PATHWAY                                                     3.664485e-05
## REACTOME_IL_3_5_AND_GM_CSF_SIGNALING                                      4.312942e-05
## REACTOME_REGULATION_OF_SIGNALING_BY_CBL                                   8.121147e-05
## REACTOME_RECYCLING_PATHWAY_OF_L1                                          2.523134e-04
## REACTOME_IL_2_SIGNALING                                                   4.757641e-04
## REACTOME_LATENT_INFECTION_OF_HOMO_SAPIENS_WITH_MYCOBACTERIUM_TUBERCULOSIS 5.316315e-04
##                                                                                    FDR
## BIOCARTA_HCMV_PATHWAY                                                     0.0003239131
## REACTOME_IL_3_5_AND_GM_CSF_SIGNALING                                      0.0003749247
## REACTOME_REGULATION_OF_SIGNALING_BY_CBL                                   0.0006773352
## REACTOME_RECYCLING_PATHWAY_OF_L1                                          0.0018356771
## REACTOME_IL_2_SIGNALING                                                   0.0033038009
## REACTOME_LATENT_INFECTION_OF_HOMO_SAPIENS_WITH_MYCOBACTERIUM_TUBERCULOSIS 0.0036383288
##                                                                                                                                             pathway
## BIOCARTA_HCMV_PATHWAY                                                                                                         BIOCARTA_HCMV_PATHWAY
## REACTOME_IL_3_5_AND_GM_CSF_SIGNALING                                                                           REACTOME_IL_3_5_AND_GM_CSF_SIGNALING
## REACTOME_REGULATION_OF_SIGNALING_BY_CBL                                                                     REACTOME_REGULATION_OF_SIGNALING_BY_CBL
## REACTOME_RECYCLING_PATHWAY_OF_L1                                                                                   REACTOME_RECYCLING_PATHWAY_OF_L1
## REACTOME_IL_2_SIGNALING                                                                                                     REACTOME_IL_2_SIGNALING
## REACTOME_LATENT_INFECTION_OF_HOMO_SAPIENS_WITH_MYCOBACTERIUM_TUBERCULOSIS REACTOME_LATENT_INFECTION_OF_HOMO_SAPIENS_WITH_MYCOBACTERIUM_TUBERCULOSIS
# We can make a barcode plot to show where genes in a pathway of interest sit
# in our gene expression data sorted by differential expression.

barcodeplot(efit$t[, "fluvsbact"], index = idx$KOBAYASHI_EGFR_SIGNALING_24HR_DN,
    main = "EGFR signalling genes in fluvsbact")

# WEHI also hosts the hallmark MSigDB gene sets which are selected to have
# well-defined biological states or processes. The hallmark lists can be
# downloaded by
# load(url('http://bioinf.wehi.edu.au/software/MSigDB/human_H_v5p2.rdata')).
# They also the host Gene Ontology (GO) gene sets:
# load(url('http://bioinf.wehi.edu.au/software/MSigDB/human_c5_v5p2.rdata'))

## Extension. See if any of the hallmark/GO gene sets are enriched in our
## dataset.

2.3 Extension 2 : RNA-seq 1-2-3 protocol

You may wish to browse the RNA-seq 1-2-3 protocol published in F1000Research which performs an RNAseq analysis on different cell populations harvested from mice mammary gland tissue. You could work through the code in RStudio step by step for extra practice. Note that their workflow will proceed all the way to differential gene expression testing and pathway analysis which we will not visit until weeks 8 and 9. See https://f1000research.com/articles/5-1408 .

3 Lab report

Only this section needs to be included in your report. This report will be due at the end of Week 9, Friday 11:59pm 26th April.

While I expect that you should explore the TCGA melanoma data, I am not opposed to you submiting a report on a dataset that you are incredibly engaged with.

Report purpose

The purpose of this report is to:

  1. Provide you with an opportunity to apply the methods and skills you have been exposed to in class.

  2. Create a document that can aid you when asking for feedback from tutors.

  3. Practice creating reports for your mid-semester exam.

Because of this, do not feel restricted to using the dataset I have provided. Feel free to download any dataset that you find interesting and apply what you have learnt in class.

Report guidelines

There are no hard and fast guidelines to the final content of your submitted lab reports. For this lab you will be assessed on your ability to generate statistical questions, explore these with graphical summaries and interpret your findings. Your report will also need to be well-presented:

  • Think about how your report might be structured (eg. Summary, Introduction, Results, Conclusion).
  • Do your figures have captions?
  • Are your figures legible?
  • Are you using sub-headings effectively?

It is expected that your report will construct and communicate an interesting story in 4 - 6 paragraphs (ish). To do this, you should be a ‘bad’ scientist and explore the data until you find something that you think is interesting, or, can use to address the marking criteria. When preparing your report always think “is your report something that you would be proud to show your friends?”, “would your family be interested in the conclusions you made?” and “would they find it easy to read?”

Marking criteria

  • 1 mark - Exploratory data analysis, try a PCA plot.
  • 1 mark - Perform differential expression analysis.
  • 1 mark - Perform pathway analysis, what are the most significant pathways?
  • 0.5 mark - Comment – Make a reasonable effort to comment and interpret your findings.
  • 0.5 mark - Presentation – Rmarkdown, no extra R output, use of headings.

Lab instructions

3.1 TCGA Melanoma

There are at least 200 forms of cancer, and many more subtypes. Each of these is caused by errors in DNA that cause cells to grow uncontrolled. Identifying the changes in each cancer’s complete set of DNA - its genome - and understanding how such changes interact to drive the disease will lay the foundation for improving cancer prevention, early detection and treatment.

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), has generated comprehensive, multi-dimensional maps of the key genomic changes in 33 types of cancer. The TCGA dataset, 2.5 petabytes of data describing tumor tissue and matched normal tissues from more than 11,000 patients, is publically available and has been used widely by the research community. The data have contributed to more than a thousand studies of cancer by independent researchers and to the TCGA research network publications.

This data can be downloaded from the data portal in TCGA (https://portal.gdc.cancer.gov/) but is in an aweful format. Thankfully, the team at Recount2 https://jhubiostatistics.shinyapps.io/recount/ have reprocessed all of the TCGA data. In the following, I have code to download the TCGA melanoma data. Many of these code lines only need to be run once.

3.2 Questions?

  • Read in the data and experiment with normalization.

  • Can you see any structure in a PCA?

  • Is this structure explained by any variables in the clinical data?

3.3 Read files that I processed

Let’s start by reading in the data.

### Install recount from Bioconductor, do this once only!!!!
### install.packages('BiocManager') BiocManager::install('recount')

library(recount)
library(dplyr)
### Browse the vignetets for a quick description of how to use the package
### browseVignettes('recount')

### Download the RangedSummarizedExperiment object at the gene level for TCGA
### melanoma data.  url <-
### 'http://duffel.rail.bio/recount/v2/TCGA/rse_gene_skin.Rdata' View the url
### for the file by printing the object url url Download the file
### download.file(url = url, destfile = 'rse_gene_skin.Rdata', method = 'auto',
### mode = 'wb')

### Load the data
load("rse_gene_skin.Rdata")

### rse_gene is a RangesSummarizedExperiment
rse_gene

### Gene counts
counts <- assay(rse_gene)

### Gene identifiers
genes <- rowData(rse_gene)

### Phenotype data for the samples
samples <- as.data.frame(colData(rse_gene))

### There are 864 variables in samples, 593 are completely missing
samples %>%
    is.na() %>%
    colSums() %>%
    table()