Aims:
Perform an over-representation test
Perform a pathway analysis
Please make an effort to complete your own report by typing your own code and (most importantly) expressing your own conclusions and interpretations. Try to use your expanding knowledge of R to modify this example code in places where you think it can be altered/improved. Ultimately, this workflow will go from obtaining data from GEO through to pathway analysis, however we will complete the workflow in 3 stages. In week 7 we will focus on obtaining the data and getting through to PCA analysis. In week 8 we will perfom differential gene expression analysis and some visualisations, and in week 9 we will get to the pathway analysis. The workflow uses popular bioconductor packages and is adapted from the published RNAseq 1-2-3 workflow (https://f1000research.com/articles/5-1408).
The following data comes from a manuscript “Evidence from genome wide association studies implicates reduced control of Epstein-Barr virus infection in multiple sclerosis susceptibility” (https://www.ncbi.nlm.nih.gov/pubmed/31039804). There were multiple goals for the RNAseq component of project:
Identify the gene expression changes that occur when B cells are infected with EBV and transformed into LCLs.
Determine which pathways are over-represented in the differentially expressed genes.
Assess if there are more MS risk genes then expected by chance in the differentially expressed genes.
Use this dataset as an opportunity to explore the behaviour of a small RNA-Seq experiment. In this lab we will start by just processing and exploring the data.
IMPORTANT: The code for week 9 continues on from the code generated in weeks 7 and 8. The new steps introduced this week start from step 11.
1A. Install / load required R packages
library(tidyverse)
library(ggfortify)
library(RColorBrewer)
# If any packages of these required are not installed, they can be installed by
# first installing bioconductor: if (!requireNamespace('BiocManager', quietly =
# TRUE)) install.packages('BiocManager') then using
# BiocManager::install('PACKAGENAME')
# load packages to be used in the analysis
library(GEOquery)
library(org.Hs.eg.db)
library(limma)
library(edgeR)
library(Glimma)
library(Homo.sapiens)
library(gplots)
library(dplyr)
1B. Read in the data.
# The count data for this GEO submission is available as a supplementary file on
# GEO. You only want to run this once...
# sfiles <- getGEOSuppFiles('GSE126379') fnames <- rownames(sfiles)
# There is only one supplemental file Data = read.delim(fnames[1],header=TRUE)
# head(Data)
# Next time you can just run...
fnames <- "GSE126379/GSE126379_BLCL_processed.txt.gz"
Data = read.delim(fnames[1], header = TRUE)
head(Data)
## Chromosome Gene.Name Gene.ID Sum.of.Exon.Length.of.Gene
## 1 chr10 A1CF A1CF 9529
## 2 chr10 ABCC2 ABCC2 5051
## 3 chr10 ABI1 ABI1 3776
## 4 chr10 ABLIM1 ABLIM1 8303
## 5 chr10 ACADSB ACADSB 5941
## 6 chr10 ACBD5 ACBD5 4066
## Reads.Counts..BLCL_A1. RPKM..BLCL_A1. Reads.Counts..BLCL_A2. RPKM..BLCL_A2.
## 1 2 0.0113048 1 0.009967274
## 2 21 0.2239352 42 0.789760929
## 3 1975 28.1718559 1436 36.119869850
## 4 786 5.0987985 2619 29.958784090
## 5 756 6.8539761 543 8.680884936
## 6 535 7.0870705 421 9.834186588
## Reads.Counts..BLCL_B1. RPKM..BLCL_B1. Reads.Counts..BLCL_B2. RPKM..BLCL_B2.
## 1 1 0.01277064 4 0.02185651
## 2 10 0.24092549 89 0.91744656
## 3 563 18.14414625 2307 31.81147083
## 4 220 3.22439153 4464 27.99350563
## 5 210 4.30149944 877 7.68613807
## 6 200 5.98580756 674 8.63099098
## Reads.Counts..BLCL_C1. RPKM..BLCL_C1. Reads.Counts..BLCL_C2. RPKM..BLCL_C2.
## 1 0 0.0000000 11 0.122567
## 2 13 0.2041644 63 1.324315
## 3 695 14.6004734 1235 34.726663
## 4 400 3.8215458 2600 33.248066
## 5 264 3.5249949 568 10.151188
## 6 304 5.9308963 492 12.847712
## Reads.Counts..BLCL_D1. RPKM..BLCL_D1. Reads.Counts..BLCL_D2. RPKM..BLCL_D2.
## 1 2 0.02147553 10 0.04834247
## 2 25 0.50643522 163 1.48657342
## 3 847 22.95158503 2806 34.23196323
## 4 626 7.71437158 5804 32.20093570
## 5 272 4.68457946 1065 8.25783462
## 6 250 6.29120581 1139 12.90424131
## Reads.Counts..BLCL_E1. RPKM..BLCL_E1. Reads.Counts..BLCL_E2. RPKM..BLCL_E2.
## 1 1 0.01387007 7 0.04338079
## 2 54 1.41300065 130 1.51989185
## 3 986 34.51206142 2471 38.64447949
## 4 177 2.81750179 4484 31.89167049
## 5 302 6.71851610 1039 10.32767820
## 6 313 10.17426241 932 13.53615256
# MANIPULATE GENE ANNOTATIONS
# Genes have multiple 'identifiers' in different databases. Later on we will need
# identifiers from the ENTREZ database. To save us some pain later, lets extract
# these now using the select() function from the AnnotationDbi package. The
# ENTREZ identifiers can be found in the Homo.sapiens package.
ENTREZID <- mapIds(Homo.sapiens, keys = as.character(Data$Gene.Name), column = c("ENTREZID"),
keytype = "SYMBOL", multiVals = "first")
## 'select()' returned 1:many mapping between keys and columns
genes <- data.frame(ENTREZID, SYMBOL = Data$Gene.Name)
# In this file of counts, there are multiple instances where the same gene symbol
# or entrez id appears in multiple rows. This may be due to how it was initially
# generated but as we are not able to distinguish between them (and in most cases
# the count information is very similar for the duplicate rows, we will just ask
# for just one row to be kept for any given gene name)
chooseGenes <- which(!duplicated(genes$SYMBOL) & !duplicated(genes$ENTREZID))
genes <- genes[chooseGenes, ]
Data <- Data[chooseGenes, ]
# MANIPULATE GENE COUNTS
# We do not need the RPKMs just the read counts, so lets just pull out these
# columns colnames(Data)
counts <- Data[, grep("Counts", colnames(Data))]
# LCL are labelled 1, B cell labelled 2. Lets rename the columns to reflect this
colnames(counts) <- sub("Reads.Counts..BLCL_", "", colnames(counts))
colnames(counts) <- sub("1", "LCL", colnames(counts))
colnames(counts) <- sub("2", "CD19B", colnames(counts))
colnames(counts) <- sub("CD19B.", "_CD19B", colnames(counts))
colnames(counts) <- sub("LCL.", "_LCL", colnames(counts))
colnames(counts)
## [1] "A_LCL" "A_CD19B" "B_LCL" "B_CD19B" "C_LCL" "C_CD19B" "D_LCL"
## [8] "D_CD19B" "E_LCL" "E_CD19B"
# make the Gene IDs the name of the rows
rownames(counts) <- genes$SYMBOL
# MANIPULATE SAMPLE ANNOTATIONS
# This is a function that will be in the next release of tidyverse. We can use
# it to split a string on a certain pattern and then select things that are
# before or after than split
str_split_n <- function(string, pattern, n) {
out <- str_split(string, pattern, simplify = TRUE)
apply(out, 1, `[`, i = n)
}
# use sample names to make factors for group and individual
group <- str_split_n(colnames(counts), "_", 2) %>% factor()
individual <- str_split_n(colnames(counts), "_", 1) %>% factor()
# put this factor information into a data frame
samples <- data.frame(group, individual)
samples
## group individual
## 1 LCL A
## 2 CD19B A
## 3 LCL B
## 4 CD19B B
## 5 LCL C
## 6 CD19B C
## 7 LCL D
## 8 CD19B D
## 9 LCL E
## 10 CD19B E
# MAKE DGEList
# make DGEList object to connect gene counts with sample information
BLCL <- DGEList(counts = counts, samples = samples, genes = genes)
# BLCL
# COUNTS NORMALIZATION
# First we calculate normalization factors
BLCL <- calcNormFactors(BLCL, method = "TMM")
# make a copy of BLCLDGEList to use in extension tab.
BLCL2 <- BLCL
# show the nomalisation factors. For this dataset the effect of TMM-normalisation
# is mild, as evident in the magnitude of the scaling factors, which are all
# relatively close to 1.
BLCL$samples$norm.factors
## [1] 1.0140834 1.0134316 0.9123651 0.9568436 0.9213837 1.1034365 0.8930159
## [8] 1.0526869 1.1040319 1.0563174
# Next we will keep genes with large enough counts to be considered useful.
keep <- filterByExpr(BLCL)
BLCL <- BLCL[keep, ]
# Calculate the counts per million reads for each gene for all samples
cpm <- cpm(BLCL)
# now calculate the log of these CPM values
lcpm <- cpm(BLCL, log = TRUE)
# Perform PCA
pca <- prcomp(t(lcpm), scale = TRUE)
# Plot results
autoplot(pca, data = samples, colour = "group", shape = "individual")
# Another method for dimension reduction is called MDS.
plotMDS(lcpm, labels = group)
plotMDS(lcpm, labels = individual)
# I also like this cool interactive version of a MDS plot which can be generated
# using Glimma
glMDSPlot(lcpm, labels = paste(group, individual, sep = "_"), groups = BLCL$samples[,
c("group", "individual")], launch = TRUE)
limma
is a versatile package for DE analysis of microarray data. However, the count based data from RNAseq has certain stastical properties that are different to microarray based data, namely that in RNAseq count data, the variance is not independent of the mean. The authors of limma
delevoped the voom
function which accounts for this and modifies the RNAseq count data for use with the limma
package.
# We first need to set up our design matrix, where we tell limma that we are
# interested in fitting the gene expression data against group, but also to
# include the 'individual' term, as this will account for the samples being
# paired.
design <- model.matrix(~individual + group)
design
## (Intercept) individualB individualC individualD individualE groupLCL
## 1 1 0 0 0 0 1
## 2 1 0 0 0 0 0
## 3 1 1 0 0 0 1
## 4 1 1 0 0 0 0
## 5 1 0 1 0 0 1
## 6 1 0 1 0 0 0
## 7 1 0 0 1 0 1
## 8 1 0 0 1 0 0
## 9 1 0 0 0 1 1
## 10 1 0 0 0 1 0
## attr(,"assign")
## [1] 0 1 1 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$individual
## [1] "contr.treatment"
##
## attr(,"contrasts")$group
## [1] "contr.treatment"
# Voom starts with the count data, transforms it into logCPM, estimates the
# mean-variance relationship, and uses this to compute appropriate
# observation-level weights. The data can then be used for linear modelling in
# the same way as microarray data. voom generates an EList object, which is
# similar to the DGEList-object
v <- voom(BLCL, design, plot = TRUE)
# The voom plot shows that whilst not strong, there is some relationship between
# variance and expression level in this data. v
# Linear modelling in limma is carried out using the 'lmFit' and 'contrasts.fit'
# functions. These functions fit a separate model to the expression values for
# each gene. We now tell limma to fit the model specified in our design matrix
# for the voom transformed data
fit <- lmFit(v, design)
# Next, empirical Bayes moderation is carried out by borrowing information across
# all genes to obtain more precise estimates of gene-wise variability.
efit <- eBayes(fit)
# Quick look at how many genes are differentially expressed
dt_fdr <- decideTests(efit, adjust.method = "fdr", p.value = 0.05)
summary(dt_fdr)
## (Intercept) individualB individualC individualD individualE groupLCL
## Down 354 1 0 1 10 4220
## NotSig 947 12173 12169 12172 12156 3613
## Up 10874 1 6 2 9 4342
# The coefficient we are interested in is 'groupLCL'. Does the number of DE
# genes change if we instead use the more stringent bonferroni correction?
dt_bonf <- decideTests(efit, adjust.method = "bonferroni", p.value = 0.05)
summary(dt_bonf)
## (Intercept) individualB individualC individualD individualE groupLCL
## Down 34 1 0 1 5 799
## NotSig 3114 12174 12173 12173 12168 10140
## Up 9027 0 2 1 2 1236
# histogram of p value distribution for groupLCL colnames(efit)
hist(efit$p.value[, "groupLCL"])
# volcano plot of results for groupLCL
volcanoplot(efit, coef = "groupLCL", highlight = 10, names = efit$genes$SYMBOL)
# Generate a table of the genes significantly differentially expressed in LCLs v
# B cells (groupLCL)
DELCLvB <- topTable(efit, coef = "groupLCL", adjust.method = "fdr", p.value = 0.05,
number = Inf)
# we can write this table to a file
write.csv(DELCLvB, file = "DELCLvB.csv")
# Generate a mean-difference plot (MD-plot) for the group comparison (groupLCL,
# which is column 6 of the efit object)
useCoef <- which(colnames(efit) == "groupLCL")
plotMD(efit, column = useCoef, status = dt_fdr[, useCoef], main = colnames(efit)[useCoef],
)
# we can also generate an interactive MDplot using Glimma
glMDPlot(efit, coef = useCoef, status = dt_fdr[, useCoef], main = colnames(efit)[useCoef],
side.main = "ENTREZID", counts = lcpm, groups = group, launch = TRUE)
# Get a list of the top 40 most DE genes. We will use this to draw a heatmap
LCLvB.topgenes <- DELCLvB$SYMBOL[1:40]
# pull out the voom expression data for these 40 genes
i <- which(v$genes$SYMBOL %in% LCLvB.topgenes)
coolmap(lcpm[i, ], margin = c(8, 6), cexRow = 0.7, cexCol = 0.7)
# You could save this with the pdf() function.
pdf("heatmap.pdf", width = 12, height = 16)
coolmap(lcpm[i, ], margin = c(8, 6))
dev.off()
## png
## 2
# You might like to look at the library pheatmap
CD40 <- lcpm["CD40", ]
df <- data.frame(CD40, group, individual)
ggplot(df, aes(x = group, y = CD40, colour = individual)) + geom_point() + geom_line(aes(group = individual)) +
theme_classic()
Instead of using a bonferroni corrected threshold p value of 0.05, let’s reduce this to 0.01. We can also restrict the list of genes to those that are at least 4 fold different between groups (lfc = 2).
strictDELCLvB <- topTable(efit, coef = "groupLCL", adjust.method = "bonferroni",
lfc = 2, p.value = 0.01, number = Inf)
nrow(strictDELCLvB)
## [1] 1234
# Sort this conservative list of DE genes by Fold Change
sortedDELCLvB <- strictDELCLvB[order(-strictDELCLvB$logFC), ]
head(strictDELCLvB)
## ENTREZID SYMBOL logFC AveExpr t P.Value adj.P.Val
## SRGN 5552 SRGN 5.470268 7.342061 25.53817 2.614729e-11 3.183433e-07
## SLAMF7 57823 SLAMF7 4.950439 6.934772 24.87386 3.499513e-11 4.260657e-07
## MAP3K1 4214 MAP3K1 -5.166142 6.654672 -24.39221 4.343572e-11 5.288299e-07
## BACH2 60468 BACH2 -6.822951 5.048251 -24.99075 3.322775e-11 4.045479e-07
## MTHFD2 10797 MTHFD2 4.743271 5.681443 23.11323 7.870151e-11 9.581909e-07
## RASGRP2 10235 RASGRP2 -7.613796 5.341152 -24.40690 4.314768e-11 5.253230e-07
## B
## SRGN 16.37782
## SLAMF7 16.13188
## MAP3K1 15.85631
## BACH2 15.60256
## MTHFD2 15.34969
## RASGRP2 15.34634
# How could we sort this list by Expression level?
# You may wish to interrogate 1 or 2 of these genes on www.genecards.org or
# www.proteinatlas.org.
# Get our list of MS risk genes. This has been collated from the latest MS GWAS
# paper: https://www.ncbi.nlm.nih.gov/pubmed/31604244
MSRG <- read.csv("https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/MSRG.txt",
sep = "")
# make our 2x2 contingency table.
# In the LCL dataset the ENTREZ IDs are labelled Gene.ID
isDE_LCLvB <- efit$genes$ENTREZID %in% strictDELCLvB$ENTREZID
isInPathway_MSRG = efit$genes$ENTREZID %in% MSRG$EntrezID
MSRGtab <- table(isDE_LCLvB, isInPathway_MSRG)
MSRGtab
## isInPathway_MSRG
## isDE_LCLvB FALSE TRUE
## FALSE 10791 150
## TRUE 1194 40
# Perform a Chi-square test for this table
chisq.test(MSRGtab)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: MSRGtab
## X-squared = 24.053, df = 1, p-value = 9.371e-07
# Check if we meet the assumptions for a Chi-square test
test <- chisq.test(MSRGtab)
test$expected >= 5
## isInPathway_MSRG
## isDE_LCLvB FALSE TRUE
## FALSE TRUE TRUE
## TRUE TRUE TRUE
# Extension. There may be instances where we the expected values for some of the
# cells in our 2x2 table will be less than 5. In these cases it is more
# appropriate to a a Fishers Exact Test/hypergeometric test. We can run a Fishers
# Exact Test using our contingency table:
# fisher.test(MSRGtab)
## Extension. Consruct a heatmap showing the expression of the Ms risk genes in
## this dataset.
The camera
function is part of the limma
package and can be used to test whether a set of genes is highly ranked relative to ther genes in terms of differential expression, accounting for inter-gene correlation
# We will perform our camera test against a list of gene sets from the Broad
# Institute’s MSigDB c2 collection. The C2 gene sets have been curated from
# online databases, publications and domain experts.
# You will only have to download this once.
load(url("http://bioinf.wehi.edu.au/software/MSigDB/human_c2_v5p2.rdata"))
# We need to make indicies to match these gene sets to the genes in our
# expression data (v)
idx <- ids2indices(Hs.c2, id = v$genes$ENTREZID)
cam.LCLB <- camera(v, idx, design, contrast = "groupLCL")
head(cam.LCLB, 5)
## NGenes Direction PValue
## SOTIRIOU_BREAST_CANCER_GRADE_1_VS_3_UP 136 Up 9.798764e-26
## ROSTY_CERVICAL_CANCER_PROLIFERATION_CLUSTER 129 Up 8.250145e-22
## KOBAYASHI_EGFR_SIGNALING_24HR_DN 224 Up 1.307637e-17
## WHITEFORD_PEDIATRIC_CANCER_MARKERS 103 Up 2.195984e-17
## GRAHAM_NORMAL_QUIESCENT_VS_NORMAL_DIVIDING_DN 80 Up 2.707466e-17
## FDR
## SOTIRIOU_BREAST_CANCER_GRADE_1_VS_3_UP 4.629916e-22
## ROSTY_CERVICAL_CANCER_PROLIFERATION_CLUSTER 1.949097e-18
## KOBAYASHI_EGFR_SIGNALING_24HR_DN 2.059528e-14
## WHITEFORD_PEDIATRIC_CANCER_MARKERS 2.157771e-14
## GRAHAM_NORMAL_QUIESCENT_VS_NORMAL_DIVIDING_DN 2.157771e-14
# The MSigBD c2 list contains some quite specific gene sets that may not be
# relevant to our data. We can filter our camera result to just include pathways
# coming from KEGG/REACTOME/BIOCARTA. These databases contain a smaler set of
# defined/canonical pathways.
cam.LCLB$pathway <- row.names(cam.LCLB)
trimmedCam_LCLvB <- dplyr::filter(cam.LCLB, grepl("KEGG|REACTOME|BIOCARTA", pathway))
head(trimmedCam_LCLvB)
## NGenes Direction PValue FDR
## 1 64 Up 1.699567e-15 6.256862e-13
## 2 66 Up 1.597187e-14 3.773354e-12
## 3 104 Up 3.292503e-13 5.185693e-11
## 4 89 Up 3.729893e-13 5.685078e-11
## 5 77 Up 5.573041e-13 8.228943e-11
## 6 100 Up 7.190446e-13 9.707101e-11
## pathway
## 1 REACTOME_RESPIRATORY_ELECTRON_TRANSPORT
## 2 REACTOME_RESPIRATORY_ELECTRON_TRANSPORT_ATP_SYNTHESIS_BY_CHEMIOSMOTIC_COUPLING_AND_HEAT_PRODUCTION_BY_UNCOUPLING_PROTEINS_
## 3 REACTOME_G1_S_TRANSITION
## 4 REACTOME_SYNTHESIS_OF_DNA
## 5 REACTOME_M_G1_TRANSITION
## 6 REACTOME_TCA_CYCLE_AND_RESPIRATORY_ELECTRON_TRANSPORT
# We can also use this filtering to pull down ownly the pathways that are
# enriched for downregulated genes (same can be done for the pathways enriched in
# upregulated genes).
downCam_LCLvB <- dplyr::filter(trimmedCam_LCLvB, grepl("Down", Direction))
head(downCam_LCLvB)
## NGenes Direction PValue FDR
## 1 300 Down 1.210967e-05 0.0002111372
## 2 77 Down 5.748356e-04 0.0054213541
## 3 23 Down 8.846452e-04 0.0076088087
## 4 14 Down 2.796523e-03 0.0194603415
## 5 16 Down 2.805291e-03 0.0194926474
## 6 12 Down 4.574839e-03 0.0286685875
## pathway
## 1 REACTOME_GENERIC_TRANSCRIPTION_PATHWAY
## 2 KEGG_NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION
## 3 BIOCARTA_MYOSIN_PATHWAY
## 4 KEGG_LINOLEIC_ACID_METABOLISM
## 5 REACTOME_VOLTAGE_GATED_POTASSIUM_CHANNELS
## 6 REACTOME_INTERACTION_BETWEEN_L1_AND_ANKYRINS
# We can make a barcode plot to show where genes in a pathway of interest sit in
# our gene expression data sorted by differential expression.
barcodeplot(efit$t[, "groupLCL"], index = idx$KOBAYASHI_EGFR_SIGNALING_24HR_DN, main = "EGFR signalling genes in LCLvB")
# WEHI also hosts the hallmark MSigDB gene sets which are selected to have
# well-defined biological states or processes. The hallmark lists can be
# downloaded by
# load(url('http://bioinf.wehi.edu.au/software/MSigDB/human_H_v5p2.rdata')). They
# also the host Gene Ontology (GO) gene sets:
# load(url('http://bioinf.wehi.edu.au/software/MSigDB/human_c5_v5p2.rdata'))
## Extension. See if any of the hallmark/GO gene sets are enriched in our dataset.
The following data comes from a manuscript “A distinct influenza infection signature in the blood transcriptome of patients with severe community-acquired pneumonia”.https://www.ncbi.nlm.nih.gov/pubmed/22898401 .
Diagnosis of severe influenza pneumonia remains challenging because of a lack of correlation between the presence of influenza virus and clinical status. Gene-expression profiling was conducted on the whole blood of critically ill patients to identify a gene signature that would allow clinicians to distinguish influenza infection from other causes of severe respiratory failure, such as bacterial pneumonia, and noninfective systemic inflammatory response syndrome. Samples were assayed on Illumina HT-12 gene-expression beadarrays.
# Retrieve dataset from GEO. getGEO gives you a list of expressionsets, as we
# only have one, we might as well just pull out that one (if (length(gset) > 1)
# idx <- grep('GPL6947', attr(gset, 'names')) else idx <- 1 | gset <-
# gset[[idx]])
dir.create("GSE40012")
## Warning in dir.create("GSE40012"): 'GSE40012' already exists
gset <- getGEO("GSE40012", GSEMatrix = TRUE, AnnotGPL = TRUE, destdir = "GSE40012")[[1]]
# Have a look at the summary info for the ExpressionSet gset
# This dataset has many variable labels (including phenotype information). The
# phenotype data in an expressionset is stored in pData. Let's use the phenotype
# data to pull out just the day 1 samples
pheno <- pData(gset)
# colnames(pData(gset))
D1 <- gset[, pheno$`day:ch1` == "1"]
# Phenotype data
pheno <- pData(D1)
# The expression values are stored within 'exprs'. For this dataset the
# expression values have already been normalised using Quantile normalisation,
# but they have not been log transformed. So we will log transform the expression
# values in D1.
exprs(D1) <- log2(exprs(D1))
expVal <- exprs(D1)
# Each probe of the microarray has detailed annotation. To see the specific
# annotation levels we can type fData. Currently the rownames of our expression
# data is the Illumina probe id. This is not convenient for pulling down a gene
# symbol of interest. More gene ID types can be accessed with fData
geneAnno <- fData(D1)
# I wish to shorten the sample types to say bact, flu, mixed, SIRS, HC, As they
# are they are too long to easily display on heatmaps etc.
sampleType <- pheno$`sample type:ch1`
shortNames <- sub("healthy control", "HC", sampleType)
shortNames <- sub("systemic inflammatory response", "SIRS", shortNames)
shortNames <- sub("mixed bacterial and influenza A pneumonia", "mixed", shortNames)
shortNames <- sub("influenza A pneumonia", "flu", shortNames)
shortNames <- sub("bacterial pneumonia", "bact", shortNames)
shortNames
## [1] "SIRS" "bact" "bact" "SIRS" "bact" "bact" "flu" "flu" "bact"
## [10] "flu" "flu" "SIRS" "SIRS" "bact" "HC" "HC" "HC" "HC"
## [19] "HC" "HC" "HC" "HC" "HC" "HC" "HC" "HC" "HC"
## [28] "HC" "HC" "HC" "HC" "HC" "mixed" "SIRS" "SIRS" "SIRS"
## [37] "bact" "bact" "bact" "SIRS" "flu" "flu" "SIRS" "bact" "flu"
## [46] "mixed" "bact" "bact" "bact" "SIRS" "SIRS" "flu" "bact" "bact"
## [55] "SIRS" "mixed" "bact"
# store this phenotype information as a data frame
pheno$group <- shortNames
# Lets take a first look at the dataset. We can perform a PCA analysis.
useGenes <- rowMeans(expVal) > 7
pca <- prcomp(t(expVal[useGenes, ]), scale = TRUE)
autoplot(pca, pheno, colour = "group", size = 2)
autoplot(pca, x = 3, y = 4, pheno, colour = "group", size = 2)
# PCA extension!!! Start again and perform a PCA on the dataset that has up to 5
# days for each patient
# As we have simplified our dataset our analysis model is fairly simple, just
# comparing between the 2 groups
design <- model.matrix(~0 + group, pheno)
design
## groupbact groupflu groupHC groupmixed groupSIRS
## GSM983405 0 0 0 0 1
## GSM983407 1 0 0 0 0
## GSM983416 1 0 0 0 0
## GSM983418 0 0 0 0 1
## GSM983423 1 0 0 0 0
## GSM983428 1 0 0 0 0
## GSM983429 0 1 0 0 0
## GSM983434 0 1 0 0 0
## GSM983439 1 0 0 0 0
## GSM983441 0 1 0 0 0
## GSM983445 0 1 0 0 0
## GSM983452 0 0 0 0 1
## GSM983455 0 0 0 0 1
## GSM983456 1 0 0 0 0
## GSM983459 0 0 1 0 0
## GSM983461 0 0 1 0 0
## GSM983463 0 0 1 0 0
## GSM983465 0 0 1 0 0
## GSM983467 0 0 1 0 0
## GSM983469 0 0 1 0 0
## GSM983471 0 0 1 0 0
## GSM983473 0 0 1 0 0
## GSM983475 0 0 1 0 0
## GSM983477 0 0 1 0 0
## GSM983479 0 0 1 0 0
## GSM983481 0 0 1 0 0
## GSM983483 0 0 1 0 0
## GSM983485 0 0 1 0 0
## GSM983487 0 0 1 0 0
## GSM983489 0 0 1 0 0
## GSM983491 0 0 1 0 0
## GSM983493 0 0 1 0 0
## GSM983497 0 0 0 1 0
## GSM983501 0 0 0 0 1
## GSM983505 0 0 0 0 1
## GSM983508 0 0 0 0 1
## GSM983510 1 0 0 0 0
## GSM983514 1 0 0 0 0
## GSM983515 1 0 0 0 0
## GSM983518 0 0 0 0 1
## GSM983524 0 1 0 0 0
## GSM983529 0 1 0 0 0
## GSM983534 0 0 0 0 1
## GSM983539 1 0 0 0 0
## GSM983544 0 1 0 0 0
## GSM983547 0 0 0 1 0
## GSM983551 1 0 0 0 0
## GSM983555 1 0 0 0 0
## GSM983557 1 0 0 0 0
## GSM983562 0 0 0 0 1
## GSM983570 0 0 0 0 1
## GSM983574 0 1 0 0 0
## GSM983577 1 0 0 0 0
## GSM983578 1 0 0 0 0
## GSM983583 0 0 0 0 1
## GSM983585 0 0 0 1 0
## GSM983590 1 0 0 0 0
## attr(,"assign")
## [1] 1 1 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
colnames(design) <- sub("group", "", colnames(design))
cont.matrix <- makeContrasts(fluvsbact = flu - bact, levels = design)
cont.matrix
## Contrasts
## Levels fluvsbact
## bact -1
## flu 1
## HC 0
## mixed 0
## SIRS 0
# fit the model
fit <- lmFit(D1, design)
# fit the model for the specified contrasts. For us this is only an option of 1,
# but if we had more than 2 groups here we could define what groups to compare
fit2 <- contrasts.fit(fit, cont.matrix)
# Create moderated test statistics using eBayes
efit <- eBayes(fit2)
# Quick look at how many genes are differentially expressed
dt_fdr <- decideTests(efit, adjust.method = "fdr", p.value = 0.05)
summary(dt_fdr)
## fluvsbact
## Down 612
## NotSig 46881
## Up 1261
# How does the number of DE change if we instead use the more stringent
# bonferroni correction?
dt_bonf <- decideTests(efit, adjust.method = "bonferroni", p.value = 0.05)
summary(dt_bonf)
## fluvsbact
## Down 18
## NotSig 48599
## Up 137
# histogram of p value distribution for groupLCL
hist(efit$p.value[, "fluvsbact"])
# volcano plot
volcanoplot(efit, coef = "fluvsbact", highlight = 10, names = geneAnno$`Gene symbol`)
# Generate a table of the genes significantly differentially expressed in
# fluvsbact
DE_fluvsbact <- topTable(efit, coef = "fluvsbact", adjust.method = "fdr", p.value = 0.05,
number = Inf)
# we can write this table to a file
write.csv(DE_fluvsbact, file = "fluvsbact.csv")
# Generate a mean-difference plot (MD-plot)
colnames(efit)
## [1] "fluvsbact"
glMDPlot(efit, coef = 1, status = dt_bonf[, 1], main = colnames(efit)[1], side.main = "UNIQUEID",
counts = expVal, groups = pheno$group, launch = TRUE)
# Get a list of the top 40 most DE genes. We will use this to draw a heatmap
topGenes <- rownames(DE_fluvsbact)[1:40]
# pull out the expression data for these 40 genes
i <- which(rownames(D1) %in% topGenes)
coolmap(expVal[i, ], labCol = pheno$group, labRow = DE_fluvsbact$Gene.symbol, margin = c(8,
6), cexRow = 0.7, cexCol = 0.7)
# Now we can pull down our gene of interest by its gene symbol.
useGene <- rownames(expVal)[which(geneAnno$`Gene symbol` == "IFI27")]
df <- data.frame(pheno, IFI27 = expVal[useGene, ])
ggplot(df, aes(x = group, y = IFI27, colour = group)) + geom_boxplot() + geom_point() +
theme_classic()
This list of ISGs was obtained from the following publication. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3409588/ . This is an example of how we can test for over-representation of a curated gene list/list of DE genes from a related study in our data.
ISG <- read.csv("https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/ISG.txt",
sep = "\t")
# This gene list only has gene symbols. Lets add ENTREZIDs to this list using
# mapIds.
ISG$ENTREZID <- mapIds(Homo.sapiens, keys = as.character(ISG$Gene.symbol), column = ("ENTREZID"),
keytype = "SYMBOL", multiVals = "first")
## 'select()' returned 1:1 mapping between keys and columns
# make our 2x2 contingency table
# In the flu dataset the ENTREZ IDs are labelled Gene.ID
isDE <- efit$genes$ID %in% DE_fluvsbact$ID
isInPathway_ISG = efit$genes$`Gene ID` %in% ISG$ENTREZID
ISGtab <- table(isDE, isInPathway_ISG)
ISGtab
## isInPathway_ISG
## isDE FALSE TRUE
## FALSE 46451 430
## TRUE 1798 75
# Perform a Chi-square test for this table
chisq.test(ISGtab)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: ISGtab
## X-squared = 164.44, df = 1, p-value < 2.2e-16
# Check we meet the assumptions for a Chi-square test
test <- chisq.test(ISGtab)
test$expected >= 5
## isInPathway_ISG
## isDE FALSE TRUE
## FALSE TRUE TRUE
## TRUE TRUE TRUE
# Extension. There may be instances where we the expected values for some of the
# cells in our 2x2 table will be less than 5. In these cases it is more
# appropriate to a a Fishers Exact Test/hypergeometric test. We can run a Fishers
# Exact Test using our contingency table:
# fisher.test(ISGtab)
The camera
function is part of the limma
package and can be used to test whether a set of genes is highly ranked relative to ther genes in terms of differential expression, accounting for inter-gene correlation
# We will perform our camera test against a list of gene sets from the Broad
# Institute’s MSigDB c2 collection. The C2 gene sets have been curated from
# online databases, publications and domain experts.
# You will only have to download this once.
load(url("http://bioinf.wehi.edu.au/software/MSigDB/human_c2_v5p2.rdata"))
# We need to make indicies to match these gene sets to the genes in our
# expression data (v)
idx <- ids2indices(Hs.c2, id = efit$genes$`Gene ID`)
cam.fluVsBact <- camera(exprs(D1), idx, design, contrast = cont.matrix[, 1])
head(cam.fluVsBact, 5)
## NGenes Direction PValue
## GRAHAM_NORMAL_QUIESCENT_VS_NORMAL_DIVIDING_DN 129 Up 1.884114e-58
## SOTIRIOU_BREAST_CANCER_GRADE_1_VS_3_UP 211 Up 4.763989e-53
## KANG_DOXORUBICIN_RESISTANCE_UP 78 Up 1.613472e-52
## ROSTY_CERVICAL_CANCER_PROLIFERATION_CLUSTER 206 Up 1.043450e-49
## CROONQUIST_IL6_DEPRIVATION_DN 154 Up 1.414259e-45
## FDR
## GRAHAM_NORMAL_QUIESCENT_VS_NORMAL_DIVIDING_DN 8.909977e-55
## SOTIRIOU_BREAST_CANCER_GRADE_1_VS_3_UP 1.126445e-49
## KANG_DOXORUBICIN_RESISTANCE_UP 2.543370e-49
## ROSTY_CERVICAL_CANCER_PROLIFERATION_CLUSTER 1.233619e-46
## CROONQUIST_IL6_DEPRIVATION_DN 1.337607e-42
# The MSigBD c2 list contains some quite specific gene sets that may not be
# relevant to our data. We can filter our camera result to just include pathways
# coming from KEGG/REACTOME/BIOCARTA. These databases contain a smaler set of
# defined/canonical pathways.
cam.fluVsBact$pathway <- row.names(cam.fluVsBact)
trimmedCam_fluVsBact <- dplyr::filter(cam.fluVsBact, grepl("KEGG|REACTOME|BIOCARTA",
pathway))
head(trimmedCam_fluVsBact)
## NGenes Direction PValue FDR
## 1 72 Up 1.027524e-25 1.104355e-23
## 2 53 Up 2.768964e-25 2.909874e-23
## 3 48 Up 6.905666e-24 6.047573e-22
## 4 61 Up 2.180679e-23 1.747870e-21
## 5 44 Up 5.456161e-23 4.229867e-21
## 6 288 Up 3.531181e-21 2.420138e-19
## pathway
## 1 REACTOME_G2_M_CHECKPOINTS
## 2 KEGG_DNA_REPLICATION
## 3 REACTOME_DNA_STRAND_ELONGATION
## 4 REACTOME_ACTIVATION_OF_ATR_IN_RESPONSE_TO_REPLICATION_STRESS
## 5 REACTOME_ACTIVATION_OF_THE_PRE_REPLICATIVE_COMPLEX
## 6 REACTOME_DNA_REPLICATION
# We can also use this filtering to pull down ownly the pathways that are
# enriched for downregulated genes (same can be done for the pathways enriched in
# upregulated genes).
downCam_fluVsBact <- dplyr::filter(trimmedCam_fluVsBact, grepl("Down", Direction))
head(downCam_fluVsBact)
## NGenes Direction PValue FDR
## 1 39 Down 3.664485e-05 0.0003239131
## 2 76 Down 4.312942e-05 0.0003749247
## 3 32 Down 8.121147e-05 0.0006773352
## 4 50 Down 2.523134e-04 0.0018356771
## 5 72 Down 4.757641e-04 0.0033038009
## 6 46 Down 5.316315e-04 0.0036383288
## pathway
## 1 BIOCARTA_HCMV_PATHWAY
## 2 REACTOME_IL_3_5_AND_GM_CSF_SIGNALING
## 3 REACTOME_REGULATION_OF_SIGNALING_BY_CBL
## 4 REACTOME_RECYCLING_PATHWAY_OF_L1
## 5 REACTOME_IL_2_SIGNALING
## 6 REACTOME_LATENT_INFECTION_OF_HOMO_SAPIENS_WITH_MYCOBACTERIUM_TUBERCULOSIS
# We can make a barcode plot to show where genes in a pathway of interest sit in
# our gene expression data sorted by differential expression.
barcodeplot(efit$t[, "fluvsbact"], index = idx$KOBAYASHI_EGFR_SIGNALING_24HR_DN,
main = "EGFR signalling genes in fluvsbact")
# WEHI also hosts the hallmark MSigDB gene sets which are selected to have
# well-defined biological states or processes. The hallmark lists can be
# downloaded by
# load(url('http://bioinf.wehi.edu.au/software/MSigDB/human_H_v5p2.rdata')). They
# also the host Gene Ontology (GO) gene sets:
# load(url('http://bioinf.wehi.edu.au/software/MSigDB/human_c5_v5p2.rdata'))
## Extension. See if any of the hallmark/GO gene sets are enriched in our dataset.
You may wish to browse the RNA-seq 1-2-3 protocol published in F1000Research which performs an RNAseq analysis on different cell populations harvested from mice mammary gland tissue. You could work through the code in RStudio step by step for extra practice. Note that their workflow will proceed all the way to differential gene expression testing and pathway analysis which we will not visit until weeks 8 and 9. See https://f1000research.com/articles/5-1408 .
Only this section needs to be included in your report. This report will be due at the end of Week 9, Friday 11:59pm 1st May.
While I expect that you should explore the TCGA melanoma data, I am not opposed to you submiting a report on a dataset that you are incredibly engaged with.
Report purpose
The purpose of this report is to:
Provide you with an opportunity to apply the methods and skills you have been exposed to in class.
Create a document that can aid you when asking for feedback from tutors.
Practice creating reports for your mid-semester exam.
Because of this, do not feel restricted to using the dataset I have provided. Feel free to download any dataset that you find interesting and apply what you have learnt in class.
Report guidelines
There are no hard and fast guidelines to the final content of your submitted lab reports. For this lab you will be assessed on your ability to generate statistical questions, explore these with graphical summaries and interpret your findings. Your report will also need to be well-presented:
It is expected that your report will construct and communicate an interesting story in 4 - 6 paragraphs (ish). To do this, you should be a ‘bad’ scientist and explore the data until you find something that you think is interesting, or, can use to address the marking criteria. When preparing your report always think “is your report something that you would be proud to show your friends?”, “would your family be interested in the conclusions you made?” and “would they find it easy to read?”
Marking criteria
Lab instructions
There are at least 200 forms of cancer, and many more subtypes. Each of these is caused by errors in DNA that cause cells to grow uncontrolled. Identifying the changes in each cancer’s complete set of DNA - its genome - and understanding how such changes interact to drive the disease will lay the foundation for improving cancer prevention, early detection and treatment.
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), has generated comprehensive, multi-dimensional maps of the key genomic changes in 33 types of cancer. The TCGA dataset, 2.5 petabytes of data describing tumor tissue and matched normal tissues from more than 11,000 patients, is publically available and has been used widely by the research community. The data have contributed to more than a thousand studies of cancer by independent researchers and to the TCGA research network publications.
This data can be downloaded from the data portal in TCGA (https://portal.gdc.cancer.gov/) but is in an aweful format. Thankfully, the team at Recount2 https://jhubiostatistics.shinyapps.io/recount/ have reprocessed all of the TCGA data. In the following, I have code to download the TCGA melanoma data. Many of these code lines only need to be run once.
Read in the data and experiment with normalization.
Can you see any structure in a PCA?
Is this structure explained by any variables in the clinical data?
Let’s start by reading in the data.
### Install recount from Bioconductor, do this once only!!!!
### install.packages('BiocManager') BiocManager::install('recount')
library(recount)
library(dplyr)
### Browse the vignetets for a quick description of how to use the package
### browseVignettes('recount')
### Download the RangedSummarizedExperiment object at the gene level for TCGA
### melanoma data. url <-
### 'http://duffel.rail.bio/recount/v2/TCGA/rse_gene_skin.Rdata' View the url for
### the file by printing the object url url Download the file download.file(url =
### url, destfile = 'rse_gene_skin.Rdata', method = 'auto', mode = 'wb')
### Load the data
load("rse_gene_skin.Rdata")
### rse_gene is a RangesSummarizedExperiment
rse_gene
### Gene counts
counts <- assay(rse_gene)
### Gene identifiers
genes <- rowData(rse_gene)
### Phenotype data for the samples
samples <- as.data.frame(colData(rse_gene))
### There are 864 variables in samples, 593 are completely missing
samples %>% is.na() %>% colSums() %>% table()