Aims:
Perform a differential expression analysis.
Correct for multiple hypothesis testing.
Visualise results.
Please make an effort to complete your own report by typing your own code and (most importantly) expressing your own conclusions and interpretations. Try to use your expanding knowledge of R to modify this example code in places where you think it can be altered/improved. Ultimately, this workflow will go from obtaining data from GEO through to pathway analysis, however we will complete the workflow in 3 stages. In week 7 we will focus on obtaining the data and getting through to PCA analysis. In week 8 we will perfom differential gene expression analysis and some visualisations, and in week 9 we will get to the pathway analysis. The workflow uses popular bioconductor packages and is adapted from the published RNAseq 1-2-3 workflow (https://f1000research.com/articles/5-1408).
The following data comes from a manuscript “Evidence from genome wide association studies implicates reduced control of Epstein-Barr virus infection in multiple sclerosis susceptibility” (https://www.ncbi.nlm.nih.gov/pubmed/31039804). There were multiple goals for the RNAseq component of project:
Identify the gene expression changes that occur when B cells are infected with EBV and transformed into LCLs.
Determine which pathways are over-represented in the differentially expressed genes.
Assess if there are more MS risk genes then expected by chance in the differentially expressed genes.
Use this dataset as an opportunity to explore the behaviour of a small RNA-Seq experiment. In this lab we will start by just processing and exploring the data.
IMPORTANT: The code from week 8 continues on from the code generated in week 7. The new steps introduced this week start from step 5.
1A. Install / load required R packages
library(tidyverse)
library(ggfortify)
## Warning: package 'ggfortify' was built under R version 4.1.3
library(RColorBrewer)
# If any packages of these required are not installed, they can be installed by
# first installing bioconductor: if (!requireNamespace('BiocManager', quietly =
# TRUE)) install.packages('BiocManager') then using
# BiocManager::install('PACKAGENAME')
# load packages to be used in the analysis
library(GEOquery)
library(org.Hs.eg.db)
## Warning: package 'S4Vectors' was built under R version 4.1.1
library(limma)
library(edgeR)
## Warning: package 'edgeR' was built under R version 4.1.1
library(Glimma)
library(Homo.sapiens)
## Warning: package 'GenomicFeatures' was built under R version 4.1.1
## Warning: package 'GenomeInfoDb' was built under R version 4.1.1
library(gplots)
library(dplyr)
1B. Read in the data.
# The count data for this GEO submission is available as a supplementary file
# on GEO. You only want to run this once...
sfiles <- getGEOSuppFiles("GSE126379")
fnames <- rownames(sfiles)
# There is only one supplemental file
Data = read.delim(fnames[1], header = TRUE)
head(Data)
## Chromosome Gene.Name Gene.ID Sum.of.Exon.Length.of.Gene
## 1 chr10 A1CF A1CF 9529
## 2 chr10 ABCC2 ABCC2 5051
## 3 chr10 ABI1 ABI1 3776
## 4 chr10 ABLIM1 ABLIM1 8303
## 5 chr10 ACADSB ACADSB 5941
## 6 chr10 ACBD5 ACBD5 4066
## Reads.Counts..BLCL_A1. RPKM..BLCL_A1. Reads.Counts..BLCL_A2. RPKM..BLCL_A2.
## 1 2 0.0113048 1 0.009967274
## 2 21 0.2239352 42 0.789760929
## 3 1975 28.1718559 1436 36.119869850
## 4 786 5.0987985 2619 29.958784090
## 5 756 6.8539761 543 8.680884936
## 6 535 7.0870705 421 9.834186588
## Reads.Counts..BLCL_B1. RPKM..BLCL_B1. Reads.Counts..BLCL_B2. RPKM..BLCL_B2.
## 1 1 0.01277064 4 0.02185651
## 2 10 0.24092549 89 0.91744656
## 3 563 18.14414625 2307 31.81147083
## 4 220 3.22439153 4464 27.99350563
## 5 210 4.30149944 877 7.68613807
## 6 200 5.98580756 674 8.63099098
## Reads.Counts..BLCL_C1. RPKM..BLCL_C1. Reads.Counts..BLCL_C2. RPKM..BLCL_C2.
## 1 0 0.0000000 11 0.122567
## 2 13 0.2041644 63 1.324315
## 3 695 14.6004734 1235 34.726663
## 4 400 3.8215458 2600 33.248066
## 5 264 3.5249949 568 10.151188
## 6 304 5.9308963 492 12.847712
## Reads.Counts..BLCL_D1. RPKM..BLCL_D1. Reads.Counts..BLCL_D2. RPKM..BLCL_D2.
## 1 2 0.02147553 10 0.04834247
## 2 25 0.50643522 163 1.48657342
## 3 847 22.95158503 2806 34.23196323
## 4 626 7.71437158 5804 32.20093570
## 5 272 4.68457946 1065 8.25783462
## 6 250 6.29120581 1139 12.90424131
## Reads.Counts..BLCL_E1. RPKM..BLCL_E1. Reads.Counts..BLCL_E2. RPKM..BLCL_E2.
## 1 1 0.01387007 7 0.04338079
## 2 54 1.41300065 130 1.51989185
## 3 986 34.51206142 2471 38.64447949
## 4 177 2.81750179 4484 31.89167049
## 5 302 6.71851610 1039 10.32767820
## 6 313 10.17426241 932 13.53615256
# Next time you can just run... fnames <-
# 'GSE126379/GSE126379_BLCL_processed.txt.gz' Data =
# read.delim(fnames[1],header=TRUE) head(Data)
# MANIPULATE GENE ANNOTATIONS
# Genes have multiple 'identifiers' in different databases. Later on we will
# need identifiers from the ENTREZ database. To save us some pain later, lets
# extract these now using the select() function from the AnnotationDbi package.
# The ENTREZ identifiers can be found in the Homo.sapiens package.
ENTREZID <- mapIds(Homo.sapiens, keys = as.character(Data$Gene.Name), column = c("ENTREZID"),
keytype = "SYMBOL", multiVals = "first")
## 'select()' returned 1:many mapping between keys and columns
genes <- data.frame(ENTREZID, SYMBOL = Data$Gene.Name)
# In this file of counts, there are multiple instances where the same gene
# symbol or entrez id appears in multiple rows. This may be due to how it was
# initially generated but as we are not able to distinguish between them (and
# in most cases the count information is very similar for the duplicate rows,
# we will just ask for just one row to be kept for any given gene name)
chooseGenes <- which(!duplicated(genes$SYMBOL) & !duplicated(genes$ENTREZID))
genes <- genes[chooseGenes, ]
Data <- Data[chooseGenes, ]
# MANIPULATE GENE COUNTS
# We do not need the RPKMs just the read counts, so lets just pull out these
# columns
colnames(Data)
## [1] "Chromosome" "Gene.Name"
## [3] "Gene.ID" "Sum.of.Exon.Length.of.Gene"
## [5] "Reads.Counts..BLCL_A1." "RPKM..BLCL_A1."
## [7] "Reads.Counts..BLCL_A2." "RPKM..BLCL_A2."
## [9] "Reads.Counts..BLCL_B1." "RPKM..BLCL_B1."
## [11] "Reads.Counts..BLCL_B2." "RPKM..BLCL_B2."
## [13] "Reads.Counts..BLCL_C1." "RPKM..BLCL_C1."
## [15] "Reads.Counts..BLCL_C2." "RPKM..BLCL_C2."
## [17] "Reads.Counts..BLCL_D1." "RPKM..BLCL_D1."
## [19] "Reads.Counts..BLCL_D2." "RPKM..BLCL_D2."
## [21] "Reads.Counts..BLCL_E1." "RPKM..BLCL_E1."
## [23] "Reads.Counts..BLCL_E2." "RPKM..BLCL_E2."
counts <- Data[, grep("Counts", colnames(Data))]
# LCL are labelled 1, B cell labelled 2. Lets rename the columns to reflect
# this
colnames(counts) <- sub("Reads.Counts..BLCL_", "", colnames(counts))
colnames(counts) <- sub("1", "LCL", colnames(counts))
colnames(counts) <- sub("2", "CD19B", colnames(counts))
colnames(counts) <- sub("CD19B.", "_CD19B", colnames(counts))
colnames(counts) <- sub("LCL.", "_LCL", colnames(counts))
colnames(counts)
## [1] "A_LCL" "A_CD19B" "B_LCL" "B_CD19B" "C_LCL" "C_CD19B" "D_LCL"
## [8] "D_CD19B" "E_LCL" "E_CD19B"
# make the Gene IDs the name of the rows
rownames(counts) <- genes$SYMBOL
# MANIPULATE SAMPLE ANNOTATIONS
# This is a function that will be in the next release of tidyverse. We can use
# it to split a string on a certain pattern and then select things that are
# before or after than split
str_split_n <- function(string, pattern, n) {
out <- str_split(string, pattern, simplify = TRUE)
apply(out, 1, `[`, i = n)
}
# use sample names to make factors for group and individual
group <- str_split_n(colnames(counts), "_", 2) %>%
factor()
individual <- str_split_n(colnames(counts), "_", 1) %>%
factor()
# put this factor information into a data frame
samples <- data.frame(group, individual)
samples
## group individual
## 1 LCL A
## 2 CD19B A
## 3 LCL B
## 4 CD19B B
## 5 LCL C
## 6 CD19B C
## 7 LCL D
## 8 CD19B D
## 9 LCL E
## 10 CD19B E
# MAKE DGEList
# make DGEList object to connect gene counts with sample information
BLCL <- DGEList(counts = counts, samples = samples, genes = genes)
BLCL
## An object of class "DGEList"
## $counts
## A_LCL A_CD19B B_LCL B_CD19B C_LCL C_CD19B D_LCL D_CD19B E_LCL E_CD19B
## A1CF 2 1 1 4 0 11 2 10 1 7
## ABCC2 21 42 10 89 13 63 25 163 54 130
## ABI1 1975 1436 563 2307 695 1235 847 2806 986 2471
## ABLIM1 786 2619 220 4464 400 2600 626 5804 177 4484
## ACADSB 756 543 210 877 264 568 272 1065 302 1039
## 19731 more rows ...
##
## $samples
## group lib.size norm.factors individual
## A_LCL LCL 16983549 1 A
## A_CD19B CD19B 9547232 1 A
## B_LCL LCL 7512768 1 B
## B_CD19B CD19B 17456784 1 B
## C_LCL LCL 11519981 1 C
## C_CD19B CD19B 8466787 1 C
## D_LCL LCL 8904409 1 D
## D_CD19B CD19B 19546860 1 D
## E_LCL LCL 6700254 1 E
## E_CD19B CD19B 15214868 1 E
##
## $genes
## ENTREZID SYMBOL
## A1CF 29974 A1CF
## ABCC2 1244 ABCC2
## ABI1 10006 ABI1
## ABLIM1 3983 ABLIM1
## ACADSB 36 ACADSB
## 19731 more rows ...
# COUNTS NORMALIZATION
# First we calculate normalization factors
BLCL <- calcNormFactors(BLCL, method = "TMM")
# make a copy of BLCLDGEList to use in extension tab.
BLCL2 <- BLCL
# show the nomalisation factors. For this dataset the effect of
# TMM-normalisation is mild, as evident in the magnitude of the scaling
# factors, which are all relatively close to 1.
BLCL$samples$norm.factors
## [1] 1.0160334 1.0127690 0.9140066 0.9548327 0.9229527 1.0996109 0.8942405
## [8] 1.0524337 1.1074949 1.0526784
# Next we will keep genes with large enough counts to be considered useful.
keep <- filterByExpr(BLCL)
BLCL <- BLCL[keep, ]
# Calculate the counts per million reads for each gene for all samples
cpm <- cpm(BLCL)
# now calculate the log of these CPM values
lcpm <- cpm(BLCL, log = TRUE)
# Perform PCA
pca <- prcomp(t(lcpm), scale = TRUE)
# Plot results
autoplot(pca, data = samples, colour = "group", shape = "individual")
# Another method for dimension reduction is called MDS.
plotMDS(lcpm, labels = group)
plotMDS(lcpm, labels = individual)
# I also like this cool interactive version of a MDS plot which can be
# generated using Glimma
glMDSPlot(lcpm, labels = paste(group, individual, sep = "_"), groups = BLCL$samples[,
c("group", "individual")], launch = TRUE)
limma
is a versatile package for DE analysis of microarray data. However, the count based data from RNAseq has certain stastical properties that are different to microarray based data, namely that in RNAseq count data, the variance is not independent of the mean. The authors of limma
delevoped the voom
function which accounts for this and modifies the RNAseq count data for use with the limma
package.
# We first need to set up our design matrix, where we tell limma that we are
# interested in fitting the gene expression data against group, but also to
# include the 'individual' term, as this will account for the samples being
# paired.
design <- model.matrix(~individual + group)
design
## (Intercept) individualB individualC individualD individualE groupLCL
## 1 1 0 0 0 0 1
## 2 1 0 0 0 0 0
## 3 1 1 0 0 0 1
## 4 1 1 0 0 0 0
## 5 1 0 1 0 0 1
## 6 1 0 1 0 0 0
## 7 1 0 0 1 0 1
## 8 1 0 0 1 0 0
## 9 1 0 0 0 1 1
## 10 1 0 0 0 1 0
## attr(,"assign")
## [1] 0 1 1 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$individual
## [1] "contr.treatment"
##
## attr(,"contrasts")$group
## [1] "contr.treatment"
# Voom starts with the count data, transforms it into logCPM, estimates the
# mean-variance relationship, and uses this to compute appropriate
# observation-level weights. The data can then be used for linear modelling in
# the same way as microarray data. voom generates an EList object, which is
# similar to the DGEList-object
v <- voom(BLCL, design, plot = TRUE)
# The voom plot shows that whilst not strong, there is some relationship
# between variance and expression level in this data.
v
## An object of class "EList"
## $genes
## ENTREZID SYMBOL
## ABCC2 1244 ABCC2
## ABI1 10006 ABI1
## ABLIM1 3983 ABLIM1
## ACADSB 36 ACADSB
## ACBD5 91452 ACBD5
## 12063 more rows ...
##
## $targets
## group lib.size norm.factors individual
## A_LCL LCL 17255853 1.0160334 A
## A_CD19B CD19B 9669140 1.0127690 A
## B_LCL LCL 6866719 0.9140066 B
## B_CD19B CD19B 16668308 0.9548327 B
## C_LCL LCL 10632397 0.9229527 C
## C_CD19B CD19B 9310172 1.0996109 C
## D_LCL LCL 7962683 0.8942405 D
## D_CD19B CD19B 20571775 1.0524337 D
## E_LCL LCL 7420497 1.1074949 E
## E_CD19B CD19B 16016363 1.0526784 E
##
## $E
## A_LCL A_CD19B B_LCL B_CD19B C_LCL C_CD19B D_LCL
## ABCC2 0.3172508 2.136003 0.6126962 2.424780 0.3444923 2.769877 1.679171
## ABI1 6.8389882 7.214954 6.3586506 7.113079 6.0315115 7.052071 6.733815
## ABLIM1 5.5102890 8.081688 5.0050136 8.065247 5.2352633 8.125765 6.297916
## ACADSB 5.4541823 5.812748 4.9380552 5.718219 4.6367287 5.932209 5.096858
## ACBD5 4.9557288 5.446001 4.8678372 5.338639 4.8399033 5.725172 4.975412
## D_CD19B E_LCL E_CD19B
## ABCC2 2.990552 2.876668 3.026431
## ABI1 7.091962 7.054659 7.269696
## ABLIM1 8.140362 4.580159 8.129257
## ACADSB 5.694720 5.349275 6.020199
## ACBD5 5.791591 5.400806 5.863485
## 12063 more rows ...
##
## $weights
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 2.319278 2.965328 1.646495 4.461806 2.045538 3.293177 2.313182 6.704464
## [2,] 9.353846 9.566680 9.803855 9.164004 10.006231 9.890381 9.998514 8.609606
## [3,] 9.969749 8.716015 7.726256 8.093896 9.193867 8.919808 9.362190 7.169634
## [4,] 9.893986 9.828794 7.327569 9.996411 8.547293 9.433528 7.927412 9.832933
## [5,] 9.579133 9.314521 6.965230 9.949799 8.658099 9.352200 8.024839 9.874966
## [,9] [,10]
## [1,] 2.789916 7.255031
## [2,] 9.993404 8.746299
## [3,] 7.578734 8.368473
## [4,] 8.314969 9.876232
## [5,] 8.325836 9.935726
## 12063 more rows ...
##
## $design
## (Intercept) individualB individualC individualD individualE groupLCL
## 1 1 0 0 0 0 1
## 2 1 0 0 0 0 0
## 3 1 1 0 0 0 1
## 4 1 1 0 0 0 0
## 5 1 0 1 0 0 1
## 6 1 0 1 0 0 0
## 7 1 0 0 1 0 1
## 8 1 0 0 1 0 0
## 9 1 0 0 0 1 1
## 10 1 0 0 0 1 0
## attr(,"assign")
## [1] 0 1 1 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$individual
## [1] "contr.treatment"
##
## attr(,"contrasts")$group
## [1] "contr.treatment"
# Linear modelling in limma is carried out using the 'lmFit' and
# 'contrasts.fit' functions. These functions fit a separate model to the
# expression values for each gene. We now tell limma to fit the model specified
# in our design matrix for the voom transformed data
fit <- lmFit(v, design)
# Next, empirical Bayes moderation is carried out by borrowing information
# across all genes to obtain more precise estimates of gene-wise variability.
efit <- eBayes(fit)
# Quick look at how many genes are differentially expressed
dt_fdr <- decideTests(efit, adjust.method = "fdr", p.value = 0.05)
summary(dt_fdr)
## (Intercept) individualB individualC individualD individualE groupLCL
## Down 350 1 0 1 10 4192
## NotSig 936 12066 12062 12065 12049 3583
## Up 10782 1 6 2 9 4293
# The coefficient we are interested in is 'groupLCL'. Does the number of DE
# genes change if we instead use the more stringent bonferroni correction?
dt_bonf <- decideTests(efit, adjust.method = "bonferroni", p.value = 0.05)
summary(dt_bonf)
## (Intercept) individualB individualC individualD individualE groupLCL
## Down 33 1 0 1 5 794
## NotSig 3081 12067 12066 12066 12061 10062
## Up 8954 0 2 1 2 1212
# histogram of p value distribution for groupLCL colnames(efit)
hist(efit$p.value[, "groupLCL"])
# volcano plot of results for groupLCL
volcanoplot(efit, coef = "groupLCL", highlight = 10, names = efit$genes$SYMBOL)
# Generate a table of the genes significantly differentially expressed in LCLs
# v B cells (groupLCL)
DELCLvB <- topTable(efit, coef = "groupLCL", adjust.method = "fdr", p.value = 0.05,
number = Inf)
# we can write this table to a file
write.csv(DELCLvB, file = "DELCLvB.csv")
# Generate a mean-difference plot (MD-plot) for the group comparison (groupLCL,
# which is column 6 of the efit object)
useCoef <- which(colnames(efit) == "groupLCL")
plotMD(efit, column = useCoef, status = dt_fdr[, useCoef], main = colnames(efit)[useCoef],
)
# we can also generate an interactive MDplot using Glimma
glMDPlot(efit, coef = useCoef, status = dt_fdr[, useCoef], main = colnames(efit)[useCoef],
side.main = "ENTREZID", counts = lcpm, groups = group, launch = TRUE)
# Get a list of the top 40 most DE genes. We will use this to draw a heatmap
LCLvB.topgenes <- DELCLvB$SYMBOL[1:40]
# pull out the voom expression data for these 40 genes
i <- which(v$genes$SYMBOL %in% LCLvB.topgenes)
coolmap(lcpm[i, ], margin = c(8, 6), cexRow = 0.7, cexCol = 0.7)
# You could save this with the pdf() function.
pdf("heatmap.pdf", width = 12, height = 16)
coolmap(lcpm[i, ], margin = c(8, 6))
dev.off()
## png
## 2
# You might like to look at the library pheatmap
CD40 <- lcpm["CD40", ]
df <- data.frame(CD40, group, individual)
ggplot(df, aes(x = group, y = CD40, colour = individual)) + geom_point() + geom_line(aes(group = individual)) +
theme_classic()
The following data comes from a manuscript “A distinct influenza infection signature in the blood transcriptome of patients with severe community-acquired pneumonia”.https://www.ncbi.nlm.nih.gov/pubmed/22898401 .
Diagnosis of severe influenza pneumonia remains challenging because of a lack of correlation between the presence of influenza virus and clinical status. Gene-expression profiling was conducted on the whole blood of critically ill patients to identify a gene signature that would allow clinicians to distinguish influenza infection from other causes of severe respiratory failure, such as bacterial pneumonia, and noninfective systemic inflammatory response syndrome. Samples were assayed on Illumina HT-12 gene-expression beadarrays.
# Retrieve dataset from GEO. getGEO gives you a list of expressionsets, as we
# only have one, we might as well just pull out that one (if (length(gset) > 1)
# idx <- grep('GPL6947', attr(gset, 'names')) else idx <- 1 | gset <-
# gset[[idx]])
# run this line if you get a VROOM error
Sys.setenv(VROOM_CONNECTION_SIZE = 131072 * 2)
dir.create("GSE40012")
## Warning in dir.create("GSE40012"): 'GSE40012' already exists
gset <- getGEO("GSE40012", GSEMatrix = TRUE, AnnotGPL = TRUE, destdir = "GSE40012")[[1]]
# Have a look at the summary info for the ExpressionSet
gset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 48754 features, 190 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM983403 GSM983404 ... GSM983592 (190 total)
## varLabels: title geo_accession ... tissue:ch1 (41 total)
## varMetadata: labelDescription
## featureData
## featureNames: ILMN_1343291 ILMN_1343295 ... ILMN_2416019 (48754
## total)
## fvarLabels: ID Gene title ... Platform_SEQUENCE (22 total)
## fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: 22898401
## Annotation: GPL6947
# This dataset has many variable labels (including phenotype information). The
# phenotype data in an expressionset is stored in pData. Let's use the
# phenotype data to pull out just the day 1 samples
pheno <- pData(gset)
colnames(pData(gset))
## [1] "title" "geo_accession"
## [3] "status" "submission_date"
## [5] "last_update_date" "type"
## [7] "channel_count" "source_name_ch1"
## [9] "organism_ch1" "characteristics_ch1"
## [11] "characteristics_ch1.1" "characteristics_ch1.2"
## [13] "characteristics_ch1.3" "characteristics_ch1.4"
## [15] "molecule_ch1" "extract_protocol_ch1"
## [17] "label_ch1" "label_protocol_ch1"
## [19] "taxid_ch1" "hyb_protocol"
## [21] "scan_protocol" "description"
## [23] "data_processing" "platform_id"
## [25] "contact_name" "contact_email"
## [27] "contact_laboratory" "contact_department"
## [29] "contact_institute" "contact_address"
## [31] "contact_city" "contact_state"
## [33] "contact_zip/postal_code" "contact_country"
## [35] "supplementary_file" "data_row_count"
## [37] "day:ch1" "gender:ch1"
## [39] "ID:ch1" "sample type:ch1"
## [41] "tissue:ch1"
D1 <- gset[, pheno$`day:ch1` == "1"]
# Phenotype data
pheno <- pData(D1)
# The expression values are stored within 'exprs'. For this dataset the
# expression values have already been normalised using Quantile normalisation,
# but they have not been log transformed. So we will log transform the
# expression values in D1.
exprs(D1) <- log2(exprs(D1))
expVal <- exprs(D1)
# Each probe of the microarray has detailed annotation. To see the specific
# annotation levels we can type fData. Currently the rownames of our expression
# data is the Illumina probe id. This is not convenient for pulling down a gene
# symbol of interest. More gene ID types can be accessed with fData
geneAnno <- fData(D1)
# I wish to shorten the sample types to say bact, flu, mixed, SIRS, HC, As they
# are they are too long to easily display on heatmaps etc.
sampleType <- pheno$`sample type:ch1`
shortNames <- sub("healthy control", "HC", sampleType)
shortNames <- sub("systemic inflammatory response", "SIRS", shortNames)
shortNames <- sub("mixed bacterial and influenza A pneumonia", "mixed", shortNames)
shortNames <- sub("influenza A pneumonia", "flu", shortNames)
shortNames <- sub("bacterial pneumonia", "bact", shortNames)
shortNames
## [1] "SIRS" "bact" "bact" "SIRS" "bact" "bact" "flu" "flu" "bact"
## [10] "flu" "flu" "SIRS" "SIRS" "bact" "HC" "HC" "HC" "HC"
## [19] "HC" "HC" "HC" "HC" "HC" "HC" "HC" "HC" "HC"
## [28] "HC" "HC" "HC" "HC" "HC" "mixed" "SIRS" "SIRS" "SIRS"
## [37] "bact" "bact" "bact" "SIRS" "flu" "flu" "SIRS" "bact" "flu"
## [46] "mixed" "bact" "bact" "bact" "SIRS" "SIRS" "flu" "bact" "bact"
## [55] "SIRS" "mixed" "bact"
# store this phenotype information as a data frame
pheno$group <- shortNames
# Lets take a first look at the dataset. We can perform a PCA analysis.
useGenes <- rowMeans(expVal) > 7
pca <- prcomp(t(expVal[useGenes, ]), scale = TRUE)
autoplot(pca, pheno, colour = "group", size = 2)
autoplot(pca, x = 3, y = 4, pheno, colour = "group", size = 2)
# PCA extension!!! Start again and perform a PCA on the dataset that has up to
# 5 days for each patient
# As we have simplified our dataset our analysis model is fairly simple, just
# comparing between the 2 groups
design <- model.matrix(~0 + group, pheno)
design
## groupbact groupflu groupHC groupmixed groupSIRS
## GSM983405 0 0 0 0 1
## GSM983407 1 0 0 0 0
## GSM983416 1 0 0 0 0
## GSM983418 0 0 0 0 1
## GSM983423 1 0 0 0 0
## GSM983428 1 0 0 0 0
## GSM983429 0 1 0 0 0
## GSM983434 0 1 0 0 0
## GSM983439 1 0 0 0 0
## GSM983441 0 1 0 0 0
## GSM983445 0 1 0 0 0
## GSM983452 0 0 0 0 1
## GSM983455 0 0 0 0 1
## GSM983456 1 0 0 0 0
## GSM983459 0 0 1 0 0
## GSM983461 0 0 1 0 0
## GSM983463 0 0 1 0 0
## GSM983465 0 0 1 0 0
## GSM983467 0 0 1 0 0
## GSM983469 0 0 1 0 0
## GSM983471 0 0 1 0 0
## GSM983473 0 0 1 0 0
## GSM983475 0 0 1 0 0
## GSM983477 0 0 1 0 0
## GSM983479 0 0 1 0 0
## GSM983481 0 0 1 0 0
## GSM983483 0 0 1 0 0
## GSM983485 0 0 1 0 0
## GSM983487 0 0 1 0 0
## GSM983489 0 0 1 0 0
## GSM983491 0 0 1 0 0
## GSM983493 0 0 1 0 0
## GSM983497 0 0 0 1 0
## GSM983501 0 0 0 0 1
## GSM983505 0 0 0 0 1
## GSM983508 0 0 0 0 1
## GSM983510 1 0 0 0 0
## GSM983514 1 0 0 0 0
## GSM983515 1 0 0 0 0
## GSM983518 0 0 0 0 1
## GSM983524 0 1 0 0 0
## GSM983529 0 1 0 0 0
## GSM983534 0 0 0 0 1
## GSM983539 1 0 0 0 0
## GSM983544 0 1 0 0 0
## GSM983547 0 0 0 1 0
## GSM983551 1 0 0 0 0
## GSM983555 1 0 0 0 0
## GSM983557 1 0 0 0 0
## GSM983562 0 0 0 0 1
## GSM983570 0 0 0 0 1
## GSM983574 0 1 0 0 0
## GSM983577 1 0 0 0 0
## GSM983578 1 0 0 0 0
## GSM983583 0 0 0 0 1
## GSM983585 0 0 0 1 0
## GSM983590 1 0 0 0 0
## attr(,"assign")
## [1] 1 1 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
colnames(design) <- sub("group", "", colnames(design))
cont.matrix <- makeContrasts(fluvsbact = flu - bact, levels = design)
cont.matrix
## Contrasts
## Levels fluvsbact
## bact -1
## flu 1
## HC 0
## mixed 0
## SIRS 0
# fit the model
fit <- lmFit(D1, design)
# fit the model for the specified contrasts. For us this is only an option of
# 1, but if we had more than 2 groups here we could define what groups to
# compare
fit2 <- contrasts.fit(fit, cont.matrix)
# Create moderated test statistics using eBayes
efit <- eBayes(fit2)
# Quick look at how many genes are differentially expressed
dt_fdr <- decideTests(efit, adjust.method = "fdr", p.value = 0.05)
summary(dt_fdr)
## fluvsbact
## Down 612
## NotSig 46881
## Up 1261
# How does the number of DE change if we instead use the more stringent
# bonferroni correction?
dt_bonf <- decideTests(efit, adjust.method = "bonferroni", p.value = 0.05)
summary(dt_bonf)
## fluvsbact
## Down 18
## NotSig 48599
## Up 137
# histogram of p value distribution for groupLCL
hist(efit$p.value[, "fluvsbact"])
# volcano plot
volcanoplot(efit, coef = "fluvsbact", highlight = 10, names = geneAnno$`Gene symbol`)
# Generate a table of the genes significantly differentially expressed in
# fluvsbact
DE_fluvsbact <- topTable(efit, coef = "fluvsbact", adjust.method = "fdr", p.value = 0.05,
number = Inf)
# we can write this table to a file
write.csv(DE_fluvsbact, file = "fluvsbact.csv")
# Generate a mean-difference plot (MD-plot)
colnames(efit)
## [1] "fluvsbact"
glMDPlot(efit, coef = 1, status = dt_bonf[, 1], main = colnames(efit)[1], side.main = "UNIQUEID",
counts = expVal, groups = pheno$group, launch = TRUE)
# Get a list of the top 40 most DE genes. We will use this to draw a heatmap
topGenes <- rownames(DE_fluvsbact)[1:40]
# pull out the expression data for these 40 genes
i <- which(rownames(D1) %in% topGenes)
coolmap(expVal[i, ], labCol = pheno$group, labRow = DE_fluvsbact$Gene.symbol, margin = c(8,
6), cexRow = 0.7, cexCol = 0.7)
# Now we can pull down our gene of interest by its gene symbol.
useGene <- rownames(expVal)[which(geneAnno$`Gene symbol` == "IFI27")]
df <- data.frame(pheno, IFI27 = expVal[useGene, ])
ggplot(df, aes(x = group, y = IFI27, colour = group)) + geom_boxplot() + geom_point() +
theme_classic()
You may wish to browse the RNA-seq 1-2-3 protocol published in F1000Research which performs an RNAseq analysis on different cell populations harvested from mice mammary gland tissue. You could work through the code in RStudio step by step for extra practice. Note that their workflow will proceed all the way to differential gene expression testing and pathway analysis which we will not visit until weeks 8 and 9. See https://f1000research.com/articles/5-1408 .
Only this section needs to be included in your report. This report will be due at the end of Week 9, Friday 11:59pm 29th April.
While I expect that you should explore the TCGA melanoma data, I am not opposed to you submiting a report on a dataset that you are incredibly engaged with.
Report purpose
The purpose of this report is to:
Provide you with an opportunity to apply the methods and skills you have been exposed to in class.
Create a document that can aid you when asking for feedback from tutors.
Practice creating reports for your mid-semester exam.
Because of this, do not feel restricted to using the dataset I have provided. Feel free to download any dataset that you find interesting and apply what you have learnt in class.
Report guidelines
There are no hard and fast guidelines to the final content of your submitted lab reports. For this lab you will be assessed on your ability to generate statistical questions, explore these with graphical summaries and interpret your findings. Your report will also need to be well-presented:
It is expected that your report will construct and communicate an interesting story in 4 - 6 paragraphs (ish). To do this, you should be a ‘bad’ scientist and explore the data until you find something that you think is interesting, or, can use to address the marking criteria. When preparing your report always think “is your report something that you would be proud to show your friends?”, “would your family be interested in the conclusions you made?” and “would they find it easy to read?”
Marking criteria
Lab instructions
There are at least 200 forms of cancer, and many more subtypes. Each of these is caused by errors in DNA that cause cells to grow uncontrolled. Identifying the changes in each cancer’s complete set of DNA - its genome - and understanding how such changes interact to drive the disease will lay the foundation for improving cancer prevention, early detection and treatment.
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), has generated comprehensive, multi-dimensional maps of the key genomic changes in 33 types of cancer. The TCGA dataset, 2.5 petabytes of data describing tumor tissue and matched normal tissues from more than 11,000 patients, is publically available and has been used widely by the research community. The data have contributed to more than a thousand studies of cancer by independent researchers and to the TCGA research network publications.
This data can be downloaded from the data portal in TCGA (https://portal.gdc.cancer.gov/) but is in an aweful format. Thankfully, the team at Recount2 https://jhubiostatistics.shinyapps.io/recount/ have reprocessed all of the TCGA data. In the following, I have code to download the TCGA melanoma data. Many of these code lines only need to be run once.
Read in the data and experiment with normalization.
Can you see any structure in a PCA?
Is this structure explained by any variables in the clinical data?
Let’s start by reading in the data.
### Install recount from Bioconductor, do this once only!!!!
### install.packages('BiocManager') BiocManager::install('recount')
library(recount)
library(dplyr)
### Browse the vignetets for a quick description of how to use the package
### browseVignettes('recount')
### Download the RangedSummarizedExperiment object at the gene level for TCGA
### melanoma data. url <-
### 'http://duffel.rail.bio/recount/v2/TCGA/rse_gene_skin.Rdata' View the url
### for the file by printing the object url url Download the file
### download.file(url = url, destfile = 'rse_gene_skin.Rdata', method = 'auto',
### mode = 'wb')
### Load the data
load("rse_gene_skin.Rdata")
### rse_gene is a RangesSummarizedExperiment
rse_gene
### Gene counts
counts <- assay(rse_gene)
### Gene identifiers
genes <- rowData(rse_gene)
### Phenotype data for the samples
samples <- as.data.frame(colData(rse_gene))
### There are 864 variables in samples, 593 are completely missing
samples %>%
is.na() %>%
colSums() %>%
table()