Aims:
Perform a differential expression analysis.
Correct for multiple hypothesis testing.
Visualise results.
Please make an effort to complete your own report by typing your own code and (most importantly) expressing your own conclusions and interpretations. Try to use your expanding knowledge of R to modify this example code in places where you think it can be altered/improved. Ultimately, this workflow will go from obtaining data from GEO through to pathway analysis, however we will complete the workflow in 3 stages. In week 7 we will focus on obtaining the data and getting through to PCA analysis. In week 8 we will perfom differential gene expression analysis and some visualisations, and in week 9 we will get to the pathway analysis. The workflow uses popular bioconductor packages and is adapted from the published RNAseq 1-2-3 workflow (https://f1000research.com/articles/5-1408).
The following data comes from a manuscript “Evidence from genome wide association studies implicates reduced control of Epstein-Barr virus infection in multiple sclerosis susceptibility” (https://www.ncbi.nlm.nih.gov/pubmed/31039804). There were multiple goals for the RNAseq component of project:
Identify the gene expression changes that occur when B cells are infected with EBV and transformed into LCLs.
Determine which pathways are over-represented in the differentially expressed genes.
Assess if there are more MS risk genes then expected by chance in the differentially expressed genes.
Use this dataset as an opportunity to explore the behaviour of a small RNA-Seq experiment. In this lab we will start by just processing and exploring the data.
IMPORTANT: The code from week 8 continues on from the code generated in week 7. The new steps introduced this week start from step 5.
1A. Install / load required R packages
library(tidyverse)
library(ggfortify)
library(RColorBrewer)
# If any packages of these required are not installed, they can be installed by
# first installing bioconductor: if (!requireNamespace('BiocManager', quietly =
# TRUE)) install.packages('BiocManager') then using
# BiocManager::install('PACKAGENAME')
# load packages to be used in the analysis
library(GEOquery)
library(org.Hs.eg.db)
library(limma)
library(edgeR)
library(Glimma)
library(Homo.sapiens)
library(gplots)
library(dplyr)
1B. Read in the data.
# The count data for this GEO submission is available as a supplementary file on
# GEO. You only want to run this once...
sfiles <- getGEOSuppFiles("GSE126379")
fnames <- rownames(sfiles)
# There is only one supplemental file
Data = read.delim(fnames[1], header = TRUE)
head(Data)
## Chromosome Gene.Name Gene.ID Sum.of.Exon.Length.of.Gene
## 1 chr10 A1CF A1CF 9529
## 2 chr10 ABCC2 ABCC2 5051
## 3 chr10 ABI1 ABI1 3776
## 4 chr10 ABLIM1 ABLIM1 8303
## 5 chr10 ACADSB ACADSB 5941
## 6 chr10 ACBD5 ACBD5 4066
## Reads.Counts..BLCL_A1. RPKM..BLCL_A1. Reads.Counts..BLCL_A2. RPKM..BLCL_A2.
## 1 2 0.0113048 1 0.009967274
## 2 21 0.2239352 42 0.789760929
## 3 1975 28.1718559 1436 36.119869850
## 4 786 5.0987985 2619 29.958784090
## 5 756 6.8539761 543 8.680884936
## 6 535 7.0870705 421 9.834186588
## Reads.Counts..BLCL_B1. RPKM..BLCL_B1. Reads.Counts..BLCL_B2. RPKM..BLCL_B2.
## 1 1 0.01277064 4 0.02185651
## 2 10 0.24092549 89 0.91744656
## 3 563 18.14414625 2307 31.81147083
## 4 220 3.22439153 4464 27.99350563
## 5 210 4.30149944 877 7.68613807
## 6 200 5.98580756 674 8.63099098
## Reads.Counts..BLCL_C1. RPKM..BLCL_C1. Reads.Counts..BLCL_C2. RPKM..BLCL_C2.
## 1 0 0.0000000 11 0.122567
## 2 13 0.2041644 63 1.324315
## 3 695 14.6004734 1235 34.726663
## 4 400 3.8215458 2600 33.248066
## 5 264 3.5249949 568 10.151188
## 6 304 5.9308963 492 12.847712
## Reads.Counts..BLCL_D1. RPKM..BLCL_D1. Reads.Counts..BLCL_D2. RPKM..BLCL_D2.
## 1 2 0.02147553 10 0.04834247
## 2 25 0.50643522 163 1.48657342
## 3 847 22.95158503 2806 34.23196323
## 4 626 7.71437158 5804 32.20093570
## 5 272 4.68457946 1065 8.25783462
## 6 250 6.29120581 1139 12.90424131
## Reads.Counts..BLCL_E1. RPKM..BLCL_E1. Reads.Counts..BLCL_E2. RPKM..BLCL_E2.
## 1 1 0.01387007 7 0.04338079
## 2 54 1.41300065 130 1.51989185
## 3 986 34.51206142 2471 38.64447949
## 4 177 2.81750179 4484 31.89167049
## 5 302 6.71851610 1039 10.32767820
## 6 313 10.17426241 932 13.53615256
# Next time you can just run... fnames <-
# 'GSE126379/GSE126379_BLCL_processed.txt.gz' Data =
# read.delim(fnames[1],header=TRUE) head(Data)
# MANIPULATE GENE ANNOTATIONS
# Genes have multiple 'identifiers' in different databases. Later on we will need
# identifiers from the ENTREZ database. To save us some pain later, lets extract
# these now using the select() function from the AnnotationDbi package. The
# ENTREZ identifiers can be found in the Homo.sapiens package.
ENTREZID <- mapIds(Homo.sapiens, keys = as.character(Data$Gene.Name), column = c("ENTREZID"),
keytype = "SYMBOL", multiVals = "first")
## 'select()' returned 1:many mapping between keys and columns
genes <- data.frame(ENTREZID, SYMBOL = Data$Gene.Name)
# In this file of counts, there are multiple instances where the same gene symbol
# or entrez id appears in multiple rows. This may be due to how it was initially
# generated but as we are not able to distinguish between them (and in most cases
# the count information is very similar for the duplicate rows, we will just ask
# for just one row to be kept for any given gene name)
chooseGenes <- which(!duplicated(genes$SYMBOL) & !duplicated(genes$ENTREZID))
genes <- genes[chooseGenes, ]
Data <- Data[chooseGenes, ]
# MANIPULATE GENE COUNTS
# We do not need the RPKMs just the read counts, so lets just pull out these
# columns
colnames(Data)
## [1] "Chromosome" "Gene.Name"
## [3] "Gene.ID" "Sum.of.Exon.Length.of.Gene"
## [5] "Reads.Counts..BLCL_A1." "RPKM..BLCL_A1."
## [7] "Reads.Counts..BLCL_A2." "RPKM..BLCL_A2."
## [9] "Reads.Counts..BLCL_B1." "RPKM..BLCL_B1."
## [11] "Reads.Counts..BLCL_B2." "RPKM..BLCL_B2."
## [13] "Reads.Counts..BLCL_C1." "RPKM..BLCL_C1."
## [15] "Reads.Counts..BLCL_C2." "RPKM..BLCL_C2."
## [17] "Reads.Counts..BLCL_D1." "RPKM..BLCL_D1."
## [19] "Reads.Counts..BLCL_D2." "RPKM..BLCL_D2."
## [21] "Reads.Counts..BLCL_E1." "RPKM..BLCL_E1."
## [23] "Reads.Counts..BLCL_E2." "RPKM..BLCL_E2."
counts <- Data[, grep("Counts", colnames(Data))]
# LCL are labelled 1, B cell labelled 2. Lets rename the columns to reflect this
colnames(counts) <- sub("Reads.Counts..BLCL_", "", colnames(counts))
colnames(counts) <- sub("1", "LCL", colnames(counts))
colnames(counts) <- sub("2", "CD19B", colnames(counts))
colnames(counts) <- sub("CD19B.", "_CD19B", colnames(counts))
colnames(counts) <- sub("LCL.", "_LCL", colnames(counts))
colnames(counts)
## [1] "A_LCL" "A_CD19B" "B_LCL" "B_CD19B" "C_LCL" "C_CD19B" "D_LCL"
## [8] "D_CD19B" "E_LCL" "E_CD19B"
# make the Gene IDs the name of the rows
rownames(counts) <- genes$SYMBOL
# MANIPULATE SAMPLE ANNOTATIONS
# This is a function that will be in the next release of tidyverse. We can use
# it to split a string on a certain pattern and then select things that are
# before or after than split
str_split_n <- function(string, pattern, n) {
out <- str_split(string, pattern, simplify = TRUE)
apply(out, 1, `[`, i = n)
}
# use sample names to make factors for group and individual
group <- str_split_n(colnames(counts), "_", 2) %>% factor()
individual <- str_split_n(colnames(counts), "_", 1) %>% factor()
# put this factor information into a data frame
samples <- data.frame(group, individual)
samples
## group individual
## 1 LCL A
## 2 CD19B A
## 3 LCL B
## 4 CD19B B
## 5 LCL C
## 6 CD19B C
## 7 LCL D
## 8 CD19B D
## 9 LCL E
## 10 CD19B E
# MAKE DGEList
# make DGEList object to connect gene counts with sample information
BLCL <- DGEList(counts = counts, samples = samples, genes = genes)
BLCL
## An object of class "DGEList"
## $counts
## A_LCL A_CD19B B_LCL B_CD19B C_LCL C_CD19B D_LCL D_CD19B E_LCL E_CD19B
## A1CF 2 1 1 4 0 11 2 10 1 7
## ABCC2 21 42 10 89 13 63 25 163 54 130
## ABI1 1975 1436 563 2307 695 1235 847 2806 986 2471
## ABLIM1 786 2619 220 4464 400 2600 626 5804 177 4484
## ACADSB 756 543 210 877 264 568 272 1065 302 1039
## 19920 more rows ...
##
## $samples
## group lib.size norm.factors individual
## A_LCL LCL 17028144 1 A
## A_CD19B CD19B 9577596 1 A
## B_LCL LCL 7531842 1 B
## B_CD19B CD19B 17510653 1 B
## C_LCL LCL 11548555 1 C
## C_CD19B CD19B 8503027 1 C
## D_LCL LCL 8927495 1 D
## D_CD19B CD19B 19627791 1 D
## E_LCL LCL 6724158 1 E
## E_CD19B CD19B 15276274 1 E
##
## $genes
## ENTREZID SYMBOL
## A1CF 29974 A1CF
## ABCC2 1244 ABCC2
## ABI1 10006 ABI1
## ABLIM1 3983 ABLIM1
## ACADSB 36 ACADSB
## 19920 more rows ...
# COUNTS NORMALIZATION
# First we calculate normalization factors
BLCL <- calcNormFactors(BLCL, method = "TMM")
# make a copy of BLCLDGEList to use in extension tab.
BLCL2 <- BLCL
# show the nomalisation factors. For this dataset the effect of TMM-normalisation
# is mild, as evident in the magnitude of the scaling factors, which are all
# relatively close to 1.
BLCL$samples$norm.factors
## [1] 1.0140834 1.0134316 0.9123651 0.9568436 0.9213837 1.1034365 0.8930159
## [8] 1.0526869 1.1040319 1.0563174
# Next we will keep genes with large enough counts to be considered useful.
keep <- filterByExpr(BLCL)
BLCL <- BLCL[keep, ]
# Calculate the counts per million reads for each gene for all samples
cpm <- cpm(BLCL)
# now calculate the log of these CPM values
lcpm <- cpm(BLCL, log = TRUE)
# Perform PCA
pca <- prcomp(t(lcpm), scale = TRUE)
# Plot results
autoplot(pca, data = samples, colour = "group", shape = "individual")
# Another method for dimension reduction is called MDS.
plotMDS(lcpm, labels = group)
plotMDS(lcpm, labels = individual)
# I also like this cool interactive version of a MDS plot which can be generated
# using Glimma
glMDSPlot(lcpm, labels = paste(group, individual, sep = "_"), groups = BLCL$samples[,
c("group", "individual")], launch = TRUE)
limma
is a versatile package for DE analysis of microarray data. However, the count based data from RNAseq has certain stastical properties that are different to microarray based data, namely that in RNAseq count data, the variance is not independent of the mean. The authors of limma
delevoped the voom
function which accounts for this and modifies the RNAseq count data for use with the limma
package.
# We first need to set up our design matrix, where we tell limma that we are
# interested in fitting the gene expression data against group, but also to
# include the 'individual' term, as this will account for the samples being
# paired.
design <- model.matrix(~individual + group)
design
## (Intercept) individualB individualC individualD individualE groupLCL
## 1 1 0 0 0 0 1
## 2 1 0 0 0 0 0
## 3 1 1 0 0 0 1
## 4 1 1 0 0 0 0
## 5 1 0 1 0 0 1
## 6 1 0 1 0 0 0
## 7 1 0 0 1 0 1
## 8 1 0 0 1 0 0
## 9 1 0 0 0 1 1
## 10 1 0 0 0 1 0
## attr(,"assign")
## [1] 0 1 1 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$individual
## [1] "contr.treatment"
##
## attr(,"contrasts")$group
## [1] "contr.treatment"
# Voom starts with the count data, transforms it into logCPM, estimates the
# mean-variance relationship, and uses this to compute appropriate
# observation-level weights. The data can then be used for linear modelling in
# the same way as microarray data. voom generates an EList object, which is
# similar to the DGEList-object
v <- voom(BLCL, design, plot = TRUE)
# The voom plot shows that whilst not strong, there is some relationship between
# variance and expression level in this data.
v
## An object of class "EList"
## $genes
## ENTREZID SYMBOL
## ABCC2 1244 ABCC2
## ABI1 10006 ABI1
## ABLIM1 3983 ABLIM1
## ACADSB 36 ACADSB
## ACBD5 91452 ACBD5
## 12170 more rows ...
##
## $targets
## group lib.size norm.factors individual
## A_LCL LCL 17267958 1.0140834 A
## A_CD19B CD19B 9706239 1.0134316 A
## B_LCL LCL 6871789 0.9123651 B
## B_CD19B CD19B 16754956 0.9568436 B
## C_LCL LCL 10640650 0.9213837 C
## C_CD19B CD19B 9382550 1.1034365 C
## D_LCL LCL 7972395 0.8930159 D
## D_CD19B CD19B 20661918 1.0526869 D
## E_LCL LCL 7423685 1.1040319 E
## E_CD19B CD19B 16136594 1.0563174 E
##
## $E
## A_LCL A_CD19B B_LCL B_CD19B C_LCL C_CD19B D_LCL
## ABCC2 0.3162391 2.130478 0.6116314 2.417300 0.343373 2.758704 1.677412
## ABI1 6.8379765 7.209430 6.3575858 7.105599 6.030392 7.040899 6.732056
## ABLIM1 5.5092773 8.076163 5.0039488 8.057767 5.234144 8.114593 6.296157
## ACADSB 5.4531706 5.807224 4.9369904 5.710739 4.635609 5.921036 5.095099
## ACBD5 4.9547171 5.440476 4.8667724 5.331159 4.838784 5.714000 4.973653
## D_CD19B E_LCL E_CD19B
## ABCC2 2.984244 2.876049 3.015642
## ABI1 7.085654 7.054040 7.258907
## ABLIM1 8.134054 4.579540 8.118467
## ACADSB 5.688413 5.348656 6.009410
## ACBD5 5.785283 5.400186 5.852696
## 12170 more rows ...
##
## $weights
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 2.331025 2.979675 1.652737 4.488719 2.053346 3.313986 2.324869 6.708812
## [2,] 9.358623 9.575208 9.816939 9.169044 10.018719 9.898932 10.011455 8.613684
## [3,] 9.983437 8.720541 7.722519 8.095354 9.196385 8.921615 9.370692 7.169613
## [4,] 9.908408 9.840943 7.327472 10.007666 8.542400 9.442735 7.924945 9.842946
## [5,] 9.590230 9.318834 6.969832 9.963088 8.654301 9.360818 8.021923 9.885127
## [,9] [,10]
## [1,] 2.802607 7.258436
## [2,] 10.004491 8.747525
## [3,] 7.572132 8.368332
## [4,] 8.307608 9.884701
## [5,] 8.318522 9.944736
## 12170 more rows ...
##
## $design
## (Intercept) individualB individualC individualD individualE groupLCL
## 1 1 0 0 0 0 1
## 2 1 0 0 0 0 0
## 3 1 1 0 0 0 1
## 4 1 1 0 0 0 0
## 5 1 0 1 0 0 1
## 6 1 0 1 0 0 0
## 7 1 0 0 1 0 1
## 8 1 0 0 1 0 0
## 9 1 0 0 0 1 1
## 10 1 0 0 0 1 0
## attr(,"assign")
## [1] 0 1 1 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$individual
## [1] "contr.treatment"
##
## attr(,"contrasts")$group
## [1] "contr.treatment"
# Linear modelling in limma is carried out using the 'lmFit' and 'contrasts.fit'
# functions. These functions fit a separate model to the expression values for
# each gene. We now tell limma to fit the model specified in our design matrix
# for the voom transformed data
fit <- lmFit(v, design)
# Next, empirical Bayes moderation is carried out by borrowing information across
# all genes to obtain more precise estimates of gene-wise variability.
efit <- eBayes(fit)
# Quick look at how many genes are differentially expressed
dt_fdr <- decideTests(efit, adjust.method = "fdr", p.value = 0.05)
summary(dt_fdr)
## (Intercept) individualB individualC individualD individualE groupLCL
## Down 354 1 0 1 10 4220
## NotSig 947 12173 12169 12172 12156 3613
## Up 10874 1 6 2 9 4342
# The coefficient we are interested in is 'groupLCL'. Does the number of DE
# genes change if we instead use the more stringent bonferroni correction?
dt_bonf <- decideTests(efit, adjust.method = "bonferroni", p.value = 0.05)
summary(dt_bonf)
## (Intercept) individualB individualC individualD individualE groupLCL
## Down 34 1 0 1 5 799
## NotSig 3114 12174 12173 12173 12168 10140
## Up 9027 0 2 1 2 1236
# histogram of p value distribution for groupLCL colnames(efit)
hist(efit$p.value[, "groupLCL"])
# volcano plot of results for groupLCL
volcanoplot(efit, coef = "groupLCL", highlight = 10, names = efit$genes$SYMBOL)
# Generate a table of the genes significantly differentially expressed in LCLs v
# B cells (groupLCL)
DELCLvB <- topTable(efit, coef = "groupLCL", adjust.method = "fdr", p.value = 0.05,
number = Inf)
# we can write this table to a file
write.csv(DELCLvB, file = "DELCLvB.csv")
# Generate a mean-difference plot (MD-plot) for the group comparison (groupLCL,
# which is column 6 of the efit object)
useCoef <- which(colnames(efit) == "groupLCL")
plotMD(efit, column = useCoef, status = dt_fdr[, useCoef], main = colnames(efit)[useCoef],
)
# we can also generate an interactive MDplot using Glimma
glMDPlot(efit, coef = useCoef, status = dt_fdr[, useCoef], main = colnames(efit)[useCoef],
side.main = "ENTREZID", counts = lcpm, groups = group, launch = TRUE)
# Get a list of the top 40 most DE genes. We will use this to draw a heatmap
LCLvB.topgenes <- DELCLvB$SYMBOL[1:40]
# pull out the voom expression data for these 40 genes
i <- which(v$genes$SYMBOL %in% LCLvB.topgenes)
coolmap(lcpm[i, ], margin = c(8, 6), cexRow = 0.7, cexCol = 0.7)
# You could save this with the pdf() function.
pdf("heatmap.pdf", width = 12, height = 16)
coolmap(lcpm[i, ], margin = c(8, 6))
dev.off()
## png
## 2
# You might like to look at the library pheatmap
CD40 <- lcpm["CD40", ]
df <- data.frame(CD40, group, individual)
ggplot(df, aes(x = group, y = CD40, colour = individual)) + geom_point() + geom_line(aes(group = individual)) +
theme_classic()
The following data comes from a manuscript “A distinct influenza infection signature in the blood transcriptome of patients with severe community-acquired pneumonia”.https://www.ncbi.nlm.nih.gov/pubmed/22898401 .
Diagnosis of severe influenza pneumonia remains challenging because of a lack of correlation between the presence of influenza virus and clinical status. Gene-expression profiling was conducted on the whole blood of critically ill patients to identify a gene signature that would allow clinicians to distinguish influenza infection from other causes of severe respiratory failure, such as bacterial pneumonia, and noninfective systemic inflammatory response syndrome. Samples were assayed on Illumina HT-12 gene-expression beadarrays.
# Retrieve dataset from GEO. getGEO gives you a list of expressionsets, as we
# only have one, we might as well just pull out that one (if (length(gset) > 1)
# idx <- grep('GPL6947', attr(gset, 'names')) else idx <- 1 | gset <-
# gset[[idx]])
dir.create("GSE40012")
## Warning in dir.create("GSE40012"): 'GSE40012' already exists
gset <- getGEO("GSE40012", GSEMatrix = TRUE, AnnotGPL = TRUE, destdir = "GSE40012")[[1]]
# Have a look at the summary info for the ExpressionSet
gset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 48754 features, 190 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM983403 GSM983404 ... GSM983592 (190 total)
## varLabels: title geo_accession ... tissue:ch1 (41 total)
## varMetadata: labelDescription
## featureData
## featureNames: ILMN_1343291 ILMN_1343295 ... ILMN_2416019 (48754
## total)
## fvarLabels: ID Gene title ... Platform_SEQUENCE (22 total)
## fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: 22898401
## Annotation: GPL6947
# This dataset has many variable labels (including phenotype information). The
# phenotype data in an expressionset is stored in pData. Let's use the phenotype
# data to pull out just the day 1 samples
pheno <- pData(gset)
colnames(pData(gset))
## [1] "title" "geo_accession"
## [3] "status" "submission_date"
## [5] "last_update_date" "type"
## [7] "channel_count" "source_name_ch1"
## [9] "organism_ch1" "characteristics_ch1"
## [11] "characteristics_ch1.1" "characteristics_ch1.2"
## [13] "characteristics_ch1.3" "characteristics_ch1.4"
## [15] "molecule_ch1" "extract_protocol_ch1"
## [17] "label_ch1" "label_protocol_ch1"
## [19] "taxid_ch1" "hyb_protocol"
## [21] "scan_protocol" "description"
## [23] "data_processing" "platform_id"
## [25] "contact_name" "contact_email"
## [27] "contact_laboratory" "contact_department"
## [29] "contact_institute" "contact_address"
## [31] "contact_city" "contact_state"
## [33] "contact_zip/postal_code" "contact_country"
## [35] "supplementary_file" "data_row_count"
## [37] "day:ch1" "gender:ch1"
## [39] "ID:ch1" "sample type:ch1"
## [41] "tissue:ch1"
D1 <- gset[, pheno$`day:ch1` == "1"]
# Phenotype data
pheno <- pData(D1)
# The expression values are stored within 'exprs'. For this dataset the
# expression values have already been normalised using Quantile normalisation,
# but they have not been log transformed. So we will log transform the expression
# values in D1.
exprs(D1) <- log2(exprs(D1))
expVal <- exprs(D1)
# Each probe of the microarray has detailed annotation. To see the specific
# annotation levels we can type fData. Currently the rownames of our expression
# data is the Illumina probe id. This is not convenient for pulling down a gene
# symbol of interest. More gene ID types can be accessed with fData
geneAnno <- fData(D1)
# I wish to shorten the sample types to say bact, flu, mixed, SIRS, HC, As they
# are they are too long to easily display on heatmaps etc.
sampleType <- pheno$`sample type:ch1`
shortNames <- sub("healthy control", "HC", sampleType)
shortNames <- sub("systemic inflammatory response", "SIRS", shortNames)
shortNames <- sub("mixed bacterial and influenza A pneumonia", "mixed", shortNames)
shortNames <- sub("influenza A pneumonia", "flu", shortNames)
shortNames <- sub("bacterial pneumonia", "bact", shortNames)
shortNames
## [1] "SIRS" "bact" "bact" "SIRS" "bact" "bact" "flu" "flu" "bact"
## [10] "flu" "flu" "SIRS" "SIRS" "bact" "HC" "HC" "HC" "HC"
## [19] "HC" "HC" "HC" "HC" "HC" "HC" "HC" "HC" "HC"
## [28] "HC" "HC" "HC" "HC" "HC" "mixed" "SIRS" "SIRS" "SIRS"
## [37] "bact" "bact" "bact" "SIRS" "flu" "flu" "SIRS" "bact" "flu"
## [46] "mixed" "bact" "bact" "bact" "SIRS" "SIRS" "flu" "bact" "bact"
## [55] "SIRS" "mixed" "bact"
# store this phenotype information as a data frame
pheno$group <- shortNames
# Lets take a first look at the dataset. We can perform a PCA analysis.
useGenes <- rowMeans(expVal) > 7
pca <- prcomp(t(expVal[useGenes, ]), scale = TRUE)
autoplot(pca, pheno, colour = "group", size = 2)
autoplot(pca, x = 3, y = 4, pheno, colour = "group", size = 2)
# PCA extension!!! Start again and perform a PCA on the dataset that has up to 5
# days for each patient
# As we have simplified our dataset our analysis model is fairly simple, just
# comparing between the 2 groups
design <- model.matrix(~0 + group, pheno)
design
## groupbact groupflu groupHC groupmixed groupSIRS
## GSM983405 0 0 0 0 1
## GSM983407 1 0 0 0 0
## GSM983416 1 0 0 0 0
## GSM983418 0 0 0 0 1
## GSM983423 1 0 0 0 0
## GSM983428 1 0 0 0 0
## GSM983429 0 1 0 0 0
## GSM983434 0 1 0 0 0
## GSM983439 1 0 0 0 0
## GSM983441 0 1 0 0 0
## GSM983445 0 1 0 0 0
## GSM983452 0 0 0 0 1
## GSM983455 0 0 0 0 1
## GSM983456 1 0 0 0 0
## GSM983459 0 0 1 0 0
## GSM983461 0 0 1 0 0
## GSM983463 0 0 1 0 0
## GSM983465 0 0 1 0 0
## GSM983467 0 0 1 0 0
## GSM983469 0 0 1 0 0
## GSM983471 0 0 1 0 0
## GSM983473 0 0 1 0 0
## GSM983475 0 0 1 0 0
## GSM983477 0 0 1 0 0
## GSM983479 0 0 1 0 0
## GSM983481 0 0 1 0 0
## GSM983483 0 0 1 0 0
## GSM983485 0 0 1 0 0
## GSM983487 0 0 1 0 0
## GSM983489 0 0 1 0 0
## GSM983491 0 0 1 0 0
## GSM983493 0 0 1 0 0
## GSM983497 0 0 0 1 0
## GSM983501 0 0 0 0 1
## GSM983505 0 0 0 0 1
## GSM983508 0 0 0 0 1
## GSM983510 1 0 0 0 0
## GSM983514 1 0 0 0 0
## GSM983515 1 0 0 0 0
## GSM983518 0 0 0 0 1
## GSM983524 0 1 0 0 0
## GSM983529 0 1 0 0 0
## GSM983534 0 0 0 0 1
## GSM983539 1 0 0 0 0
## GSM983544 0 1 0 0 0
## GSM983547 0 0 0 1 0
## GSM983551 1 0 0 0 0
## GSM983555 1 0 0 0 0
## GSM983557 1 0 0 0 0
## GSM983562 0 0 0 0 1
## GSM983570 0 0 0 0 1
## GSM983574 0 1 0 0 0
## GSM983577 1 0 0 0 0
## GSM983578 1 0 0 0 0
## GSM983583 0 0 0 0 1
## GSM983585 0 0 0 1 0
## GSM983590 1 0 0 0 0
## attr(,"assign")
## [1] 1 1 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
colnames(design) <- sub("group", "", colnames(design))
cont.matrix <- makeContrasts(fluvsbact = flu - bact, levels = design)
cont.matrix
## Contrasts
## Levels fluvsbact
## bact -1
## flu 1
## HC 0
## mixed 0
## SIRS 0
# fit the model
fit <- lmFit(D1, design)
# fit the model for the specified contrasts. For us this is only an option of 1,
# but if we had more than 2 groups here we could define what groups to compare
fit2 <- contrasts.fit(fit, cont.matrix)
# Create moderated test statistics using eBayes
efit <- eBayes(fit2)
# Quick look at how many genes are differentially expressed
dt_fdr <- decideTests(efit, adjust.method = "fdr", p.value = 0.05)
summary(dt_fdr)
## fluvsbact
## Down 612
## NotSig 46881
## Up 1261
# How does the number of DE change if we instead use the more stringent
# bonferroni correction?
dt_bonf <- decideTests(efit, adjust.method = "bonferroni", p.value = 0.05)
summary(dt_bonf)
## fluvsbact
## Down 18
## NotSig 48599
## Up 137
# histogram of p value distribution for groupLCL
hist(efit$p.value[, "fluvsbact"])
# volcano plot
volcanoplot(efit, coef = "fluvsbact", highlight = 10, names = geneAnno$`Gene symbol`)
# Generate a table of the genes significantly differentially expressed in
# fluvsbact
DE_fluvsbact <- topTable(efit, coef = "fluvsbact", adjust.method = "fdr", p.value = 0.05,
number = Inf)
# we can write this table to a file
write.csv(DE_fluvsbact, file = "fluvsbact.csv")
# Generate a mean-difference plot (MD-plot)
colnames(efit)
## [1] "fluvsbact"
glMDPlot(efit, coef = 1, status = dt_bonf[, 1], main = colnames(efit)[1], side.main = "UNIQUEID",
counts = expVal, groups = pheno$group, launch = TRUE)
# Get a list of the top 40 most DE genes. We will use this to draw a heatmap
topGenes <- rownames(DE_fluvsbact)[1:40]
# pull out the expression data for these 40 genes
i <- which(rownames(D1) %in% topGenes)
coolmap(expVal[i, ], labCol = pheno$group, labRow = DE_fluvsbact$Gene.symbol, margin = c(8,
6), cexRow = 0.7, cexCol = 0.7)
# Now we can pull down our gene of interest by its gene symbol.
useGene <- rownames(expVal)[which(geneAnno$`Gene symbol` == "IFI27")]
df <- data.frame(pheno, IFI27 = expVal[useGene, ])
ggplot(df, aes(x = group, y = IFI27, colour = group)) + geom_boxplot() + geom_point() +
theme_classic()
You may wish to browse the RNA-seq 1-2-3 protocol published in F1000Research which performs an RNAseq analysis on different cell populations harvested from mice mammary gland tissue. You could work through the code in RStudio step by step for extra practice. Note that their workflow will proceed all the way to differential gene expression testing and pathway analysis which we will not visit until weeks 8 and 9. See https://f1000research.com/articles/5-1408 .
Only this section needs to be included in your report. This report will be due at the end of Week 9, Friday 11:59pm 7th May.
While I expect that you should explore the TCGA melanoma data, I am not opposed to you submiting a report on a dataset that you are incredibly engaged with.
Report purpose
The purpose of this report is to:
Provide you with an opportunity to apply the methods and skills you have been exposed to in class.
Create a document that can aid you when asking for feedback from tutors.
Practice creating reports for your mid-semester exam.
Because of this, do not feel restricted to using the dataset I have provided. Feel free to download any dataset that you find interesting and apply what you have learnt in class.
Report guidelines
There are no hard and fast guidelines to the final content of your submitted lab reports. For this lab you will be assessed on your ability to generate statistical questions, explore these with graphical summaries and interpret your findings. Your report will also need to be well-presented:
It is expected that your report will construct and communicate an interesting story in 4 - 6 paragraphs (ish). To do this, you should be a ‘bad’ scientist and explore the data until you find something that you think is interesting, or, can use to address the marking criteria. When preparing your report always think “is your report something that you would be proud to show your friends?”, “would your family be interested in the conclusions you made?” and “would they find it easy to read?”
Marking criteria
Lab instructions
There are at least 200 forms of cancer, and many more subtypes. Each of these is caused by errors in DNA that cause cells to grow uncontrolled. Identifying the changes in each cancer’s complete set of DNA - its genome - and understanding how such changes interact to drive the disease will lay the foundation for improving cancer prevention, early detection and treatment.
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), has generated comprehensive, multi-dimensional maps of the key genomic changes in 33 types of cancer. The TCGA dataset, 2.5 petabytes of data describing tumor tissue and matched normal tissues from more than 11,000 patients, is publically available and has been used widely by the research community. The data have contributed to more than a thousand studies of cancer by independent researchers and to the TCGA research network publications.
This data can be downloaded from the data portal in TCGA (https://portal.gdc.cancer.gov/) but is in an aweful format. Thankfully, the team at Recount2 https://jhubiostatistics.shinyapps.io/recount/ have reprocessed all of the TCGA data. In the following, I have code to download the TCGA melanoma data. Many of these code lines only need to be run once.
Read in the data and experiment with normalization.
Can you see any structure in a PCA?
Is this structure explained by any variables in the clinical data?
Let’s start by reading in the data.
### Install recount from Bioconductor, do this once only!!!!
### install.packages('BiocManager') BiocManager::install('recount')
library(recount)
library(dplyr)
### Browse the vignetets for a quick description of how to use the package
### browseVignettes('recount')
### Download the RangedSummarizedExperiment object at the gene level for TCGA
### melanoma data. url <-
### 'http://duffel.rail.bio/recount/v2/TCGA/rse_gene_skin.Rdata' View the url for
### the file by printing the object url url Download the file download.file(url =
### url, destfile = 'rse_gene_skin.Rdata', method = 'auto', mode = 'wb')
### Load the data
load("rse_gene_skin.Rdata")
### rse_gene is a RangesSummarizedExperiment
rse_gene
### Gene counts
counts <- assay(rse_gene)
### Gene identifiers
genes <- rowData(rse_gene)
### Phenotype data for the samples
samples <- as.data.frame(colData(rse_gene))
### There are 864 variables in samples, 593 are completely missing
samples %>% is.na() %>% colSums() %>% table()