Please make an effort to complete your own report by typing your own code and (most importantly) expressing your own conclusions and interpretations. Try to use your expanding knowledge of R to modify this example code in places where you think it can be altered/improved. Ultimately, this workflow will go from obtaining data from GEO through to pathway analysis, however we will complete the workflow in 3 stages. In week 7 we will focus on obtaining the data and getting through to PCA analysis. In week 8 we will perfom differential gene expression analysis and some visualisations, and in week 9 we will get to the pathway analysis. The workflow uses popular bioconductor packages and is adapted from the published RNAseq 1-2-3 workflow (https://f1000research.com/articles/5-1408).
The following data comes from a manuscript “Evidence from genome wide association studies implicates reduced control of Epstein-Barr virus infection in multiple sclerosis susceptibility” (https://www.ncbi.nlm.nih.gov/pubmed/31039804). There were multiple goals for the RNAseq component of project:
Identify the gene expression changes that occur when B cells are infected with EBV and transformed into LCLs.
Determine which pathways are over-represented in the differentially expressed genes.
Assess if there are more MS risk genes then expected by chance in the differentially expressed genes.
Use this dataset as an opportunity to explore the behaviour of a small RNA-Seq experiment. In this lab we will start by just processing and exploring the data.
library(tidyverse)
library(ggfortify)
library(RColorBrewer)
# If any packages of these required are not installed, they can be installed by
# first installing bioconductor: if (!requireNamespace('BiocManager', quietly =
# TRUE)) install.packages('BiocManager') then using
# BiocManager::install('PACKAGENAME')
# load packages to be used in the analysis
library(GEOquery)
library(org.Hs.eg.db)
library(limma)
library(edgeR)
library(Glimma)
library(Homo.sapiens)
# The count data for this GEO submission is available as a supplementary file
# on GEO. You only want to run this once...
sfiles <- getGEOSuppFiles("GSE126379")
fnames <- rownames(sfiles)
# There is only one supplemental file
Data = read.delim(fnames[1], header = TRUE)
head(Data)
# Next time you can just run...
fnames <- "GSE126379/GSE126379_BLCL_processed.txt.gz"
Data = read.delim(fnames[1], header = TRUE)
head(Data)
# MANIPULATE GENE ANNOTATIONS
# Genes have multiple 'identifiers' in different databases. Later on we will
# need identifiers from the ENTREZ database. To save us some pain later, lets
# extract these now using the select() function from the AnnotationDbi package.
# The ENTREZ identifiers can be found in the Homo.sapiens package.
ENTREZID <- mapIds(Homo.sapiens, keys = as.character(Data$Gene.Name), column = c("ENTREZID"),
keytype = "SYMBOL", multiVals = "first")
genes <- data.frame(ENTREZID, SYMBOL = Data$Gene.Name)
# In this file of counts, there are multiple instances where the same gene
# symbol or entrez id appears in multiple rows. This may be due to how it was
# initially generated but as we are not able to distinguish between them (and
# in most cases the count information is very similar for the duplicate rows,
# we will just ask for just one row to be kept for any given gene name)
chooseGenes <- which(!duplicated(genes$SYMBOL) & !duplicated(genes$ENTREZID))
genes <- genes[chooseGenes, ]
Data <- Data[chooseGenes, ]
# MANIPULATE GENE COUNTS
# We do not need the RPKMs just the read counts, so lets just pull out these
# columns
colnames(Data)
counts <- Data[, grep("Counts", colnames(Data))]
# LCL are labelled 1, B cell labelled 2. Lets rename the columns to reflect
# this
colnames(counts) <- sub("Reads.Counts..BLCL_", "", colnames(counts))
colnames(counts) <- sub("1", "LCL", colnames(counts))
colnames(counts) <- sub("2", "CD19B", colnames(counts))
colnames(counts) <- sub("CD19B.", "_CD19B", colnames(counts))
colnames(counts) <- sub("LCL.", "_LCL", colnames(counts))
colnames(counts)
# make the Gene IDs the name of the rows
rownames(counts) <- genes$SYMBOL
# MANIPULATE SAMPLE ANNOTATIONS
# This is a function that will be in the next release of tidyverse. We can use
# it to split a string on a certain pattern and then select things that are
# before or after than split
str_split_n <- function(string, pattern, n) {
out <- str_split(string, pattern, simplify = TRUE)
apply(out, 1, `[`, i = n)
}
# use sample names to make factors for group and individual
group <- str_split_n(colnames(counts), "_", 2) %>%
factor()
individual <- str_split_n(colnames(counts), "_", 1) %>%
factor()
# put this factor information into a data frame
samples <- data.frame(group, individual)
samples
# MAKE DGEList
# make DGEList object to connect gene counts with sample information
BLCL <- DGEList(counts = counts, samples = samples, genes = genes)
BLCL
# COUNTS NORMALIZATION
# First we calculate normalization factors
BLCL <- calcNormFactors(BLCL, method = "TMM")
# make a copy of BLCLDGEList to use in extension tab.
BLCL2 <- BLCL
# show the nomalisation factors. For this dataset the effect of
# TMM-normalisation is mild, as evident in the magnitude of the scaling
# factors, which are all relatively close to 1.
BLCL$samples$norm.factors
# Next we will keep genes with large enough counts to be considered useful.
keep <- filterByExpr(BLCL)
BLCL <- BLCL[keep, ]
# Calculate the counts per million reads for each gene for all samples
cpm <- cpm(BLCL)
# now calculate the log of these CPM values
lcpm <- cpm(BLCL, log = TRUE)
# Perform PCA
pca <- prcomp(t(lcpm), scale = TRUE)
# Plot results
autoplot(pca, data = samples, colour = "group", shape = "individual")
# Another method for dimension reduction is called MDS.
plotMDS(lcpm, labels = group)
plotMDS(lcpm, labels = individual)
# I also like this cool interactive version of a MDS plot which can be
# generated using Glimma
glMDSPlot(lcpm, labels = paste(group, individual, sep = "_"), groups = BLCL$samples[,
c("group", "individual")], launch = TRUE)
# Lets look visually at what we have done by removing the lowly expressed
# genes: this is using RcolourBrewer
lcpm2 <- cpm(BLCL2, log = TRUE)
samplenames <- colnames(BLCL)
nsamples <- ncol(BLCL)
col <- brewer.pal(nsamples, "Paired")
par(mfrow = c(1, 2))
plot(density(lcpm2[, 1]), col = col[1], lwd = 2, ylim = c(0, 0.26), las = 2, main = "",
xlab = "")
title(main = "A. Raw data", xlab = "Log-cpm")
for (i in 2:nsamples) {
den <- density(lcpm2[, i])
lines(den$x, den$y, col = col[i], lwd = 2)
}
legend("topright", samplenames, text.col = col, bty = "n")
lcpm <- cpm(BLCL, log = TRUE)
plot(density(lcpm[, 1]), col = col[1], lwd = 2, ylim = c(0, 0.26), las = 2, main = "",
xlab = "")
title(main = "B. Filtered data", xlab = "Log-cpm")
for (i in 2:nsamples) {
den <- density(lcpm[, i])
lines(den$x, den$y, col = col[i], lwd = 2)
}
legend("topright", samplenames, text.col = col, bty = "n")
# These figures also show that there is no major difference in the distribution
# of the expression values across the samples.
# demonstrating the effect of normalisation. To give a better visual
# representation of the effects of normalisation, the data is duplicated then
# adjusted so that the counts of the first sample are reduced to 5% of their
# original values, and in the second sample they are inflated to be 5-times
# larger.
BLCL3 <- BLCL2
BLCL3$samples$norm.factors <- 1
BLCL3$counts[, 1] <- ceiling(BLCL3$counts[, 1] * 0.05)
BLCL3$counts[, 2] <- BLCL3$counts[, 2] * 5
par(mfrow = c(1, 2))
lcpm3 <- cpm(BLCL3, log = TRUE)
boxplot(lcpm3, las = 2, col = col, main = "")
title(main = "A. Example: Unnormalised data", ylab = "Log-cpm")
BLCL3 <- calcNormFactors(BLCL3, method = "TMM")
BLCL3$samples$norm.factors
lcpm3 <- cpm(BLCL3, log = TRUE)
boxplot(lcpm3, las = 2, col = col, main = "")
title(main = "B. Example: Normalised data", ylab = "Log-cpm")
# MDS plot of 3rd and 4th dimensions
plotMDS(lcpm, labels = group, dim = c(3, 4))
# PCA plot for 3rd and 4th dimensions. Can also be done for 2nd vs 3rd
# dimension etc.
autoplot(pca, x = 3, y = 4, data = samples, colour = "group", shape = "individual")
You may wish to browse the RNA-seq 1-2-3 protocol published in F1000Research which performs an RNAseq analysis on different cell populations harvested from mice mammary gland tissue. You could work through the code in RStudio step by step for extra practice. Note that their workflow will proceed all the way to differential gene expression testing and pathway analysis which we will not visit until weeks 8 and 9. See https://f1000research.com/articles/5-1408 .
Only this section needs to be included in your report. This report will be due at the end of Week 9, Friday 11:59pm 29th April.
While I expect that you should explore the TCGA melanoma data, I am not opposed to you submiting a report on a dataset that you are incredibly engaged with.
Report purpose
The purpose of this report is to:
Provide you with an opportunity to apply the methods and skills you have been exposed to in class.
Create a document that can aid you when asking for feedback from tutors.
Practice creating reports for your mid-semester exam.
Because of this, do not feel restricted to using the dataset I have provided. Feel free to download any dataset that you find interesting and apply what you have learnt in class.
Report guidelines
There are no hard and fast guidelines to the final content of your submitted lab reports. For this lab you will be assessed on your ability to generate statistical questions, explore these with graphical summaries and interpret your findings. Your report will also need to be well-presented:
It is expected that your report will construct and communicate an interesting story in 4 - 6 paragraphs (ish). To do this, you should be a ‘bad’ scientist and explore the data until you find something that you think is interesting, or, can use to address the marking criteria. When preparing your report always think “is your report something that you would be proud to show your friends?”, “would your family be interested in the conclusions you made?” and “would they find it easy to read?”
Marking criteria
Lab instructions
There are at least 200 forms of cancer, and many more subtypes. Each of these is caused by errors in DNA that cause cells to grow uncontrolled. Identifying the changes in each cancer’s complete set of DNA - its genome - and understanding how such changes interact to drive the disease will lay the foundation for improving cancer prevention, early detection and treatment.
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), has generated comprehensive, multi-dimensional maps of the key genomic changes in 33 types of cancer. The TCGA dataset, 2.5 petabytes of data describing tumor tissue and matched normal tissues from more than 11,000 patients, is publically available and has been used widely by the research community. The data have contributed to more than a thousand studies of cancer by independent researchers and to the TCGA research network publications.
This data can be downloaded from the data portal in TCGA (https://portal.gdc.cancer.gov/) but is in an aweful format. Thankfully, the team at Recount2 https://jhubiostatistics.shinyapps.io/recount/ have reprocessed all of the TCGA data. In the following, I have code to download the TCGA melanoma data. Many of these code lines only need to be run once.
Read in the data and experiment with normalization.
Can you see any structure in a PCA?
Is this structure explained by any variables in the clinical data?
Let’s start by reading in the data.
### Install recount from Bioconductor, do this once only!!!!
### install.packages('BiocManager') BiocManager::install('recount') If you have
### issues installing recount, try first installing 'xfun' and rlang'then
### installing 'recount' as below install.packages('xfun')
### install.packages('rlang') BiocManager::install('recount')
library(recount)
library(dplyr)
### Browse the vignetets for a quick description of how to use the package
### browseVignettes('recount')
### Download the RangedSummarizedExperiment object at the gene level for TCGA
### melanoma data. url <-
### 'http://duffel.rail.bio/recount/v2/TCGA/rse_gene_skin.Rdata' View the url
### for the file by printing the object url url Download the file
### download.file(url = url, destfile = 'rse_gene_skin.Rdata', method = 'auto',
### mode = 'wb')
### Load the data
load("rse_gene_skin.Rdata")
### rse_gene is a RangesSummarizedExperiment
rse_gene
### Gene counts
counts <- assay(rse_gene)
### Gene identifiers
genes <- rowData(rse_gene)
### Phenotype data for the samples
samples <- as.data.frame(colData(rse_gene))
### There are 864 variables in samples, 593 are completely missing
samples %>%
is.na() %>%
colSums() %>%
table()