Using ggtree: A lesson in COVID phylogenetics

Versioning

This tutorial was prepared using the latest versions of R (version 4.3.0) and Rstudio (version 2023.03.0 Build 386) as of 2nd May 2023.

Required packages

Install the following packages using bioconductor.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ggtree") #This will also install ggplot2, ape and RColorBrewer
BiocManager::install("ggtreeExtra") #This will also install ggnewscale
BiocManager::install("ape")
BiocManager::install("readr")

Load the packages.

library(ggtree)
library(ggplot2)
library(ggtreeExtra)
library(ggnewscale)
library(ape)
library(readr)
library(RColorBrewer)

Background to this dataset

We are going to explore a phylogenetic tree that contains a representative set of SARS-CoV-2 strains that have been identified from local and imported cases in Australia. The tree was build using whole-genome sequences using a maximum likelihood approach. There is one sequence from a Patient (001) that was identified from a local case. We would like to determine the possible source of this virus by comparing how it related to viruses previously sequenced in the country. Furthermore, we would like to see what lineage the virus has been assigned to and if it is a known Variant of Concern (V.O.C.) that regarding increased transmissibility/immune escape.

You are encouraged the work through the content here and then we will discuss the findings together at the end.

One of the biggest takeaways from this practical is the importance of being able to visualize multiple layers of information in a single figure to (1) extract meaning and (2) communicate the right message.

Loading the phylogenetic tree

The tree is in nexus format so we need to use the read.nexus() function for importing.

tree <- read.nexus("COVID.tree.2022.nexus") #Adjust to ensure path to file on your own computer is correct

Lets check the tree has been loaded correctly by displaying it.

basic_tree <- ggtree(tree) #For simplicity, we will define the basic tree as a variable, which is read by the ggtree() function
basic_tree

The tree currently lacks a few important features such as tip labels (i.e. the sequence names) and a scale bar as the branch lengths are our measure of genetic distance. These can be added sequentially using the geom_tiplab() and geom_treescale() functions for the names and scale, respectively.

basic_tree + geom_tiplab() + geom_treescale()

We can add parameters to those functions to adjust the sizes of the different elements in the tree.

basic_tree + geom_tiplab(size=3) + geom_treescale(fontsize=3,width=0.0001,y=-2) # size and fontsize are setting text sizes, width is manually setting the length of the scale bar while, y is re-positioning the scale bar down

Incorporating metadata

These current sequence labels are accession numbers that are unique identifiers from our source, which was the GISAID public repository of SARS-CoV-2 sequences. Importantly, each accession number can be linked with different kind of metadata such as the virus strain name,the state is was identified, the date of collection, and anything else that might be relevant. We have a metadata table of such information that we can load, and try to incorporate with our tree.

metadata = read_delim("COVID.metadata.2022.tsv",delim='\t',na="xxx") #Adjust to ensure path to file on your own computer is correct

## Rows: 42 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr  (7): Accession, Strain, Region, Country, State, Lineage, Strain_Date
## date (1): Date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

To get a better idea of the data, you can use the head() function to have a look at the top few lines of the table.

head(metadata)

## # A tibble: 6 × 8
##   Accession       Strain     Date       Region Country State Lineage Strain_Date
##   <chr>           <chr>      <date>     <chr>  <chr>   <chr> <chr>   <chr>      
## 1 MN908947        Wuhan-1    2019-12-26 Asia   China   Wuhan B       Wuhan-1/20…
## 2 EPI_ISL_498544  ACT0086    2020-07-03 Ocean… Austra… Aust… D.2     ACT0086/20…
## 3 EPI_ISL_1170947 ACT0096    2021-03-03 Ocean… Austra… Aust… B.1.351 ACT0096/20…
## 4 Patient_001     Patient_0… 2021-03-03 Ocean… Austra… New … B.1.1.7 Patient_00…
## 5 EPI_ISL_1121985 NSW-R0096  2021-02-27 Ocean… Austra… New … B.1.1.7 NSW-R0096/…
## 6 EPI_ISL_1121986 NSW-R0097  2021-02-27 Ocean… Austra… New … B.1.1.7 NSW-R0097/…

We need to link the metadata table to the sequence names by appending it as a data frame onto a tree. We will create a new variable that defines this annotated tree.

annotated_tree <- basic_tree %<+% as.data.frame(metadata)

If we print the annotated tree, it won’t look any different to the basic tree. We can use aes(label) function to define new labels with geom_tiplab(). Thw example below uses the strain name (“Strain” field) from our metadata table.

annotated_tree +
   geom_tiplab(aes(label=Strain),size=3) +
   geom_treescale(width=0.0001,y=-2)

We can choose any of the fields in out metadata table and display them on the tips of our tree.

annotated_tree +
   geom_tiplab(aes(label=Lineage),size=3) +
   geom_treescale(width=0.0001,y=-2)

Highlighting sequences

ggtree has many different features for highlighting individual or groups of sequences. Part of doing this requires use to have a system where we can refer to specific nodes (where sequences converge together along branches) and taxa (the external branches/tips). ggtree uses a simple numbering system that we can toggle on using display using the geom_text(aes(label=node) function.

annotated_tree +
 geom_tiplab(aes(label=Lineage),size=3) +
 geom_treescale(width=0.0001,y=-2) +
 geom_text(aes(label=node))

Here everything is a bit squished and difficult to read. This can fixed in a couple of steps. First, lets push all the sequence labels to the right using the align parameter in geom_tiplab() along with defining the linesize that links the tip to our label ensuring it is thin and doesn’t block text. We then re-size and re-position the node numbers using the size and hjust parameters in geom_text().

annotated_tree +
 geom_tiplab(aes(label=Lineage),align=TRUE,linesize=0.1,size=3) +
 geom_treescale(width=0.0001,y=-2) +
 geom_text(aes(label=node),size=3,hjust=-0.5)

As displayed there should be a number assigned to each node and tips (yes, it’s still a bit messy!). We would like to identify the clades (nodes that define a common group of sequences) for a number of important lineages including the V.O.C.’s and the Melbourne second wave cluster. Can we identify the common node (number) for the following lineages and record the value that is written (just to the right of the nodes/tips).

See if you can find the node number for the following clades:

#B.1.1.7 = node ??? (V.O.C.)
#B.1.351 = node ??? (V.O.C.)
#D.2 = node ??? (Melbourne lineage)

You can check your values here:

#B.1.1.7 = node 74 (V.O.C.)
#B.1.351 = node 59 (V.O.C.)
#D.2 = node 68 (Melbourne lineage)

With these node values, we can now highlight them using a filled block over that group with the geom_hilight() function. The parameters ‘node’ defines the node number, while ‘fill’ defines the color.

annotated_tree +
 geom_tiplab(aes(label=Lineage),align=TRUE,linesize=0.1,size=3) +
 geom_treescale(width=0.0001,y=-2) +
 geom_hilight(node=74,fill="red") +
 geom_hilight(node=59,fill="green") +
 geom_hilight(node=68,fill="yellow")

We are now starting to build a nice picture of what SARS-CoV-2 strains have been in circulation in Australia. While the broader lineage information is important, we will need to bring back the strain names to do a more in-depth analysis of where our Patient’s sample (Patient_001) might sit within this diversity. As we were doing before, use the geom_tiplab(aes(label=)) function to add the Strain_Date field labels onto our tips.

annotated_tree +
 geom_tiplab(aes(label=Lineage),align=TRUE,linesize=0.1,size=3) +
 geom_tiplab(aes(label=Strain_Date),size=3) +
 geom_treescale(width=0.0001,y=-2) +
 geom_hilight(node=74,fill="red") +
 geom_hilight(node=59,fill="green") +
 geom_hilight(node=68,fill="yellow")

You might see that the labels overlap. To fix this, we can use the ‘offset’ parameter of any geom_tiplab() function to push the lineage names to the right. The scale is proportional to our branch lengths (scale bar), so that can serve as a guide for what value to use.

annotated_tree +
 geom_tiplab(aes(label=Lineage),align=TRUE,linesize=0.1,size=3,offset=0.00025) +
 geom_tiplab(aes(label=Strain_Date),size=3) +
 geom_treescale(width=0.0001,y=-2) +
 geom_hilight(node=74,fill="red") +
 geom_hilight(node=59,fill="green") +
 geom_hilight(node=68,fill="yellow")

We can use the same approach as earlier to identify and highlight the node containing Patient_001’s sequence

annotated_tree +
 geom_tiplab(aes(label=Strain_Date),align=TRUE,size=3,linesize=0.1) +
 geom_treescale(width=0.0001,y=-2) +
 geom_text(aes(label=node),size=3,hjust=-0.5)

Look for the node that defines the small cluster of three sequences including Patient_001

#Patient_001 = node 82

Instead of filled boxes, we can highlight this group using a line and text to the right with the geom_cladelabel() function.

annotated_tree +
 geom_tiplab(aes(label=Lineage),align=TRUE,linesize=0.1,size=3,offset=0.00025) +
 geom_tiplab(aes(label=Strain_Date),size=3) +
 geom_treescale(width=0.0001,y=-2) +
 geom_hilight(node=74,fill="red") +
 geom_hilight(node=59,fill="green") +
 geom_hilight(node=68,fill="yellow") +
 geom_cladelabel(node=82,label="Outbreak",offset=0.00027)

Displaying multiple metadata fields

At some point, having multiple text labels gets too difficult to display. ggtree has many different ways to present the metadata fields not as tip labels. Let’s explore how we can add colored bars to the right of sequences to shows multiple metadata fields.

This can be done using the geom_fruit() function. We will use the ‘geom’ parameter to define we want to show color tiles with the mapping paramter defining the metadata field (here Lineages). The width and offset parameters are setting the size and position of the colored tiles.

annotated_tree +
 geom_tiplab(aes(label=Strain_Date),size=3) +
 geom_treescale(width=0.0001,y=-2) +
 geom_hilight(node=74,fill="red") +
 geom_hilight(node=59,fill="green") +
 geom_hilight(node=68,fill="yellow") +
 geom_cladelabel(node=82,label="Outbreak",offset=0.00027) +
 geom_fruit(geom=geom_tile,mapping=aes(fill=Lineage),width = 0.00004,offset=0.22)

We obviously have a few formatting options to fix up. First, lets tackle the color palette.

The library ‘RColorBrewer’ has lots of great options for coloring including color blind friendly options. Have a look at these palettes.

display.brewer.all(colorblindFriendly = TRUE)

We can now define the color palette of choice and set how many unique values we expect to see for the field of interest.

getPalette1 = colorRampPalette(brewer.pal(12, "Paired")) #This is using the Paired color palette with 12 color blocks
colourCount_Lineages = length(unique(metadata$Lineage)) #This records the number of unique values for Lineage

Next, we add the new color palette with the scale_fill_manual() function using what we define above.

annotated_tree +
 geom_tiplab(aes(label=Strain_Date),size=3) +
 geom_treescale(width=0.0001,y=-2) +
 geom_hilight(node=74,fill="red") +
 geom_hilight(node=59,fill="green") +
 geom_hilight(node=68,fill="yellow") +
 geom_cladelabel(node=82,label="Outbreak",offset=0.00027) +
 geom_fruit(geom=geom_tile,mapping=aes(fill=Lineage),width = 0.00004,offset=0.22) +
 scale_fill_manual(values=getPalette1(colourCount_Lineages))

We can re-position the legends to the bottom of the tree to fix the tree dimensions and the text overlaps. This is done using the theme() and guides() functions. The ‘legend.position’ parameter can set the position of the bottom, while the ‘nrows’ helps with layout and number of values per row.

annotated_tree +
 geom_tiplab(aes(label=Strain_Date),size=3) +
 geom_treescale(width=0.0001,y=-2) +
 geom_hilight(node=74,fill="red") +
 geom_hilight(node=59,fill="green") +
 geom_hilight(node=68,fill="yellow") +
 geom_cladelabel(node=82,label="Outbreak",offset=0.00027) +
 geom_fruit(geom=geom_tile,mapping=aes(fill=Lineage),width = 0.00004,offset=0.22) +
 scale_fill_manual(values=getPalette1(colourCount_Lineages)) +
 theme(legend.position="bottom") +
 guides(fill=guide_legend(nrow=7))

Finally, we want to add a second metadata field to show the States were the samples were collected. Liek before, we should define a color palette first for this State data.

getPalette2 = colorRampPalette(brewer.pal(11, "RdYlBu"))
colourCount_States = length(unique(metadata$State))

And then we can use the new_scale_fill() function to append another tile block with the existing Lineage one. This new_scale_fill() function is important otherwise second tile block will overwrite the first.

annotated_tree +
 geom_tiplab(aes(label=Strain_Date),size=3) +
 geom_treescale(width=0.0001,y=-2) +
 geom_hilight(node=74,fill="red") +
 geom_hilight(node=59,fill="green") +
 geom_hilight(node=68,fill="yellow") +
 geom_cladelabel(node=82,label="Outbreak",offset=0.00027) +
 geom_fruit(geom=geom_tile,mapping=aes(fill=Lineage),width = 0.00004,offset=0.22) +
 scale_fill_manual(values=getPalette1(colourCount_Lineages)) +
 theme(legend.position="bottom") +
 guides(fill=guide_legend(nrow=7)) +
 new_scale_fill() +
 geom_fruit(geom=geom_tile,mapping=aes(fill=State),width = 0.00004) +
 scale_fill_manual(values=getPalette2(colourCount_States)) +
 theme(legend.position="bottom") +
 guides(fill=guide_legend(nrow=7))

Research Project 2 group work.

By next week’s workshop (week 11), your group should be ready to update myself or one of the demonstrators on your progress so far with the reproducible report assessment. Questions we will be keen to hear from you about will include:

What is the broad research question of your chosen manuscript?
What does the data look like?
A draft flow diagram of their analysis.
Some interesting figures/results you hope to reproduce.

You should spend time in today’s workshop working towards this assessment and importantly, choosing a manuscript and dataset before leaving today.

Module 4 lab report

Select a classifier that is not SVM and write a report demonstrating that you can build and evaluate a classifier using 5-fold cross-validation with the Stage III Melanoma case study (which you will work on in weeks 11 and 12). It is optional whether you include feature selection outside or within your cross-validation framework. Present your results and code in a reproducible report (e.g. Rmarkdown report). Be complete in writing up your findings. This report contributes to 4% of your final assessment.

The marking is as follows:

1 mark is given for the DE analysis.
1 mark is given for building the classifier.
1 marks are given for generating a cross-validation code.
1 mark is given for writing an explanation of what you have done.

The report is due 11:59pm Friday 17th May.