Aims: 
 to become comfortable manipulating data. 
 to perform exploratory data analysis (EDA) via clustering. 
 to explore the affects of missingness.


1 Preparation for lab

  • Please make sure you have completed the computer labs for the first three weeks.
  • Find a dataset related to your first assessment. This does not have to be the one that you will eventually use.

2 Group work

The link for this week’s google docs is here. You may wish to set up something similar to this for your group assessment.

2.1 Group dynamics

Form groups of 4 or 5 for your first group work assessment task.

  • Fill out your baseball card.

  • Share with your group what your strengths are.

  • Is there anything that your would like an opportunity to improve?

Setting expectations early can help groups function well.

  • What do you hope to get out of the assessment task? Good marks? Fun? Learning?

  • How is your group going to allocate tasks? Will you have a group leader or share this responsibility?

2.2 Missingness

In your groups, choose a dataset that you have been considering using for your assessment. Without looking at the dataset in too much detail:

  • How would you cluster the variables in your dataset?
  • Which variables do you think would have the highest missingness?
    • Why? What would drive the missingness?
    • Would this missingness be MCAR, MAR or MNAR?
  • What associations do you expect to be present in your dataset?

3 Data exercises

3.1 Wisconsin breast cancer study

This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg. He assessed biopsies of breast tumours for 699 patients up to 15 July 1992; each of nine attributes has been scored on a scale of 1 to 10, and the outcome is also known. There are 699 rows and 11 columns. This dataset is stored as “biopsy” when the MASS package is loaded.

Load some libraries.

library(MASS)
library(tidyverse)
library(naniar)
library(dendextend)
  • Q1: Read in data, are there any missing observations?
Data <- biopsy
dim(Data)

# Remove ID column
Data <- dplyr::select(Data, -ID)

# How many missing values are there?
sum(is.na(Data))

# Visualize missingness
vis_miss(Data)
  • Q2: Perform case deletion.
Data <- na.omit(Data)

# Maybe Data <- Data[rowSums(is.na(Data))==0,]
  • Q3: Exclude the class variable and then scale and center the variables.
# Scale the column
DataScaled <- Data %>%
  dplyr::select(-class) %>%
  scale()
  • Q4: Perform k-means clustering.
set.seed(51773)
kM <- kmeans(DataScaled,2)

# Plot results. 
# We will show later on in the course (week 7) that we can use dimension
# reduction strategies such as principal component analysis (PCA) to visualise 
# multi-dimensional datasets in two dimensions. Without knowing how PCA works, you
# can still hopefully see that this plot shows that cluster_1 appears to be capturing
# signal in the data that is representative of the malignant group.

pca <- prcomp(DataScaled)
df <- data.frame(pca$x, cluster = paste("cluster", kM$cluster, sep = "_"), Data)
ggplot(df, aes(x = PC1, y = PC2, colour = class, shape = cluster)) + geom_point()

# Can you see the same story by just plotting pairs of variables together? ie V1 vs V7
  • Q5: Perform hierarchical clustering with k = 2.
# Perform clustering on the scaled data
hC <- hclust(dist(DataScaled))

# Plot the dendrogram
plot(hC)

# We can cut the dendrogram at a height such that there are two clusters
hCC <- cutree(hC, k = 2)

# There are lots of packages for plotting dendrograms
hC %>% 
  as.dendrogram() %>%
  set("branches_k_color", k = 2) %>% 
  set("labels","") %>%
  plot() 
  • Q6: Compare clusterings to tumour class. What conclusions could you make and why?
DATA <- data.frame(Data, kmeans = kM$cluster, hclust = hCC)

(tabk <- table(DATA$kmeans, DATA$class))
chisq.test(tabk)

(tabh <- table(DATA$hclust, DATA$class))

(tabhk <- table(DATA$hclust, DATA$kmeans))
  • Q7: Do the results change if you do not scale the variables? Try modifying the code above to do this before looking at the code below.
Data2 <- Data %>%
  dplyr::select(-class)

set.seed(51773)
kM2 <- kmeans(Data2,2)
pca2 <- prcomp(Data2)
df2 <- data.frame(pca2$x, cluster = paste("cluster", kM2$cluster, sep = "_"), Data)
ggplot(df2, aes(x = PC1, y = PC2, colour = class, shape = cluster)) + geom_point()

3.2 Extension

  • E1: When performing hierarchical clustering try experimenting with distance and clustering metrics. ?dist and ?hclust
  • E2: When performing kmeans clustering, how stable are the results?

4 Lab report

Only this section needs to be included in your report. This report will be due at the end of Week 6, Friday 11:59pm 5th April.

While I expect that you should explore the ANZdata, I am not opposed to you submiting a report on a dataset that you are incredibly engaged with.

Report purpose

The purpose of this report is to:

  1. Provide you with an opportunity to apply the methods and skills you have been exposed to in class.

  2. Create a document that can aid you when asking for feedback from tutors.

  3. Practice creating reports for your mid-semester exam.

Because of this, do not feel restricted to using the dataset I have provided. Feel free to download any dataset that you find interesting and apply what you have learnt in class.

Report guidelines

There are no hard and fast guidelines to the final content of your submitted lab reports. For this lab you will be assessed on your ability to generate statistical questions, explore these with graphical summaries and interpret your findings. Your report will also need to be well-presented:

  • Think about how your report might be structured (eg. Summary, Introduction, Results, Conclusion).
  • Do your figures have captions?
  • Are your figures legible?
  • Are you using sub-headings effectively?

It is expected that your report will construct and communicate an interesting story in 4 - 6 paragraphs (ish). To do this, you should be a ‘bad’ scientist and explore the data until you find something that you think is interesting, or, can use to address the marking criteria. When preparing your report always think “is your report something that you would be proud to show your friends?”, “would your family be interested in the conclusions you made?” and “would they find it easy to read?”

Marking criteria

  • 1 mark - Chi-square test – Performs a chisquare test, check assumptions, states whether test of independence or homogeneity, reports odds ratio or relative risk.
  • 0.5 mark - Clustering – Performs clustering which may or may not provide obvious insight into dataset.
  • 1 mark - ANOVA – Fits a two-way ANOVA model. States hypothesis clearly, checks assumptions, makes decision and provides interpretation.
  • 1 mark - Regression – Fits a linear regression model. States hypothesis clearly, checks assumptions, makes decision and provides interpretation.
  • 0.5 mark - Presentation – Rmarkdown, no extra R output, use of headings.

Lab instructions

4.1 Week 4 - ANZdata

ANZDATA is the Australia and New Zealand Dialysis and Transplant Registry. The Registry records the incidence, prevalence and outcome of dialysis and transplant treatment for patients with end stage renal failure. ANZDATA is located at The Royal Adelaide Hospital, South Australia.

Australia and New Zealand Dialysis and Transplant Registry

Data and dictionaries

4.2 Questions?

  • Are there missing data?
  • Are there interesting clusters of variables or subjects?

4.3 Data

Let’s start by reading in the data and doing some data wrangling.

library(tidyverse)
# set your working directory to the same location as the data.
# setwd('/dataLocation/')

patientData <- read.csv('42264_AnzdataPatients.csv')
transplantData <- read.csv('42264_AnzdataTransplants.csv')
creatineData <- read.csv('42264_AnzdataTransplantSerumCreatinine.csv')

# For simplicity let's ignore patient's multiple transplants.
# What does the ! do?
transplantData2 <- transplantData[!duplicated(transplantData$id),]

# Join patientData and transplantData
Data <- full_join(patientData,transplantData2, by = "id")

# Extract some interesting variables
use <- c('gendercode','latereferralcode','creatinineatentry','height','weight','smokingcode','cancerever','chroniclungcode','coronaryarterycode','peripheralvascularcode','cerebrovasularcode','diabetescode','graftno','transplantcentrestate','recipientantibodycmvcode','recipientantibodyebvcode','donorsourcecode','donorage','donorgendercode','ischaemia','ageattransplant','hlamismatchesa','hlamismatchesb','hlamismatchesdr','hlamismatchesdq','maxcytotoxicantibodies','currentcytotoxicantibodies','timeondialysis', 'transplantstatus')

Data <- Data[,use]

# Recode some NA
Data[Data==""]  = NA
Data[Data=="-"]  = NA

4.4 Missingness

We will now visualise the missingness and perform case deletion. Consider imputing instead and compare results.

library(tidyverse)
library(naniar)

# create a heatmap of all the missing observations
vis_miss(Data)

# Which variables have the most missing data?
sort(colSums(is.na(Data)))

# Remove the variable with the most missingness
Data <- dplyr::select(Data,-hlamismatchesdq)
vis_miss(Data)

# Remove observations with any missingness
Data <- na.omit(Data)
dim(Data)

# recode factors to remove unobserved levels
Data <- droplevels(Data)

4.5 Data for clustering

Cluster only works on numerical variables. We can recode categorical data as vectors of 0’s and 1’s.

# Convert data to numerical ignore transplant status and intercept
# look at the examples in ?model.matrix
# the . means use all columns, the minus means exclude transplantstatus
# We will talk more about this weird notation in Week 6.

DataM <- model.matrix(~ . -transplantstatus,Data)[,-1]

# Let's only consider categories with more than 10 observations
DataM <- DataM[,colSums(DataM!=0)>=10]

# Extension... google Gower's distance

4.6 Rmarkdown report

Provide your data analytics code as well as a summary of your findings in a reproducible report (e.g. Rmarkdown report).