1 Preparation for lab

For the exercises in this lab it is expected that you will work in groups as it will expose you to different views, ideas and opinions. However, please make an effort to complete your own report by typing your own code and (most importantly) expressing your own conclusions and interpretations.

2 Data exercises

2.1 ToothGrowth

The data set ToothGrowth.txt has measurements of tooth growth (len) of guinea pigs for different dosages of Vitamin C (dose) and two different delivery methods (supp). The response is the length of odontoblasts (len) in each of 10 guinea pigs at each of three dose levels of Vitamin C (0.5, 1, and 2 mg) with each of two delivery methods (orange juice or ascorbic acid). (McNeil, et al, 1977) References: McNeil, D. R. (1977) Interactive Data Analysis. New York: Wiley. Perform a two-way analysis of variance of tooth growth modelled by dosage and delivery method. The following questions help you with this analysis.

    1. Obtain boxplots of the data.
library(ggplot2)
library(tidyr)
library(datasets)
# The ToothGrowth dataset is loaded in the 'datasets' package
tooth = ToothGrowth
ggplot(tooth, aes(y = len, x = supp, colour = factor(dose))) + geom_boxplot() + theme_classic() +
    xlab("Supplement") + ylab("Length") + labs(colour = "Supplement")
    1. Consider only Vitamin C supplement, assess the effect of different dose levels. What is the difference between aov1 and aov2 ?
VCdata = tooth[tooth[, "supp"] == "VC", ]
dim(VCdata)
head(VCdata)

boxplot(len ~ dose, VCdata)

aov1 = aov(len ~ dose, VCdata)
summary(aov1)

aov2 = aov(len ~ factor(dose), VCdata)
summary(aov2)

ggplot(VCdata, aes(y = len, x = factor(dose))) + geom_boxplot() + theme_classic() +
    xlab("Dose") + ylab("Length")


## Another way of fitting anova in R
fit = lm(len ~ factor(dose), VCdata)
anova(fit)
    1. Comment on the assumptions needed to fit anova?
    1. Obtain an interaction plot to visualise the interaction between dose and supplement on tooth length and comment on the meaning of the plot. This can be using the function interaction.plot as well as in ggplot.
# Change dose to factor
tooth$dose = as.factor(tooth$dose)

## Without ggplot
attach(tooth)
interaction.plot(dose, supp, len)
detach()

## With ggplot
ggplot(tooth, aes(x = factor(dose), y = len, colour = supp)) + geom_boxplot() + stat_summary(fun.y = mean,
    geom = "line", aes(group = supp)) + theme_classic() + xlab("Dose") + ylab("Length") +
    labs(colour = "Supplement")
    1. Fit the full model including interactions and obtain the corresponding analysis of variance table. Next, use the F-test to compare the full model with the additive model (no interaction model) and comment on the results.
tooth.aov = aov(len ~ supp * dose, tooth)
summary(tooth.aov)

tooth.aov.additive = aov(len ~ supp + dose, tooth)
summary(tooth.aov.additive)

anova(tooth.aov, tooth.aov.additive)

3 Lab report

Only this section needs to be included in your report. This report will be due at the end of Week 6, Friday 11:59pm 29th March.

While I expect that you should explore the ANZdata, I am not opposed to you submiting a report on a dataset that you are incredibly engaged with.

Report purpose

The purpose of this report is to:

  1. Provide you with an opportunity to apply the methods and skills you have been exposed to in class.

  2. Create a document that can aid you when asking for feedback from tutors.

  3. Practice creating reports for your mid-semester exam.

Because of this, do not feel restricted to using the dataset I have provided. Feel free to download any dataset that you find interesting and apply what you have learnt in class.

Report guidelines

There are no hard and fast guidelines to the final content of your submitted lab reports. For this lab you will be assessed on your ability to generate statistical questions, explore these with graphical summaries and interpret your findings. Your report will also need to be well-presented:

  • Think about how your report might be structured (eg. Summary, Introduction, Results, Conclusion).
  • Do your figures have captions?
  • Are your figures legible?
  • Are you using sub-headings effectively?

It is expected that your report will construct and communicate an interesting story in 4 - 6 paragraphs (ish). To do this, you should be a ‘bad’ scientist and explore the data until you find something that you think is interesting, or, can use to address the marking criteria. When preparing your report always think “is your report something that you would be proud to show your friends?”, “would your family be interested in the conclusions you made?” and “would they find it easy to read?”

Marking criteria

  • 1 mark - Chi-square test – Performs a chisquare test, check assumptions, states whether test of independence or homogeneity, reports odds ratio or relative risk.
  • 0.5 mark - Clustering – Performs clustering which may or may not provide obvious insight into dataset.
  • 1 mark - ANOVA – Fits a two-way ANOVA model. States hypothesis clearly, checks assumptions, makes decision and provides interpretation.
  • 1 mark - Regression – Fits a linear regression model. States hypothesis clearly, checks assumptions, makes decision and provides interpretation.
  • 0.5 mark - Presentation – Rmarkdown, no extra R output, use of headings.

Lab instructions

3.1 Week 4 - ANZdata

ANZDATA is the Australia and New Zealand Dialysis and Transplant Registry. The Registry records the incidence, prevalence and outcome of dialysis and transplant treatment for patients with end stage renal failure. ANZDATA is located at The Royal Adelaide Hospital, South Australia.

Australia and New Zealand Dialysis and Transplant Registry

Data and dictionaries

3.2 Questions?

Use ANOVA to perform some hypothesis tests.

As some example questions, consider comparing

  • creatinineatentry and transplantcentrestate.
  • timeondialysis and hlamismatchesb.

Also consider looking for interaction effects

  • timeondialysis with interaction between hlamismatchesb and sex.

What are your conclusions? Make sure to check your assumptions.

Tip: If your data is right-skewed, try a square-root or a log transform. Does this alter your conclusion? Why?

3.3 Data

Let’s start by reading in the data.

library(tidyverse)
# set your working directory to the same location as the data.
# setwd('/dataLocation/')

patientData <- read.csv("42264_AnzdataPatients.csv")
transplantData <- read.csv("42264_AnzdataTransplants.csv")
creatineData <- read.csv("42264_AnzdataTransplantSerumCreatinine.csv")

# For simplicity let's ignore patient's multiple transplants.  What does the !
# do?
transplantData2 <- transplantData[!duplicated(transplantData$id), ]

# Join patientData and transplantData
Data <- full_join(patientData, transplantData2, by = "id")

# Extract some interesting variables
use <- c("gendercode", "latereferralcode", "creatinineatentry", "height", "weight",
    "smokingcode", "cancerever", "chroniclungcode", "coronaryarterycode", "peripheralvascularcode",
    "cerebrovasularcode", "diabetescode", "graftno", "transplantcentrestate", "recipientantibodycmvcode",
    "recipientantibodyebvcode", "donorsourcecode", "donorage", "donorgendercode",
    "ischaemia", "ageattransplant", "hlamismatchesa", "hlamismatchesb", "hlamismatchesdr",
    "hlamismatchesdq", "maxcytotoxicantibodies", "currentcytotoxicantibodies", "timeondialysis",
    "transplantstatus")

Data <- Data[, use]

# Recode some NA
Data[Data == ""] = NA
Data[Data == "-"] = NA