Aims:
to perform a linear regression analysis, logistic regression analysis and survival analysis to describe relationships between variables.
to examine the appropriateness of your linear model using different graphical tools.
to produce a data analytic report in Rmarkdown
For the exercises in this lab it is expected that you will work in groups as it will expose you to different views, ideas and opinions. However, please make an effort to complete your own report by typing your own code and (most importantly) expressing your own conclusions and interpretations.
In your groups go to the white board / google docs , discuss and brain-storm the following:
Using plots or other visual aids, visualise how regression can be used to partition the variation in an outcome in a similar way to ANOVA.
With the datasets you have collected for your assignment, list various combinations of variables that could require chi-square tests, ANOVA or regression to test for associations.
Imagine that you could magically find some additional data for your project. What variables do you wish you had to ask interesting questions? What tests would these questions require?
For practice with regression, please revise the class survey exercise from the lecture found at https://wimr-genomics.vip.sydney.edu.au/AMED3002/material/ClassSurvey22.zip.
The data from the heart transplant example in lecture can be found in the survival package in R. It was loaded as follows:
library(survival)
heart1 = jasa
heart1 = na.omit(heart1)
mscore2 = as.factor(heart1$mscore > median(heart1$mscore))
outcome = rep(NA, nrow(heart1))
outcome[heart1$futime/365 >= 1 & heart1$reject == 0] = "no"
outcome[heart1$futime/365 < 1 & heart1$reject == 1] = "yes"
outcome = as.factor(outcome)
score = as.numeric(heart1$mscore)
Only this section needs to be included in your report. This report will be due at the end of Week 6, Friday 11:59pm 1st April.
While I expect that you should explore the ANZdata, I am not opposed to you submiting a report on a dataset that you are incredibly engaged with.
Report purpose
The purpose of this report is to:
Provide you with an opportunity to apply the methods and skills you have been exposed to in class.
Create a document that can aid you when asking for feedback from tutors.
Practice creating reports for your mid-semester exam.
Because of this, do not feel restricted to using the dataset I have provided. Feel free to download any dataset that you find interesting and apply what you have learnt in class.
Report guidelines
There are no hard and fast guidelines to the final content of your submitted lab reports. For this lab you will be assessed on your ability to generate statistical questions, explore these with graphical summaries and interpret your findings. Your report will also need to be well-presented:
It is expected that your report will construct and communicate an interesting story in 4 - 6 paragraphs (ish). To do this, you should be a ‘bad’ scientist and explore the data until you find something that you think is interesting, or, can use to address the marking criteria. When preparing your report always think “is your report something that you would be proud to show your friends?”, “would your family be interested in the conclusions you made?” and “would they find it easy to read?”
Marking criteria
Lab instructions
ANZDATA is the Australia and New Zealand Dialysis and Transplant Registry. The Registry records the incidence, prevalence and outcome of dialysis and transplant treatment for patients with end stage renal failure. ANZDATA is located at The Royal Adelaide Hospital, South Australia.
Use regression to perform some hypothesis tests.
A example questions might be:
Is there a relationship between height and weight?
Which variables explain the variation in time on dialysis the best?
Is there a model which predictions rejection within 1 year?
Also consider looking at the questions using survival analysis
Let’s start by reading in the data and doing some data wrangling.
library(tidyverse)
# set your working directory to the same location as the data.
# setwd('/dataLocation/')
patientData <- read.csv("42264_AnzdataPatients.csv")
transplantData <- read.csv("42264_AnzdataTransplants.csv")
creatineData <- read.csv("42264_AnzdataTransplantSerumCreatinine.csv")
# For simplicity let's ignore patient's multiple transplants. What does the !
# do?
transplantData2 <- transplantData[!duplicated(transplantData$id), ]
# Join patientData and transplantData
Data <- full_join(patientData, transplantData2, by = "id")
# Extract some interesting variables
use <- c("gendercode", "latereferralcode", "creatinineatentry", "height", "weight",
"smokingcode", "cancerever", "chroniclungcode", "coronaryarterycode", "peripheralvascularcode",
"cerebrovasularcode", "diabetescode", "graftno", "transplantcentrestate", "recipientantibodycmvcode",
"recipientantibodyebvcode", "donorsourcecode", "donorage", "donorgendercode",
"ischaemia", "ageattransplant", "hlamismatchesa", "hlamismatchesb", "hlamismatchesdr",
"hlamismatchesdq", "maxcytotoxicantibodies", "currentcytotoxicantibodies", "timeondialysis",
"transplantstatus")
Data <- Data[, use]
# Recode some NA
Data[Data == ""] = NA
Data[Data == "-"] = NA
## You can read in "dates" in R using as.Dates
#### Make Data L
DataPrep <- full_join(patientData, transplantData2, by = "id") use <- c("gendercode", "lastfollowupdate", "transplantdate", "graftfailuredate", "transplantdate", "ageattransplant")
DataL <- DataPrep[, use] DataL[DataL == ""] = NA DataL[DataL == "-"] = NA
## Time till last follow-up
timeFU = as.Date(DataL$lastfollowupdate, format = "%d%b%Y") - as.Date(DataL$transplantdate, format = "%d%b%Y")
## Time till rejection
timeFailure = as.Date(DataL$graftfailuredate, format = "%d%b%Y") - as.Date(DataL$transplantdate, format = "%d%b%Y")
## Alternative use timeFailure = DataL$transplantperiod
## Which grafts were failures?
failures = !is.na(timeFailure)
## Simple survival model
library(survival)
fit = coxph(Surv(timeFailure,failures)~DataL$ageattransplant)
fit
plot(survfit(Surv(timeFailure/365,failures)~DataL$gendercode),col = 1:2,lty = 1:2,xlab = 'years',ylab = 'Probability of success')
legend('topright',c('Women','Men'),col = 1:2,lty = 1:2)
fit = coxph(Surv(timeFailure,failures)~DataL$gendercode)
fit