Aims: 
 to practice statistical thinking on graphical summaries 
 to produce a data analytic report in Rmarkdown
 to become familiar with ggplot2

1 Preparation for lab

Complete one of the introduction to R resources given in the pre-class material at home. Review materials from your introductory statistics course.

For the exercises in this lab it is expected that you will work in groups as it will expose you to different views, ideas and opinions. However, please make an effort to complete your own report by typing your own code and (most importantly) expressing your own conclusions and interpretations.

“The greatest value of a picture is when it forces us to notice what we never expected to see.” - John W. Tukey

2 Group work

At the beginning of this lab break up into groups of four or five. This tutorial is designed to generate discussion and provide an opportunity to practise statistical thinking and communicating statistical concepts. There are no “correct” answers. As a group discuss and brainstorm the following:

How do you think succeeding in this course could help your career?
Discuss which researcher you have chosen to write a profile on and what you have found.
Using the Framingham Heart Study data dictionary:
- what questions could you ask?
- draft some visualisations you could use to answer these questions.
- are there any properties of the data which might confound the question or make it difficult to answer?
- which of the questions you brainstormed were the hardest to answer visually?
Present the outcomes of your discussion to the rest of the class using the white board.

3 Data exercises

3.1 Sleep

This is a historical dataset given as an example in William Gosset’s 1908 paper The Probable Error of a Mean. In this paper, Gosset introduced a form of what later became known as Student’s t-distribution. The data used in his paper were actually taken from a table by A. R. Cushny and A. R. Peebles in the Journal of Physiology for 1904, showing the different effects of the optical isomers of hyoscyamine hydrobromide in producing sleep. Gosset described the data as follows: “the sleep of 10 patients was measured without hypnotic and after treatment (1) with D. hyoscyamine hydrobromide, (2) with L. hyoscyamine hydrobromide.”

The average number of hours of sleep gained by the use of each drug is provided in the file StudentTSleepData.csv. The aim of this study is to understand the effectiveness of the two treatments.

What questions (or hypotheses) can you formulate here?
How would you visualise these data to best illustrate the results? Think about what you would expect to see in a scatter plot, a boxplot or a bar plot.
We will use this very small dataset to learn how to create graphics using ggplot2. Additional resources can be found in the book R for Data Science.

Press the “code” button on the right to see how we might approach this in R. For most weeks I recommend that you: 1. run the lines of code as is without thinking too hard, 2. think about what each line might be doing, 3. discuss this as a group and ask the tutor for help or feedback, and 4. try altering the code and see what happens (this week, mostly in the data wrangling). These code snippets are not always complete and are designed to be worked through collaboratively with discussion.

## Load the R packages that we need. ###########################################
## These contain a bunch of useful functions.  Do you know how to install them? In
## RStudio go to Tools -> Install Packages -> Install from Repository, or run
## install.packages(c('ggplot2', 'tidyr')) in the console.

library(ggplot2)
library(tidyr)

## Read in the data. ###########################################################
## This data is stored on Ellis' website but the same command can be used to read
## in .csv files from your personal computer. Try downloading it locally and then
## reading it in.  What does row.names = 1 mean? Look at the ugly help
## documentation using help('read.csv') or ?read.csv

sleep <- read.csv("http://www.maths.usyd.edu.au/u/ellisp/AMED3002/data/StudentTSleepData.csv", 
    row.names = 1)

## Look at the data. ###########################################################

sleep

## Reshape the data.  ##########################################################
## There is a lot to unpack here.  First run it and think 'what did this code
## do?'.  Now move on and then after you've looked at some plots come back and
## talk about what $, factor, pivot_longer, cols, names_to and values_to actually
## mean.  You might like to look at help('pivot_longer') or vignette('pivot').

sleep$patient <- factor(1:10)
sleepLong <- pivot_longer(sleep, cols = c("Dextro", "Laevo"), names_to = "treatment", 
    values_to = "hours")
head(sleepLong)

## Different types of plots. ###################################################
## What different aspects of the data do each of these plots highlight?

## Boxplot
ggplot(sleepLong, aes(x = treatment, y = hours, col = treatment)) + geom_boxplot()

ggplot(sleepLong, aes(x = treatment, y = hours, col = treatment)) + geom_boxplot() + 
    geom_point()

ggplot(sleepLong, aes(x = treatment, y = hours, col = treatment)) + geom_boxplot() + 
    geom_point() + theme_classic()

## Line plot
ggplot(sleepLong, aes(x = treatment, y = hours, group = patient, col = patient)) + 
    geom_line()

## Density plot
ggplot(sleepLong, aes(x = hours, group = treatment, col = treatment)) + geom_density()

## Violin plot
ggplot(sleepLong, aes(x = treatment, y = hours, col = treatment)) + geom_violin() + 
    geom_point()

## Bar plots
ggplot(sleepLong, aes(x = patient, y = hours, fill = treatment)) + geom_bar(stat = "identity")

ggplot(sleepLong, aes(x = patient, y = hours, fill = treatment)) + geom_bar(stat = "identity", 
    position = position_dodge())

ggplot(sleepLong, aes(x = patient, y = hours, fill = treatment)) + geom_bar(stat = "identity", 
    position = position_dodge()) + labs(title = "Sleep treatment", subtitle = "Bar plot", 
    caption = "(based on data from Table 1 in Student, 1908)") + ylab("hours")
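
To see what row.names = 1 and pivot_longer() are doing without the real file, the following minimal sketch builds a tiny wide table from inline text and reshapes it. The patient labels and hours below are made up for illustration and are not Gosset's values; only the column names Dextro and Laevo are taken from the code above.

```r
library(tidyr)

# Made-up wide data: one row per patient, one column per treatment.
csv_text <- "id,Dextro,Laevo
p1,0.5,1.5
p2,-1.0,0.5
p3,0.0,2.0"

# row.names = 1 turns the first column into row names rather than a data column.
toy <- read.csv(text = csv_text, row.names = 1)
toy$patient <- factor(rownames(toy))

# Stack the two treatment columns: the old column names go into 'treatment'
# and the cell values go into 'hours'.
toyLong <- pivot_longer(toy, cols = c("Dextro", "Laevo"),
                        names_to = "treatment", values_to = "hours")
toyLong
```

Each patient now occupies two rows, one per treatment, which is the layout ggplot expects when treatment is mapped to an aesthetic.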

3.2 Sleep (ext)

Here is the relevant section taken from the original paper, accessed through the JSTOR website. Spot the typo in the table.

4 Lab report

Only this section needs to be included in your Module 1 Lab Report to be handed in at the end of Week 3.

Report guidelines

There are no hard and fast guidelines to the final content of your submitted lab reports. For this lab you will be assessed on your ability to generate statistical questions, explore these with graphical summaries and interpret your findings. Your report will also need to be well-presented:

Think about how your report might be structured (eg. Summary, Introduction, Results, Conclusion).
Do your figures have captions?
Are your figures legible?
Are you using sub-headings effectively?

It is expected that your report will construct and communicate an interesting story in 4 - 6 paragraphs (ish). To do this, you should be a ‘bad’ scientist and explore the data until you find something that you think is interesting or that you can use to address the marking criteria. When preparing your report always think “is your report something that you would be proud to show your friends?”, “would your family be interested in the conclusions you made?” and “would they find it easy to read?”

Marking criteria

0.5 mark - Visualisation – At least 2 plots.
0.5 mark - Communication – Context around results.
0.5 mark - Communication – Insightful context around results.
0.5 mark - Presentation – No extra R output, use of headings.
0.5 mark - Innovation – Try at least one of captions, table of contents, embedded numbers in text, a plot that isn’t a bar-plot, boxplot or histogram etc.

Lab instructions

4.1 Domain knowledge

The Framingham Heart Study is a long term prospective study of the etiology of cardiovascular disease. The data we will be investigating is a subset of the data collected.

Who collects this data and how is it reported?

Resource

Is there anything unusual about the data I have given you?
How are missing values recorded, and why might they occur?

Resource
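
Once the data are loaded, one possible way to check how missing values are recorded is to count the NA entries in each column. The sketch below uses a made-up data frame: SYSBP is a column name used later in this lab, chol is a hypothetical name, and all values are invented.

```r
# Made-up data; NA is R's marker for a missing value.
toy <- data.frame(
  SYSBP = c(120, NA, 135, 150),
  chol  = c(NA, NA, 210, 180)   # hypothetical column, for illustration only
)

# Number of missing values in each column.
colSums(is.na(toy))

# Many summaries return NA unless told to drop the missing values first.
mean(toy$SYSBP)
mean(toy$SYSBP, na.rm = TRUE)
```

If a file codes missing values in some other way (for example as blank strings or a sentinel number), they will not show up in this count, so it is worth comparing the result against the data dictionary.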

4.2 Example questions

Are people that had heart attacks older than those that didn’t?
Do men have higher blood pressure than women?
Do men have more heart attacks than women?

4.3 Load data

Either read the data directly from the website (http://www.maths.usyd.edu.au/u/ellisp/AMED3002/data/frmgham.csv), or download the file and use the function setwd() to set your working directory to the folder where it is stored.
Check the size of your data. Think about what the number of rows actually means.

## Loading data directly from the web
heartData <- read.csv("http://www.maths.usyd.edu.au/u/ellisp/AMED3002/data/frmgham.csv")

## If the data file is in your working directory, you can instead use
heartData <- read.csv("frmgham.csv")

## size of the data set
dim(heartData)
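
The number of rows need not equal the number of people if some participants contribute more than one row. A minimal sketch of how one might check this, assuming the data contain a participant identifier column (called id here, which is a hypothetical name; check the data dictionary for the real one) and using made-up values:

```r
# Made-up data in which some participants appear in several rows.
toy <- data.frame(
  id    = c(1, 1, 1, 2, 3, 3),   # hypothetical identifier column
  visit = c(1, 2, 3, 1, 1, 2)    # hypothetical visit number
)

nrow(toy)                  # number of rows
length(unique(toy$id))     # number of distinct participants
table(table(toy$id))       # how many participants have 1, 2 or 3 rows
```

If the two counts differ, questions such as "how many people had a heart attack?" need to be answered per participant rather than per row.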

How does R classify your R object heartData?

class(heartData)

How does R classify each of the variables in your R object heartData?

str(heartData)

You may like to change how R classifies a variable, for example by turning a numeric code into a labelled factor. This step becomes more important later in the course.

sex <- heartData$SEX
sexC <- as.character(sex)
sexF <- factor(sexC, levels = c(1, 2), labels = c("Men", "Women"))
class(sex)
class(sexC)
class(sexF)

4.4 Explore variables

Select a single variable and explore it using what you have learned from MATH1X05 or DATA1X01.

Produce numerical summaries

head(heartData$SYSBP)
summary(heartData$SYSBP)
table(heartData$SEX)
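
To compare groups rather than summarise the whole sample, the same summaries can be computed within each level of a factor. This is a minimal sketch on made-up values, reusing the coding from the factor example above (1 = Men, 2 = Women); it is one possible approach, not the only one.

```r
# Made-up data; SEX uses the coding 1 = Men, 2 = Women.
toy <- data.frame(
  SEX   = c(1, 1, 2, 2, 2),
  SYSBP = c(130, 140, 120, 125, 115)
)
toy$sexF <- factor(toy$SEX, levels = c(1, 2), labels = c("Men", "Women"))

# Mean of SYSBP within each group.
groupMeans <- tapply(toy$SYSBP, toy$sexF, mean)
groupMeans

# The formula interface gives the same kind of summary as a data frame.
aggregate(SYSBP ~ sexF, data = toy, FUN = median)
```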

Try using some ‘base R’ graphics.

hist(heartData$SYSBP, probability = TRUE)
boxplot(heartData$SYSBP)

Customise the plots.

hist(heartData$SYSBP, freq = FALSE, main = "Histogram", ylab = "Density", col = "green")
boxplot(heartData$SYSBP, horizontal = TRUE, col = "red")

Looking at the relationship between systolic blood pressure (SYSBP) and sex.

boxplot(SYSBP ~ SEX, data = heartData)

Try this using ggplot.

library(tidyverse)
## Graphical summary of the variable 'SYSBP'
ggplot(heartData, aes(x = 1, y = SYSBP)) + geom_boxplot()

## Relationship between SYSBP and Sex.
ggplot(heartData, aes(x = factor(SEX), y = SYSBP)) + geom_boxplot()

## Adding some colors
## SEX is stored as a number, so wrap it in factor() to get discrete colours
ggplot(heartData, aes(x = factor(SEX), y = SYSBP, fill = factor(SEX))) + geom_boxplot()

ggplot(heartData, aes(x = factor(SEX), y = SYSBP, col = factor(SEX))) + geom_boxplot()
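
The codes 1 and 2 on the axis are hard to read. A minimal sketch, on made-up values, of plotting with a labelled factor (using the coding 1 = Men, 2 = Women from the factor example earlier) so that the axis and legend show names rather than codes:

```r
library(ggplot2)

# Made-up data; SEX uses the coding 1 = Men, 2 = Women.
toy <- data.frame(
  SEX   = c(1, 1, 1, 2, 2, 2),
  SYSBP = c(128, 135, 142, 118, 124, 131)
)
toy$sexF <- factor(toy$SEX, levels = c(1, 2), labels = c("Men", "Women"))

# Map the labelled factor to both the x axis and the fill colour.
p <- ggplot(toy, aes(x = sexF, y = SYSBP, fill = sexF)) +
  geom_boxplot() +
  labs(x = "Sex", y = "SYSBP", fill = "Sex")
p
```

The same idea should carry over to heartData by adding the labelled factor as a new column before plotting.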

(Extension) Generate a violin plot. Does it take longer to plot?

ggplot(heartData, aes(x = factor(SEX), y = SYSBP)) + geom_violin()

4.5 Rmarkdown report

Provide your data analysis code and a summary of your findings in a reproducible report (e.g. an Rmarkdown report).
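
One way to keep extra R output out of the report, as the marking criteria ask, is to set chunk options once in a setup chunk near the top of the .Rmd file. The following is a minimal sketch that assumes the knitr package is installed; the particular choice of options is a suggestion, not a requirement of the lab.

```r
# Intended to sit in a setup chunk near the top of the .Rmd file.
library(knitr)

opts_chunk$set(
  echo    = TRUE,    # keep the analysis code visible in the report
  message = FALSE,   # hide package start-up messages
  warning = FALSE    # hide warnings from the rendered output
)
```

Options set this way apply to every later chunk, and an individual chunk can still override them in its own header.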

AMED3002 Week 1

Data visualisation