Aims: 
 to practice statistical thinking on graphical summaries 
 to produce a data analytic report in Rmarkdown
 to become familiar with ggplots


1 Preparation for lab

Complete Sections 2 and 3 of the Guide to R at home. Review materials from your introductory statistics course.

For the exercises in this lab it is expected that you will work in groups as it will expose you to different views, ideas and opinions. However, please make an effort to complete your own report by typing your own code and (most importantly) expressing your own conclusions and interpretations.

“The greatest value of a picture is when it forces us to notice what we never expected to see.”-John W. Tukey

2 Group work

The link for this weeks google docs and Jamboard (digital whiteboard) is here.

At the beginning of this lab break up into groups of four or five. We will allocate zoom breakout rooms for those attending remotely. This group work is designed to generate discussion and provide an opportunity to practise statistical thinking and communicating statistical concepts. There are no “correct” answers. As a group discuss and brainstorm the following:

2.1 Communicating research

Share and discuss the 100 word summary you wrote in the break:

  • Was the summary aimed at a particular market?
  • Was it too technical?
  • Is it accurate and comprehensive? What doesn’t it say?
  • What was the hardest part of the process of researching and writing the summary?
  • What would you do next time?

2.2 Surveys

  1. Design a survey of 4 to 6 questions that you could give to the class. This survey should allow you to answer a health related question.
  2. What types of variables will your questions generate? (Numerical, categorical, continuous, discrete, ordinal, nominal, etc.)
  3. Discuss:
    • Are there questions that are worded to influence the outcome of the response?
    • Is there anything that might confound the answers to some of the questions?
    • Are there any limitations of the survey that will limit its interpretation?
    • How would the way you distribute this survey affect the reponses?
  4. A representative for your group should post your questions on the Canvas Discussion board. We will select some questions out of the most liked posts for use in a course survey.

3 Data exercises

3.1 Cereal

This data is taken from https://dasl.datadescription.com/datafile/cereals/ and can be downloaded from https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/Cereal.csv. One of the variable mfr represents the manufacturer of cereal where A = American Home Food Products, G = General Mills, K = Kelloggs, N = Nabisco, P = Post, Q = Quaker Oats, R = Ralston Purina.

Please note that the code provided should be used as a guide only and not complete solutions.

  • Getting started: Read the data from the website. Check the size of your data. Think about what the number of rows actually means.
library(ggplot2)
library(tidyr)
## Reading in the data
cereal = read.csv("https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/Cereal.csv")
#cereal = read.csv("Cereals.csv")

## Setup rownames
rownames(cereal) = cereal[,1]

## Looking at the start of the data
head(cereal)

## `colnames(cereal)` and `rownames(cereal)` provides information about the names of the columns and rows in the dataset.
colnames(cereal)
  • Q1: Produce some basic summary statistics the nutrient “sugar”. Why is the minimum value “-1” ? What is the correct average value for the nutrient “sugar”?
summary(cereal[,"sugars"])

## redefine "-1" as NA
cereal[cereal==-1] = NA

summary(cereal[,"sugars"])
  • Q2: Pick a category and find which cereals are the best and worst in a particular category. The filter() function can be used to isolate certain rows and the select() function certain columns.
## We can use the tidyverse with pipes.
library(tidyverse)
cereal %>%
  filter(carbo==min(carbo, na.rm = TRUE)) %>%
  select(name)

cereal %>%
  filter(carbo==max(carbo, na.rm = TRUE)) %>%
  select(name)

## You could also do this in base R
cereal[cereal[,"carbo"] == min(cereal[,"carbo"], na.rm = TRUE), "name"]
## Or
which.max(cereal[,"carbo"])
cereal[which.max(cereal[,"carbo"]),1]
cereal[which.min(cereal[,"carbo"]), "name"]
  • Q3: Which cereal has the highest fiber content?

  • Q4: Examine the distribution of carbohydrates across the different cereal brands. What does this tell us? What are some striking features?

ggplot(cereal, aes(x=mfr, y = carbo, fill=mfr)) + geom_boxplot()
  • Q5: Which manufacture tends to have cereal with very high sugar content? Is there a difference between using the mean or the median? Can you use graphical summaries to visualise the result?
## We can do this in by grouping by manufacturer and then summarising the sugar variable with mean.

cereal %>%
  group_by(mfr) %>%
  summarise(meanSugars = mean(sugars, na.rm = TRUE))

## We could also do this in base R using tapply
tapply(cereal[,"sugars"], cereal[,"mfr"], mean, na.rm = TRUE)

## Let's make a crazy plot. What does each line do?

cereal %>%
  group_by(mfr) %>%
  summarise(meanSugars = mean(sugars, na.rm = TRUE), medianSugars = median(sugars, na.rm = TRUE)) %>%
  pivot_longer(cols=c("meanSugars", "medianSugars"), names_to="summary", values_to="sugars") %>%
  ggplot(aes(x=mfr, y=sugars, colour=summary, group=summary)) + geom_point() + geom_line()
  • Q6: For marketing reasons, the kids cereals are generally located on the second shelf. You wouldn’t put Froot_Loops on the top shelf out of kids’ reach would you? Are there any key nutritional differences between childrens’ and adult cereals?

Note: When using ggplot it is important here to use factor(shelf) instead of shelf. Check out the data class for each of the varialble.

cereal %>%
  filter(shelf == 2) %>%
  select(name)

## Or

cereal[cereal[,"shelf"]==2, "name"]

class(cereal[,"shelf"])
class(factor(cereal[,"shelf"]))
table(factor(cereal[,"shelf"]))


## Make a boxplot with sugars on the y-axis and shelf on the x-axis by yourself. Don't forget to use factor()

3.2 Cereal (extension)

  • E1: A variable named ‘rating` was calculated by Consumer Reports. Which nutrients are most correlated with the variable ’rating’? Cereals on the middle shelf in supermarkets tended to have the lowest ratings. Why?

  • E2: Sugar is a major ingredient in many breakfast cereals. Is there a difference between children’s and adult cereals?

  • E3: To interactively examine your data in R. We show a scatter plot between rating and fiber. You can hover over the dots to show the name of the cereal.

library(plotly)
p <- cereal %>%
  ggplot(aes(x=fiber, y=rating, name=name)) +geom_point() 
ggplotly(p)

4 Lab report

Only this section needs to be included in your Module 1 Lab Report to be handed in at the end of Week 3.

While I expect that you should explore the Framingham data, I am not opposed to you submiting a report on a dataset that you are incredibly engaged with.

Report guidelines

There are no hard and fast guidelines to the final content of your submitted lab reports. For this lab you will be assessed on your ability to generate statistical questions, explore these with graphical summaries and interpret your findings. Your report will also need to be well-presented:

  • Think about how your report might be structured (eg. Summary, Introduction, Results, Conclusion).
  • Do your figures have captions?
  • Are your figures legible?
  • Are you using sub-headings effectively?

It is expected that your report will construct and communicate an interesting story in 4 - 6 paragraphs (ish). To do this, you should be a ‘bad’ scientist and explore the data until you find something that you think is interesting, or, can use to address the marking criteria. When preparing your report always think “is your report something that you would be proud to show your friends?”, “would your family be interested in the conclusions you made?” and “would they find it easy to read?”

Marking criteria

  • 0.5 mark - Visualisation – At least 2 plots.
  • 0.5 mark - Communication – Context around results.
  • 0.5 mark - Communication – Insightful context around results.
  • 0.5 mark - Presentation – No extra R output, use of headings.
  • 0.5 mark - Innovation – Try at least one of captions, table of contents, embedded numbers in text, a plot that isn’t a bar-plot, boxplot or histogram etc.

Lab instructions

4.1 Week 2

Continue with the framingham data from week 1. If you are comfortable, feel free to:

  • Formally test your hypotheses with t-tests or chi-square tests.
  • Write a report on a different dataset.
  • Condense or streamline your report to communicate your findings to a journal, the government or the general public.

4.2 Domain knowledge

The Framingham Heart Study is a long term prospective study of the etiology of cardiovascular disease. We will be investigating is a subset of the data collected.

  • Who collects this data and how is is reported?

Resource

  • Is there anything unusual about the data I have given you?
  • How are missing values recorded, and why might they occur?

Resource

4.3 Example questions?

  • Are people that had heart attacks older than those that didn’t?
  • Do men have higher blood pressure than women ?
  • Do men have more heart attacks than women ?

4.4 Load data

## Loading data directly from the web
heartData = read.csv("https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/frmgham.csv")

## If your data file is in the same folder, you can also use code
heartData <- read.csv("frmgham.csv")

## size of the data set
dim(heartData)
  • How does R classify your R object heartData?
class(heartData)
  • How does R classify each of the variables in your R object heartData?
str(heartData)
  • You may like to redefine the classification of a variable differently. This component is more critical later in the course.
sex <- heartData$SEX
sexC <- as.character(sex)
sexF <- factor(sexC, levels=c(1,2), labels=c("Men","Women"))
class(sex)
class(sexC)
class(sexF)

4.5 Explore variables

Select a univariate variable and explore using what you have learn from MATH1X05 or DATA1X01.

  • Produce numerical summaries
head(heartData$SYSBP)
summary(heartData$SYSBP)
table(heartData$SEX)
  • Try using some ‘base R’ graphics.
hist(heartData$SYSBP, prob=T)
boxplot(heartData$SYSBP)
  • Customise the plots.
hist(heartData$SYSBP, freq=FALSE, main="Histogram", ylab="Probabilities", col="green")
boxplot(heartData$SYSBP, horizontal=TRUE, col="red")
  • Looking at relationship between Age and Sex.
boxplot(SYSBP ~ SEX, data=heartData)
  • Try this using ggplot.
library(tidyverse)
## Graphical summary of the variable "Age"
ggplot(heartData, aes(x=1, y=SYSBP)) + geom_boxplot()

## Relatoinship between Age and Sex.
ggplot(heartData, aes(x=factor(SEX), y=SYSBP)) + geom_boxplot()

## Adding some colors
ggplot(heartData, aes(x=factor(SEX), y=SYSBP, fill=SEX)) + geom_boxplot()

ggplot(heartData, aes(x=factor(SEX), y=SYSBP, col=SEX)) + geom_boxplot()
  • (Extension) Generate a violin plot, does it take longer to plot ?
ggplot(heartData, aes(x=factor(SEX), y=SYSBP)) + geom_violin()

4.6 Rmarkdown report

Provide your data analytics code as well as summarising your findings in reproducible report (e.g. Rmarkdown report).