Aims:
to practice statistical thinking on graphical summaries
to produce a data analytic report in Rmarkdown
to become familiar with ggplots
Complete Sections 2 and 3 of the Guide to R at home. Review materials from your introductory statistics course.
For the exercises in this lab it is expected that you will work in groups as it will expose you to different views, ideas and opinions. However, please make an effort to complete your own report by typing your own code and (most importantly) expressing your own conclusions and interpretations.
“The greatest value of a picture is when it forces us to notice what we never expected to see.”-John W. Tukey
The link for this week's Google Docs and Jamboard (digital whiteboard) is here.
At the beginning of this lab break up into groups of four or five. We will allocate zoom breakout rooms for those attending remotely. This group work is designed to generate discussion and provide an opportunity to practise statistical thinking and communicating statistical concepts. There are no “correct” answers. As a group discuss and brainstorm the following:
This data is taken from https://dasl.datadescription.com/datafile/cereals/ and
can be downloaded from https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/Cereal.csv.
One of the variables, mfr, represents the manufacturer of the cereal, where A = American Home Food Products, G = General Mills, K = Kelloggs, N = Nabisco, P = Post, Q = Quaker Oats, R = Ralston Purina.
Please note that the code provided should be used as a guide only, not as a complete solution.
library(ggplot2)
library(tidyr)
## Reading in the data
cereal = read.csv("https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/Cereal.csv")
# cereal = read.csv('Cereals.csv')
## Setup rownames
rownames(cereal) = cereal[, 1]
## Looking at the start of the data
head(cereal)
## `colnames(cereal)` and `rownames(cereal)` provide information about the
## names of the columns and rows in the dataset.
colnames(cereal)
summary(cereal[, "sugars"])
## redefine '-1' as NA
cereal[cereal == -1] = NA
summary(cereal[, "sugars"])
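After recoding, it can help to check how many values in each column are now missing. The following is a minimal sketch on a made-up toy data frame (the names and values in `toy` are assumptions for illustration, not the real cereal data); the same `colSums(is.na(...))` call could be tried on `cereal`.

```r
# Toy data frame standing in for the cereal data (all values are made up)
toy <- data.frame(
  name   = c("A", "B", "C", "D"),
  sugars = c(6, -1, 3, 12),
  carbo  = c(-1, 15, 21, 14),
  potass = c(280, -1, 95, -1),
  stringsAsFactors = FALSE
)

# Recode the -1 placeholder as NA, in the same way as the lab code
toy[toy == -1] <- NA

# Count the missing values in each column
naCounts <- colSums(is.na(toy))
naCounts
```

Columns with many missing values are worth keeping in mind when interpreting summaries that use `na.rm = TRUE`.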
The filter() function can be used to isolate certain rows, and the select() function certain columns.
## We can use the tidyverse with pipes.
library(tidyverse)
cereal %>%
filter(carbo==min(carbo, na.rm = TRUE)) %>%
select(name)
cereal %>%
filter(carbo==max(carbo, na.rm = TRUE)) %>%
select(name)
## You could also do this in base R
cereal[cereal[,"carbo"] == min(cereal[,"carbo"], na.rm = TRUE), "name"]
## Or
which.max(cereal[,"carbo"])
cereal[which.max(cereal[,"carbo"]),1]
cereal[which.min(cereal[,"carbo"]), "name"]
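The approaches above do not always return the same thing. As a small sketch on made-up data (the `toy` data frame is hypothetical), `which.max()` returns only the first row that attains the maximum, while a comparison with `==` returns every tied row. In base R, a logical index that contains `NA` also produces `NA` entries, so wrapping the comparison in `which()` is one way to drop them.

```r
# Toy data with a tie for the maximum and one missing value (values are made up)
toy <- data.frame(
  name  = c("A", "B", "C", "D"),
  carbo = c(21, 15, 21, NA),
  stringsAsFactors = FALSE
)

# which.max() gives the position of the FIRST maximum only
firstMax <- toy[which.max(toy$carbo), "name"]

# Comparing with == keeps all ties; which() removes the NA comparison
allMax <- toy[which(toy$carbo == max(toy$carbo, na.rm = TRUE)), "name"]

# Without which(), the row with a missing carbo value shows up as NA
withNA <- toy[toy$carbo == max(toy$carbo, na.rm = TRUE), "name"]

firstMax
allMax
withNA
```

If ties matter for your question, the `==` form (or `filter()`, which drops rows where the condition is `NA`) may be the safer choice.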
Q3: Which cereal has the highest fiber content?
Q4: Examine the distribution of carbohydrates across the different cereal brands. What does this tell us? What are some striking features?
ggplot(cereal, aes(x = mfr, y = carbo, fill = mfr)) + geom_boxplot()
## We can do this by grouping by manufacturer and then summarising the sugars variable with mean.
cereal %>%
group_by(mfr) %>%
summarise(meanSugars = mean(sugars, na.rm = TRUE))
## We could also do this in base R using tapply
tapply(cereal[,"sugars"], cereal[,"mfr"], mean, na.rm = TRUE)
## Let's make a crazy plot. What does each line do?
cereal %>%
group_by(mfr) %>%
summarise(meanSugars = mean(sugars, na.rm = TRUE), medianSugars = median(sugars, na.rm = TRUE)) %>%
pivot_longer(cols=c("meanSugars", "medianSugars"), names_to="summary", values_to="sugars") %>%
ggplot(aes(x=mfr, y=sugars, colour=summary, group=summary)) + geom_point() + geom_line()
Note: When using ggplot it is important here to use factor(shelf) instead of shelf. Check the data class of each variable.
cereal %>%
filter(shelf == 2) %>%
select(name)
## Or
cereal[cereal[,"shelf"]==2, "name"]
class(cereal[,"shelf"])
class(factor(cereal[,"shelf"]))
table(factor(cereal[,"shelf"]))
## Make a boxplot with sugars on the y-axis and shelf on the x-axis by yourself. Don't forget to use factor()
E1: A variable named ‘rating’ was calculated by Consumer Reports. Which nutrients are most correlated with ‘rating’? Cereals on the middle shelf in supermarkets tended to have the lowest ratings. Why?
E2: Sugar is a major ingredient in many breakfast cereals. Is there a difference between children’s and adult cereals?
E3: You can examine your data interactively in R. Here we show a scatter plot of rating against fiber. You can hover over the dots to show the name of the cereal.
library(plotly)
p <- cereal %>%
ggplot(aes(x = fiber, y = rating, name = name)) + geom_point()
ggplotly(p)
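For E1, one possible starting point is to compute the correlation between each nutrient and rating and then sort by absolute size. This is only a sketch on a made-up toy data frame (the column values are invented and the choice of columns is an assumption); `use = "pairwise.complete.obs"` is one way to handle the missing values created when -1 was recoded as NA.

```r
# Toy data standing in for a few numeric cereal columns (values are made up)
toy <- data.frame(
  rating = c(68, 34, 59, 29, 40, 53),
  sugars = c(6, 8, 5, 14, 10, NA),
  fiber  = c(10, 2, 9, 0, 1, 4),
  fat    = c(1, 5, 1, 2, 2, 0)
)

# Every column except rating is treated as a nutrient here
nutrients <- setdiff(names(toy), "rating")

# Pearson correlation of each nutrient with rating, ignoring missing pairs
corWithRating <- sapply(nutrients, function(v) {
  cor(toy[[v]], toy$rating, use = "pairwise.complete.obs")
})

# Order from strongest to weakest association, regardless of sign
sortedCor <- corWithRating[order(abs(corWithRating), decreasing = TRUE)]
sortedCor
```

A correlation summarises only the linear association, so it is worth pairing each number with a scatter plot before drawing conclusions.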
Only this section needs to be included in your Module 1 Lab Report to be handed in at the end of Week 3.
While I expect that you will explore the Framingham data, I am not opposed to you submitting a report on a dataset that you are incredibly engaged with.
Report guidelines
There are no hard and fast guidelines to the final content of your submitted lab reports. For this lab you will be assessed on your ability to generate statistical questions, explore these with graphical summaries and interpret your findings. Your report will also need to be well-presented:
It is expected that your report will construct and communicate an interesting story in 4 - 6 paragraphs (ish). To do this, you should be a ‘bad’ scientist and explore the data until you find something that you think is interesting, or, can use to address the marking criteria. When preparing your report always think “is your report something that you would be proud to show your friends?”, “would your family be interested in the conclusions you made?” and “would they find it easy to read?”
Marking criteria
Lab instructions
Continue with the Framingham data from week 1. If you are comfortable, feel free to:
The Framingham Heart Study is a long-term prospective study of the etiology of cardiovascular disease. We will be investigating a subset of the data collected.
Set your working directory with setwd() to where your data is stored, or download the data directly from the website (https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/frmgham.csv).
## Loading data directly from the web
heartData = read.csv("https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/frmgham.csv")
## If your data file is in the same folder, you can also use code
heartData <- read.csv("frmgham.csv")
## size of the data set
dim(heartData)
## What is the class of heartData?
class(heartData)
## What is the structure of heartData?
str(heartData)
sex <- heartData$SEX
sexC <- as.character(sex)
sexF <- factor(sexC, levels = c(1, 2), labels = c("Men", "Women"))
class(sex)
class(sexC)
class(sexF)
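To see why the labelled factor is useful, compare a frequency table of the raw codes with one of the factor. This is a small sketch on a made-up vector of codes (`sexCodes` is hypothetical); the coding 1 = Men, 2 = Women follows the code above.

```r
# Made-up SEX codes for five people
sexCodes <- c(1, 2, 2, 1, 2)

# Attach readable labels to the numeric codes
sexLabelled <- factor(sexCodes, levels = c(1, 2), labels = c("Men", "Women"))

# The raw codes tabulate as 1 and 2; the factor tabulates with its labels
rawCounts <- table(sexCodes)
labelledCounts <- table(sexLabelled)

rawCounts
labelledCounts
```

The same labels carry through to axis text and legends when the factor is used in a plot.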
Select a single variable and explore it using what you have learned from MATH1X05 or DATA1X01.
head(heartData$SYSBP)
summary(heartData$SYSBP)
table(heartData$SEX)
hist(heartData$SYSBP, probability = TRUE)
boxplot(heartData$SYSBP)
## With freq = FALSE the y-axis shows density rather than probability
hist(heartData$SYSBP, freq = FALSE, main = "Histogram", ylab = "Density", col = "green")
boxplot(heartData$SYSBP, horizontal = TRUE, col = "red")
boxplot(SYSBP ~ SEX, data = heartData)
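A side-by-side boxplot can be backed up with numerical summaries for each group. The sketch below uses `tapply()` on a made-up toy data frame (the values in `toy` are invented for illustration); the same calls could be tried on `heartData$SYSBP` and `heartData$SEX`.

```r
# Toy data standing in for SEX and SYSBP (values are made up)
toy <- data.frame(
  SEX   = c(1, 1, 1, 2, 2, 2),
  SYSBP = c(120, 135, 150, 110, 128, NA)
)

# Median and interquartile range of SYSBP within each SEX group,
# ignoring missing values
groupMedians <- tapply(toy$SYSBP, toy$SEX, median, na.rm = TRUE)
groupIQRs    <- tapply(toy$SYSBP, toy$SEX, IQR, na.rm = TRUE)

groupMedians
groupIQRs
```

The median and IQR correspond to the centre line and box height of each boxplot, which makes them a natural pair to report alongside the figure.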
library(tidyverse)
## Graphical summary of the variable 'SYSBP'
ggplot(heartData, aes(x = 1, y = SYSBP)) + geom_boxplot()
## Relationship between SYSBP and Sex.
ggplot(heartData, aes(x = factor(SEX), y = SYSBP)) + geom_boxplot()
## Adding some colors
## SEX is stored as a number, so wrap it in factor() to get discrete colours
ggplot(heartData, aes(x = factor(SEX), y = SYSBP, fill = factor(SEX))) + geom_boxplot()
ggplot(heartData, aes(x = factor(SEX), y = SYSBP, col = factor(SEX))) + geom_boxplot()
ggplot(heartData, aes(x = factor(SEX), y = SYSBP)) + geom_violin()
Provide your data analytics code and summarise your findings in a reproducible report (e.g. an Rmarkdown report).