Aim

  • This is a self-study guide to R.
  • It allows you to consolidate your learning from the labs by exploring new R commands through one simple data set, mtcars.
  • Sections marked * are not examinable in MATH1005.
  • Sections marked ** are for DATA1901, and for students who want to extend themselves.


Introduction to R/RStudio

R is an incredibly powerful open source language for statistical analysis which we will run through RStudio. If you have never coded before, it may seem hard at first, but it will soon become easier!


0.1 Find RStudio (on campus)

  • RStudio is available in all university labs, as a workspace on Ed Discussion Board, and on BYOdevice.

  • However, we also recommend that you install both R and RStudio onto your own computer.


0.2 Install R and RStudio (at home)

R and RStudio are separate packages. First install R, and then install RStudio as the user interface.

0.2.1 Install R

  • Go to R Project.

  • Download the relevant version: Linux, Mac or Windows.

  • When you are finished, you should see an icon on your desktop with a large capital ‘R’.


0.2.2 Install RStudio

  • Go to RStudio.

  • Download the installer for your platform.

  • When you are finished, you should see an RStudio icon on your desktop.


0.3 Open RStudio & use it as a calculator

  • Click on the large RStudio icon.

  • Go to the Console (LHS at the bottom, where you see the cursor >).

  • Copy the following commands into the bottom Console. After each command, press the Enter/Return key on your keyboard.
1+ exp(3) + sin(0.5)
x=c(1,2,3)
x^2
sum(x)
sum(x^3)
  • Alternatively, you can copy the commands into the top Script Window. Highlight all the commands and press the Run button. Note that the output appears in the bottom Console.


0.4 Experiment with data in R

Many data sets are already loaded into R. They are good to experiment with, and are used in the R quizzes.

  • For example, mtcars. To view the data, simply type its name.
mtcars

  • Clearly this is not recommended for large data sets! Instead, look at the first or last rows using the head or tail functions.
head(mtcars)
tail(mtcars, n=3)
  • You can find out about the data by using help().
help(mtcars)


0.5 Get organised: course folder

It’s very important to set up a neat file management system for the semester. This is best practice, and will make your life easier!

  • Create a course folder on your Desktop. eg DATA1001files.


0.6 Knit a given RMarkdown file

RMarkdown is a clever type of file which saves all your R code and comments in one place for easy editing.

  • You have been given the file LabDemo.Rmd on your Labs page on Canvas.

  • Download the file LabDemo.Rmd, and store it in your new course folder DATA1001files.

  • Double-click on the file and notice how it automatically opens in the top LHS window of RStudio.

  • The information at the top is called the YAML, which you can edit: eg in author: "xxx", replace xxx with your name.

  • Render the file using Knit.

  • This will create LabDemo.html, which you can open in a browser. This becomes your final report.

  • You can create new RMarkdown files by duplicating LabDemo.Rmd and renaming it Project1.Rmd etc.

  • To customise your RMarkdown file, see RMarkdown Cheat Sheet


0.7 Neat workflow

  • There is a handy way to streamline your workflow, so that the output of the .Rmd opens next to your input in the RStudio console. This makes it very easy to edit and see your results.
  1. Open Preferences

  2. Select R Markdown

  3. Select ‘Viewer Pane’

  4. Click ‘Apply’

  • Now knit your LabDemo.Rmd file and see what happens - neat!

  • This will remain the setup for next time you open RStudio.


0.8 Import data into R

There are many ways to import data into R.


0.8.1 From the internet

You can import data directly from the web.

  • If you right click on the datasets in Canvas, you’ll get the URL. For example, see Lab3 Road+.
road = read.csv("http://www.maths.usyd.edu.au/u/UG/JM/DATA1001/r/2018S1/data/AllFatalities.csv")
  • This is an easy method, which we will use in the labs. However, it doesn’t work if your data doesn’t have a URL (eg when finding your own data for a Project).


0.8.2 From your folder

  • Download a data set from the Canvas Lab page, for example AllFatalities.csv from Lab3.

  • Put the data file into the DATA1001files folder where your .Rmd file is located.

  • Read the data into RStudio.
# Given file.csv
road = read.csv("AllFatalities.csv")


0.8.3 From a data subfolder

  • Longer-term, when you have more data files, it can be useful to create a subfolder called data within your DATA1001files folder.

  • Now you can store all your data inside data and then read your data straight into R.
# Given file.csv
road = read.csv("data/AllFatalities.csv")


0.8.4 Using file.choose() and working directories

  • It is generally best practice to store your data near your .Rmd file.
  • However, you can also find a data file, by asking RStudio to browse the files on your computer. This will give you the file pathway (ie where the file is stored).
  • The working directory is where R is pointing, ie where it is drawing files from. You can change it, to point to that particular folder.
file.choose()  # choose your file, and then R gives you the pathway to that file
getwd()  # this gives the current working directory
setwd("...") # set the working directory to the folder where your file is saved.
data = read.csv("2016Fatalities.csv")  # read the data straight into R from that folder


0.8.5 Note about Excel files

  • Note: If your file is in Excel format, like PBS2015.xlsx, then you first need to install the readxl package. This data is found here.
# Given file.xlsx
install.packages("readxl")  # You only need to do this once in RStudio
library(readxl)  # You need to do this each time
data = read_excel("data/PBS2015.xlsx")


0.9 Install packages

What to do if a package is missing?

  • Suppose you try to knit your .Rmd, only to face the error below:

Error in library(somepackage) : there is no package called ‘somepackage’

  • No need to fret! All we need to do is install the package somepackage from CRAN, which is a repository for R packages.

  • What is an R package? Good question. You can think of an R package like an app on the app store: it contains sets of functions that give R extra functionality. R was originally designed to be used very much like a calculator (very basic!). To be able to knit reports and more, we need new packages that extend R beyond the basic packages downloaded with it.

  • CRAN is very much like the App store - but for R!

  • So how do we access CRAN? Right through the command line!

  • In the console in R, you simply need to type

install.packages("somepackage")

  • R then automatically searches CRAN and downloads that package. You can then try knitting your report again! Sometimes more than one package will be missing - if so, just repeat the steps above until they are all installed.

(Note, of course, somepackage is being used as a placeholder here! Replace it with the name of the actual package that is missing.)

In summary:

  • If you want to use a certain package, install it with the following code. This is a one-off step. It will then appear in the list of packages in the bottom LHS window.
install.packages("somepackage")
  • Then every time you start a new .Rmd document, or a new session in R, you need to load the package using library.
library("somepackage")


0.10 Data Wrangling *

Data preparation is an essential part of statistical analysis and can be very time-consuming. It can involve cleaning, tidying, cleansing, scrubbing, reshaping, reforming, splitting and combining. It must be performed carefully and transparently.

  • The aim is to change Messy (or Raw or Dirty) Data into Clean Data, which is correctly and consistently formatted and ready for analysis. This can involve removing redundant or useless information, standardising information (like calendar dates), separating or combining columns, and dealing with warnings.

  • Simple data can be cleaned in Excel. Get rid of any extra formatting, so that the data looks like:

ID (if applicable) Variable1 Variable2 Variable3
1 14 25 34.4
2 15 23 19.7
  • More complex cleaning can be done with a package like tidyverse. Install the package as a one-off command.
install.packages("tidyverse")
  • Each new session of RStudio, you will need to load the package.
library("tidyverse")
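Once loaded, tidyverse lets you chain cleaning steps together. Here is a minimal sketch using mtcars as stand-in data (the column names cleaned, car and wt_kg are illustrative, not from the course material):

```r
# A sketch of tidyverse-style wrangling on mtcars
library(tidyverse)

cleaned = mtcars %>%
  rownames_to_column("car") %>%     # keep the car names as a proper column
  filter(!is.na(mpg)) %>%           # drop rows with missing mpg (none here)
  mutate(wt_kg = wt * 453.6) %>%    # convert weight from 1000 lbs to kg
  select(car, mpg, cyl, wt_kg)      # keep only the columns we need

head(cleaned)
```

Each %>% ("pipe") passes the data on to the next cleaning step, so the whole recipe reads top to bottom.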



1 Structure of data

We will use mtcars to illustrate the structure of data.


1.1 Classify variables

  • Recall there are 2 main types of variables: qualitative and quantitative, which R calls Factor and num.

  • View the data.

mtcars
head(mtcars)
tail(mtcars, n=3)
help(mtcars)
  • Calculate the dimensions of the data set.
dim(mtcars)

This means that there are 32 rows (the types of cars) and 11 variables (properties of the cars).


  • List the names of the variables.
names(mtcars)


  • See how R has classified the variables by viewing the structure of the data.
str(mtcars)

where ‘num’ is a quantitative (numerical) variable and ‘Factor’ is a qualitative (categorical) variable.


1.2 Isolate a variable

  • Choose one variable from the data frame by using DataName$VariableName and store the result in a vector.
mpg= mtcars$mpg

Note that RStudio has code completion, so it will auto-predict your commands. When you type mtcars$, the names of all the variables will come up.


  • See the classification of 1 variable.
class(mpg)
str(mpg)


  • See the length of 1 variable.
length(mpg)


  • Calculate the sum of a (quantitative) variable.
sum(mpg)
  • If any command returns the answer NA, it usually means that you need to specify what to do with missing values. See Resource on how to solve this.
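To see this in action, here is a small made-up vector with a missing value, and the na.rm argument that tells R to remove NAs first:

```r
x = c(1, 2, NA, 4)
sum(x)                # NA, because one value is missing
sum(x, na.rm = TRUE)  # removes the NA first, so this returns 7
mean(x, na.rm = TRUE) # likewise for the mean
```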


  • Sort the data in increasing order.
sort(mpg)


  • Work out how to sort the data in decreasing order.
sort(mpg, decreasing = T)


  • Sum the 5 lowest values of the variable.
sum(sort(mpg)[1:5])


1.3 Select subset

  • Pick the 1st and 5th elements of the vector mpg
mpg[1]
mpg[5]
mpg[c(1,5)]
mtcars$mpg[c(1,5)]
mtcars[1,1]  
mtcars[5,1]   #mpg is 1st column.


1.4 Change classification

  • You may not agree with R’s initial classification, and want to change it.
str(mtcars)
  • For example, note that the number of carburetors carb is classified as num. Reclassify carb as a factor.
class(mtcars$carb)
carbF = factor(mtcars$carb)
class(carbF)
  • To change from a factor to a num:
ageCanVote = factor(setNames(c(16, 18, 18, "Unknown"), c("Austria", "Australia", "Afghanistan", "Zambia")))
as.numeric(ageCanVote)  # This is a mistake, as it converts to the rank of the factor level
as.numeric(as.character(ageCanVote))  # This converts properly
## Warning: NAs introduced by coercion



2 Graphical Summaries

The graphical summary must match up with the type of variable(s).

Variable Type of summary
1 Qualitative (Single) Barplot
1 Quantitative Histogram or (Single) Boxplot
2 Qualitative Double Barplot, Mosaicplot
2 Quantitative Scatterplot
1 Quantitative, 1 Qualitative Double Boxplot


2.1 Barplot

A barplot is used for qualitative variables.

  • Which variables are qualitative in mtcars?

  • Produce a single barplot of the gears.

barplot(mtcars$gear)

Notice this is not useful! This is because R has classified gear as a quantitative variable.

  • Instead, first summarise gears into a table as follows.
# Produce frequency table
table(mtcars$gear)
## 
##  3  4  5 
## 15 12  5
# produce barplot (from frequency table)
counts = table(mtcars$gear)
barplot(counts)
  • Now customise the barplot.
help(barplot)
barplot(counts, names.arg=c("3 Gears","4 Gears","5 Gears"),col="lightblue")
  • Make the names of bars perpendicular to axis.
par(las=2) 
barplot(counts, names.arg=c("3 Gears","4 Gears","5 Gears"),col="lightblue")
  • Now consider 2 qualitative variables: gear and cyl. Produce a double barplot by faceting or filtering the barplot of gear by cyl.
counts1 = table(mtcars$cyl, mtcars$gear)
barplot(counts1,names.arg=c("3 Gears","4 Gears","5 Gears"),col=c("lightblue","lightgreen","lightyellow"),legend = rownames(counts1))
barplot(counts1,names.arg=c("3 Gears","4 Gears","5 Gears"),col=c("lightblue","lightgreen","lightyellow"),legend = c("4 cyl","6 cyl","8 cyl"))
barplot(counts1, names.arg = c("3 Gears", "4 Gears", "5 Gears"), col = c("lightblue", 
    "lightgreen", "lightyellow"), legend = c("4 cyl", "6 cyl", "8 cyl"), beside = TRUE)

What do you learn?


2.2 Histogram

A histogram is used for quantitative variables.

  • Which variables are quantitative in mtcars?

  • Produce a histogram of the weights.

hist(mtcars$wt)
  • Produce a probability histogram of the weights. What is the difference? Why do the 2 histograms have an identical shape here?
hist(mtcars$wt,freq=F)
  • In this course, we will consider the probability histogram (2nd one) which means that the total area of the histogram is 1.
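You can check the "total area 1" claim directly: hist() invisibly returns the bar heights and break points, so the bar areas (height times width) can be summed.

```r
# Compute the histogram without plotting it
h = hist(mtcars$wt, plot = FALSE)
sum(h$density * diff(h$breaks))  # bar areas sum to 1
```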

  • What does the histogram tell us about weights of the cars?

  • To see what customisations for hist are available, use help.

help(hist)
  • Try this histogram of the weights.
hist(mtcars$wt, br=seq(0,6,by=0.5), freq=F, col="lightgreen",xlab="weight of cars (1000 lbs)",main="Histogram of Weights of Cars US 1973-74")
  • Experiment with the customisations to see how they work.

  • Try a histogram of the gross horsepower. What do you learn?

  • Produce a histogram of mpg. What do you learn?


2.3 Boxplot

A boxplot is another summary for quantitative variables.

  • Produce a single boxplot for the weights of cars.
boxplot(mtcars$wt)
  • Produce a horizontal boxplot.
boxplot(mtcars$wt, horizontal = T)

Which orientation do you prefer?

Compare to the histogram of weights above: what different features are highlighted by a boxplot?

  • Customise your boxplot.
help(boxplot)
  • Now consider dividing the weights (quantitative) by cylinders (qualitative). Produce a double boxplot, by filtering or faceting wt by cyl.
boxplot(mtcars$wt~mtcars$cyl)
boxplot(mtcars$wt~mtcars$cyl, names=c("4 cyl", "6 cyl","8 cyl"),ylab="Weight of cars (1000 lbs)")

What do you learn about the weights of cars? Car Weight - see page 6

  • Try faceting the weights by another qualitative variable.


2.4 Mosaicplot

A mosaic plot visualises 2 qualitative variables.

counts2 = table(mtcars$gear, mtcars$am)  # Produces contingency table
plot(counts2) # Produces mosaic plot from contingency table


2.5 Scatterplot

A scatter plot considers the relationship between 2 quantitative variables.

  • What does this plot tell us?
plot(mtcars$wt,mtcars$mpg)
plot(mtcars$wt,mtcars$mpg, xlab="Car Weight", ylab="Miles per Gallon",col="darkred",pch=19)
  • Add a line of best fit.
abline(lm(mtcars$mpg~mtcars$wt))

We’ll explore this further in Section 5.

  • We can compare pairs of multiple quantitative variables using plot or pairs.
plot(mtcars)
pairs(~mpg+disp+drat+wt,data=mtcars)
pairs(mtcars)

Which variables seem to be related?




2.6 ggplot **

All the previous (base R) plots can be done in ggplot, allowing much greater customisation.

  • First, install the ggplot2 package into RStudio. This is a one-off command.
install.packages("ggplot2")
  • Each time you open RStudio, load the ggplot2 package.
library(ggplot2)


2.6.1 Barplot

  • Produce a single barplot.
# mtcars data
p = ggplot(mtcars, aes(x=factor(cyl)))  # Select the mtcars data, and focus on cyl as factor (qualitative) on x axis
p + geom_bar()  # Produce a barplot
# mpg data
p1 = ggplot(mpg, aes(class)) # Select the mpg data, and focus on class as x axis
p1 + geom_bar() # (1) Produce a barplot
p1 + geom_bar(aes(weight = displ)) # (2) Produce a barplot with counts weighted by the displacement variable
  • Produce a double barplot.
p1 + geom_bar(aes(fill = drv)) # (3) Produce a (double) barplot divided by the drive variable
p1 +
 geom_bar(aes(fill = drv), position = position_stack(reverse = TRUE)) +
 coord_flip() +
 theme(legend.position = "top")  # (4) Customising (3)



2.6.2 Histogram

Produce a histogram of the weights.

p = ggplot(data=mtcars, aes(x=wt))  # Select the mtcars data, and focus on wt (quantitative) on x axis
p + geom_histogram(aes(y=..density..), binwidth=0.5) +
  xlab('Weight') + ylab('Density') # Produce a histogram with x and y axis labels

Using aes(y=..density..) turns a raw histogram into a probability histogram.


2.6.3 Boxplot

  • Produce a single boxplot.
p = ggplot(data=mtcars, aes(x="", y=wt))  # Select the mtcars data, and focus on wt (quantitative) on y axis (with no filtering on x axis)
p + geom_boxplot() # Produce a boxplot
  • Produce a double boxplot.
p = ggplot(data=mtcars, aes(x=factor(cyl),y=wt))  # Select the mtcars data, and focus on wt (quantitative) on y axis and cyl (qualitative) on x axis
p + geom_boxplot() # Produce a boxplot, of wt filtered by cyl


  • geom_jitter plots the points with a small amount of random noise. We use it to investigate over-plotting in small data sets.
p = ggplot(data=mtcars, aes(x=factor(cyl),y=wt))
p + geom_boxplot() + geom_jitter()


What do the following customisations do?

p = ggplot(data=mtcars, aes(x=factor(cyl),y=wt))
p + geom_boxplot() + coord_flip()
p + geom_boxplot(notch = TRUE)
p + geom_boxplot(outlier.colour = "green", outlier.size = 3)


Now consider dividing the weights by another qualitative variable.

p + geom_boxplot(aes(fill = factor(cyl)))
p + geom_boxplot(aes(fill = factor(am)))


2.6.4 Scatterplot

  • Produce a scatterplot.
p = ggplot(mtcars, aes(wt, mpg)) # Select the mtcars data, and focus on wt (quantitative) on x axis and mpg (quantiative) on y axis
p + geom_point()  # Produce a scatterplot of mpg vs wt
p + geom_point(aes(colour = factor(cyl)))  # Colour the points by cyl (qualitative)


  • Notice what these customisations do.
p + geom_point(aes(shape = factor(cyl)))  # Shape the points by cyl (qualitative)
p + geom_point(aes(shape = factor(cyl))) + scale_shape(solid = FALSE) 
p + geom_point(aes(size = qsec))  # Size the points by qsec (quantitative)
p + geom_point(aes(colour = cyl)) + scale_colour_gradient(low = "blue")  # Colour the points by cyl (quantitative)


2.7 plotly **

plotly (plot.ly) is a separate package that works alongside ggplot2 and allows you to create offline interactive plots.

  • Produce this scatter plot. Notice what happens when you hover over the data points.
library("plotly")
p1 = plot_ly(mtcars, x = ~mpg, y = ~wt, type="scatter") 
p1
  • Now add another variable.
p2 = plot_ly(mtcars, x = ~mpg, y = ~wt, symbol = ~cyl, type="scatter") 
p2
p3 = plot_ly(mtcars, x = ~mpg, y = ~wt, color = ~cyl, type="scatter") 
p3
p4 = plot_ly(mtcars, x = ~mpg, y = ~wt, color = ~factor(cyl), type="scatter") 
p4
  • Try a boxplot.
library("plotly")
p5 = plot_ly(mtcars, x = ~mpg, y = ~factor(cyl), type='box') 
p5


3 Numerical Summaries

The numerical summary must match up with the type of variable(s).

Variable Type of summary
1 Qualitative frequency table, most common category
1 Quantitative mean, median, SD, IQR etc
2 Qualitative contingency table
2 Quantitative correlation, linear model
1 Quantitative, 1 Qualitative mean, median, SD, IQR etc across categories


We’ll keep working with the mtcars dataset. So again remind yourself what it is like.

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
dim(mtcars)
## [1] 32 11
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
help(mtcars)


3.1 Frequency and contingency tables

  • A frequency table summarises 1 qualitative variable.
table(mtcars$gear)
  • A contingency table summarises 2 qualitative variable.
table(mtcars$gear, mtcars$am)


3.2 Mean and median

  • The mean and median measure centre for quantitative variables.
mean(mtcars$gear)
median(mtcars$gear)


3.3 Standard deviation (SD)

  • The standard deviation measures spread for quantitative variables.

  • The sd command calculates the sample standard deviation. The squared SD is the variance.

sd(mtcars$gear)
sd(mtcars$gear)^2
var(mtcars$gear)
  • The popsd command calculates the population standard deviation, but requires the multicon package.
#install.packages("multicon")   # a package only needs to be installed once.
library(multicon)
popsd(mtcars$gear)

# Longer way
N = length(mtcars$gear)
sd(mtcars$gear)*sqrt((N-1)/N)
  • Note: When we model a population by the box model [Section 8 and following], we will require the population SD.


3.4 Interquartile range (IQR)

  • The quickest method is to use IQR.
IQR(mtcars$gear)
  • There are lots of different methods of working out the quartiles. We can use the quantile command, and then work out the IQR.
quantile(mtcars$gear)
quantile(mtcars$gear)[4]-quantile(mtcars$gear)[2]

What is the 50% quantile equivalent to?
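One way to check the answer in R: the 50% quantile is the median (for the default quartile convention).

```r
quantile(mtcars$gear)[3]   # the 50% quantile
median(mtcars$gear)        # the same value
quantile(mtcars$gear, 0.5) == median(mtcars$gear)
```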


3.5 Summary

  • The numerical summaries for quantitative variables can all be produced with summary, which is an expanded version of the 5 number summary. Sometimes these values will vary from those given by quantile, as there are different conventions for calculating quartiles.
summary(mtcars$mpg)  # 1 variable
summary(mtcars)  # All variables
  • We can consider a subset of the data. Here, we choose the mpg of cars which have a weight greater than or equal to 3.
summary(mtcars$mpg[mtcars$wt>=3])

Here we take all the rows of the mtcars dataset with a specific number of cylinders, e.g. 6.

mtcars[ which(mtcars$cyl==6), ]


3.6 Linear Correlation

  • We can consider the linear correlation between pairs of quantitative variables.
cor(mtcars)



4 Normal model


4.1 Plot curve

  • To plot a normal curve, we use dnorm for a range of \(x\) values.
par(mfrow=c(1,2)) # plots 2 plots on same row
curve(dnorm(x),-1,1)
curve(dnorm(x),-5,5)

When does the Normal curve start to disappear?


4.2 Plot area under curve (Ext)

  • To plot an area under the curve, we use polygon.
curve(dnorm(x), xlim = c(-2, 2),ylab="",axes=FALSE)
abline(h = 0)
sequence = seq(-2, 0.8, 0.1)
polygon(x = c(sequence,0.8,-2),
        y = c(dnorm(c(sequence)),0,0),
        col = "purple")
axis(1, at=c(-2,-1,0,0.8,1,2), pos=0)


4.3 Add normal to a histogram

hist(mtcars$wt,prob=T)
m=mean(mtcars$wt)
s=sd(mtcars$wt)
curve(dnorm(x, mean=m, sd=s), 
col="darkgreen", lwd=2, add=TRUE)


4.4 Probabilities: Standard normal

4.4.1 Lower tail

To calculate P(Z < 0.8) use pnorm.

pnorm(0.8)


4.4.2 Upper tail

To calculate P(Z > 0.8):

pnorm(0.8,lower.tail=F)
1-pnorm(0.8)


4.4.3 Interval

To calculate P(0.3 < Z < 0.7):

pnorm(0.7)-pnorm(0.3)


4.5 Probabilities: General normal

If \(X \sim N(5, 4^2)\), then to calculate P(X < 8), we use pnorm(8,mean,sd).

pnorm(8,mean=5,sd=4)
pnorm(8,5,4)


5 Linear model

5.1 Scatter plot

  • To produce a scatter plot, use plot(<x variable>, <y variable>).
plot(mtcars$mpg,mtcars$wt)
plot(mtcars$mpg,mtcars$wt, xlab="mpg", ylab="weight", main="Scatter plot of weight vs mpg")


5.2 Correlation coefficient

  • To calculate the correlation coefficient, use cor(<x variable>, <y variable>).
cor(mtcars$mpg,mtcars$wt)


5.3 Regression line

  • To produce the regression line, use lm(<y variable>~<x variable>).
lm(mtcars$wt~mtcars$mpg)
#Or
L=lm(mtcars$wt~mtcars$mpg)
L$coeff
#Or
lm(wt~mpg,data=mtcars)
  • Now you can add this line to the scatter plot.
plot(mtcars$mpg,mtcars$wt)
abline(L)


5.4 Residual plot

  • The residual plot helps to detect any pattern not captured by the linear model.

    • If it is a random scatter, then the linear model seems appropriate.
    • If it shows a pattern (eg quadratic), then another model should be considered.
  • To produce a residual plot:

plot(mtcars$mpg,L$residuals, xlab = "mpg", ylab = "Residuals")
abline(h = 0, col = "blue")

6 Non-Linear Model *

If a linear model is not appropriate, as evidenced by either the scatter plot (non-linear) or residual plot (non-random), we can either try a transformation or directly fit a more complex model.

6.1 Transformation

x = mtcars$wt
y = mtcars$mpg
plot(x,y)
cor(x,y)
plot(log(x),log(y))  # scatterplot of log of data
cor(log(x),log(y))

6.2 Fit non-linear model (Ext)

plot(x,y)
# Set seed for fixing 1 simulation
set.seed(1)
# Simulate the x values
simx=seq(min(x),max(x),length.out=length(x))
# Model the y values y=12+ke^(-x), with a starting value of k 
model = nls(y~12+k*exp(-simx), start=list(k=500))
summary(model)
# Add model to scatter plot
lines(simx,predict(model),lty=2,col="green",lwd=3)



7 Simulate chance

7.1 Sample (Coin)

  • To simulate 1 fair coin toss
set.seed(1)  # ensures the same simulation, "1" can be whatever you choose
sample(c("H", "T"), 1, replace = T)
  • To simulate 1 biased coin toss
set.seed(1)
sample(c("H", "T"), 1, replace = T,prob = c(0.9, 0.1))
  • To simulate 10 fair coin tosses, and summarise chances in a table:
set.seed(1)
sample1 = sample(c("H", "T"), 10,replace=T)
table(sample1)/10
  • Alternatively, you could use
set.seed(1)
replicate(10, sample(c("H", "T"),1,replace=T))
  • Alternatively, you could use the Binomial *,
set.seed(1)
rbinom(n = 10, 1, 0.5)

7.2 Sample (Dice)

  • To simulate 1 fair die toss
set.seed(1)
sample(c(1:6), 1, replace = T)
  • To simulate 1 biased die toss
set.seed(1)
sample(c(1:6), 1, replace = T,prob = c(0.1,0.2,0.3,0.2,0.1,0.1))
  • To simulate 10 fair dice tosses, and summarise chances in a table:
set.seed(1)
sample(c(1:6), 10, replace = T)
table(sample(c(1:6), 10, replace = T))/10
barplot(table(sample(c(1:6), 10, replace = T))/10, xlab="dice face", ylab="chance")
  • Alternatively, you could use
set.seed(1)
table(replicate(10, sample(c(1:6), 1, replace = T)))

7.3 Function **

To toss a die 10 times, and find the chance of getting a 6:

tossdie = function() {
    rolls = sample(1:6, size = 10, replace = TRUE)
    condition = sum(rolls == 6) > 0
    return(condition)
}
# Simulate 10000 times
sim = replicate(10000, tossdie())
sum(sim)/length(sim)


7.4 Binomial Model *

Suppose we have 10 trials with P(event) = 0.3.

The chance of getting 4 events:

dbinom(4,10,0.3)

The chance of getting less than or equal to 5 events:

pbinom(5,10,0.3)



8 Simulate chance variability (box model)

Suppose we have a “box” (or population) containing 0 and 1, from which we draw 10 times, creating a sample. Formal theory gives the expected value (EV) and standard error (SE) of both the Sample Sum (the sum of the sample values) and the Sample Mean (the mean of the sample values).

Focus EV SE
Sample Sum number of draws * mean of box \(\sqrt{\mbox{number of draws}}\) * SD of box
Sample Mean mean of box \(\frac{SD}{\sqrt{\mbox{number of draws}}}\)


8.1 Draw the box model in R (Ext)

  • What matters most is being able to draw a box model by hand. But below is some code for those who are interested.

  • Install the package DiagrammeR. If you have trouble installing, you can just use it on the uni labs.

  • Draw a box model in R.

library("DiagrammeR")

grViz(" 
  digraph CFA {

  # All
  node [fontname = Helvetica, fontcolor = White]

    # Box
    node [shape = box, style=filled, color=SteelBlue4, width=2 label='0  1'][fillcolor = SteelBlue4]
    a ; 

    # Sample
    node [shape = circle, style=filled, color=IndianRed, width=0.5, label='-'][fillcolor = IndianRed]
    b ; 

    # Draws
    a -> b [fontname = Helvetica,label = '10 draws',fontsize=8]
    b -> a  [fontname = Helvetica,color=grey,arrowsize = 0.5]
  }

")
  • Define the box (mean, SD).
box=c(0,1) # Define the box
n = 10  # Define the sample size (number of draws)

# mean & SD of box
meanbox=mean(box)
library(multicon)
sdbox=popsd(box)

# Other calculations of the population SD.

# (1) RMS formula
sqrt(mean((box - mean(box))^2))

# (2) short cut for binary box
(max(box) - min(box)) * sqrt(length(box[box == max(box)]) * length(box[box == 
    min(box)])/length(box)^2)


8.2 Sample Sum (EV & SE)

# EV & SE of Sample Sum
n * meanbox  # EV
sqrt(n) * sdbox  # SE


8.3 Simulate the Sample Sum

set.seed(1)
totals = replicate(20, sum(sample(box, 10, rep = T)))
table(totals)
mean(totals)
sd(totals)
hist(totals)

How do the simulations compare to the exact results?
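To answer this side by side, the sketch below recomputes the exact EV and SE from Sections 8.1-8.2 and places them next to the simulated mean and SD (the population SD is computed with the RMS formula, so no extra package is needed):

```r
box = c(0, 1)
n = 10
meanbox = mean(box)
sdbox = sqrt(mean((box - meanbox)^2))  # population SD via the RMS formula

set.seed(1)
totals = replicate(20, sum(sample(box, n, replace = TRUE)))

c(EV = n * meanbox, simulated_mean = mean(totals))
c(SE = sqrt(n) * sdbox, simulated_sd = sd(totals))
```

With only 20 replications the simulated values will be close to, but not exactly equal to, the EV and SE; try increasing 20 to 10000.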


8.4 Use Normal model for Sample Sum

  • What is the chance that the sample sum is between 4 and 5?

  • Draw the curve (Ext).

m = n * meanbox
s = sqrt(n) * sdbox
threshold1 = 4
threshold2 = 5
xlimits1=seq(round(-4*s+m, digit=-1),round(4*s+m, digit=-1),1)
curve(dnorm(x, mean=m, sd=s), m-4*s,m+4*s, col="indianred", lwd=2, xlab="", ylab="", main ="P(sum is between 4 and 5)",axes=F)
sequence <- seq(threshold1, threshold2, 0.1)
polygon(x = c(sequence,threshold2,threshold1),
        y = c(dnorm(c(sequence),m,s),0,0),
        col = "indianred")
axis(1,xlimits1,line=1,col="black",col.ticks="black",col.axis="black")
mtext("sum of box",1,line=1,at=-2,col="black")
axis(1,xlimits1,labels=round((xlimits1-m)/s,2),line=3.3,col="indianred1",col.ticks="indianred1",col.axis="indianred1")
mtext("Z score",1,line=3.3,at=-2,col="indianred1")
  • Find the chance.
m = n * meanbox  # Use EV as the mean of the Normal
s = sqrt(n) * sdbox   # Use SE as the SD of the Normal
pnorm(5, m, s)-pnorm(4, m, s)


8.5 Sample Mean (EV & SE)

# EV & SE of Sample Mean
meanbox  # EV
sdbox/sqrt(n)  # SE


8.6 Use Normal model for Sample Mean

  • What is the chance that the sample mean is between 0.4 and 0.6?

  • Find the chance.

m = meanbox  # Use EV as the mean of the Normal
s = sdbox/sqrt(n)  # Use SE as the SD of the Normal
pnorm(0.6, m, s)-pnorm(0.4, m, s)



9 Sample surveys

  • Simulate a survey of 25 people taken from a population of size 200 with 2 types of people in the proportion 1:3.

  • This is equivalent to taking 25 draws from a box containing 50 1’s and 150 0’s.

  • Note: A survey is technically without replacement, but here we will approximate by the box model which assumes replacement.
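You can see how good this approximation is by drawing the same sample both ways:

```r
# Compare sampling with and without replacement from the same box
box = c(rep(1, 50), rep(0, 150))  # 50 ones, 150 zeros
set.seed(1)
mean(sample(box, 25, replace = TRUE))   # box model (with replacement)
mean(sample(box, 25, replace = FALSE))  # actual survey (without replacement)
```

With only 25 draws from a box of 200 tickets, the two sample proportions behave very similarly, which is why the replacement assumption is reasonable here.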


9.1 Box model (assuming replacement)

  • Draw the model (Ext)
library("DiagrammeR")

grViz(" 
  digraph CFA {

  # All
  node [fontname = Helvetica, fontcolor = White]

    # Box
    node [shape = box, style=filled, color=SteelBlue4, width=2 label='50x1s  150x0s'][fillcolor = SteelBlue4]
    a ; 

    # Sample
    node [shape = circle, style=filled, color=IndianRed, width=0.5, label='-'][fillcolor = IndianRed]
    b ; 

    # Draws
    a -> b [fontname = Helvetica,label = '25 draws',fontsize=8]
    b -> a  [fontname = Helvetica,color=grey,arrowsize = 0.5]
  }

")
  • Define the box (mean, SD)
box=c(rep(1,50),rep(0,150))  # Define the box/population
n = 25 # Number of draws (sample size)

# mean & SD of box
meanbox = mean(box)
sdbox = popsd(box)
  • The Sample Sum = number of 1’s in the sample.
n*meanbox  # EV
sqrt(n)*sdbox  # SE
  • The Sample Mean = proportion of 1’s in the sample (EV & SE)
meanbox  # EV
sdbox/sqrt(n)  # SE


9.2 Bootstrapping a box model

  • A survey of 10 people is taken from a population of size 200 with 2 types of people, resulting in 4 0’s and 6 1’s.
  • In practice, we don’t normally know the proportions within the population, and so we can’t define the “box”.
  • So one approach is to “bootstrap”: we substitute the proportions of 0’s and 1’s in the sample for the 0’s and 1’s in the population (box).
  • This estimate is good when the sample is reasonably large.
# Bootstrap box model
boxest0 = 200*4/10
boxest1 = 200*6/10
box = c(rep(1, boxest1), rep(0, boxest0))
n = 10 # Size of sample (= number of draws)

# Mean & SD of box
meanbox = mean(box)  # population proportion of 1's
sdbox = popsd(box)

# EV & SE of Sample Mean (= sample proportion of 1's)
meanbox
sdbox/sqrt(n)
  • Note that the expected value of the sample proportion of 1’s (0.6) is equal to the population proportion of 1’s (0.6), as it should be.

  • Hence, we expect 60% 1s in the sample with a SE of around 15 percentage points.


9.3 Confidence Intervals for population proportion

  • A survey of 30 people is taken from a population of size 600 with 2 types of people in the proportion 1:2.

  • This is equivalent to taking 30 draws from a box containing 200 1’s and 400 0’s (assuming replacement).

  • Define the box (mean, SD)

box=c(rep(1,200),rep(0,400))  # Define the box/population
n = 30 # Number of draws (sample size)

# mean & SD of box
meanbox = mean(box)
sdbox = popsd(box)
  • An approximate 68% confidence interval (CI) for the population proportion is \[\mbox{sample proportion} \pm 1 \times \mbox{SE for sample proportion} \]
# EV & SE of sample proportion of 1's
ev = meanbox
se = sdbox/sqrt(n)

# 68% CI
c(ev - 1*se,ev + 1*se)
  • An approximate 95% confidence interval (CI) for the population percentage is \[\mbox{sample proportion} \pm 2 \times \mbox{SE for sample proportion} \]
# 95% CI
c(ev - 2*se,ev + 2*se)
  • An approximate 99.7% confidence interval (CI) for the population percentage is \[\mbox{sample proportion} \pm 3 \times \mbox{SE for sample proportion} \]
# 99.7% CI
c(ev - 3*se,ev + 3*se)
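For other confidence levels, the multiplier comes from qnorm. This sketch (not in the original) computes an approximate 90% CI for the same box; the population SD is computed directly, since popsd is not in base R.

```r
box = c(rep(1, 200), rep(0, 400))
n = 30

ev = mean(box)                            # population proportion, 1/3
popsd = sqrt(mean((box - mean(box))^2))   # population SD (divide by N, not N-1)
se = popsd / sqrt(n)

# A 90% CI leaves 5% in each tail, so the multiplier is qnorm(0.95), about 1.645
m = qnorm(0.95)
c(ev - m * se, ev + m * se)
```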



10 Test for a proportion (using simulation)

  • Suppose we have a sample of size \(n\) from a population with proportion \(p\) of a certain trait.

  • We want to test a null hypothesis for the value of \(p\), by modelling \(H_{0}\) by a box model (with “1” representing the trait) with \(n\) draws.
  • We simulate samples from the box model, and then compare our actual sample to these results - ie how common is our sample?


10.1 Simple balanced box

  • We want to test \(H_{0}\): \(p = 0.5\).

  • We can produce a picture of the box model modelling \(H_{0}\) (Ext).

library("DiagrammeR")
  
DiagrammeR::grViz(" 
digraph rmarkdown {
  
graph [fontsize = 16, fontname = Arial, nodesep = .1, ranksep = .8]
node [fontsize = 16, fontname = Arial, fontcolor = White]
edge [fontsize = 12, fontname = Arial, width = 2]

Box [shape=oval,style=filled, color=SteelBlue3,width=5, label='1    0']

Sample [shape=oval, style=filled, color=SteelBlue2, label='']

Box -> Sample [label='   n draws']

}
")
detach(package:DiagrammeR)
  • Now simulate draws from the box, and compare to your sample. Here, suppose that \(n=20\), and choose a simulation size of 100.
set.seed(1)

# Define box (modelling Ho)
box=c(0,1)

# Simulate 100 samples of size 20 from the box
totals = replicate(100, sum(sample(box, 20, rep = T)))
table(totals)
hist(totals)
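To complete the test, compare an observed count to the simulated totals. Suppose (hypothetically; this number is not from the guide) our sample of 20 contained 15 1's. The proportion of simulated totals at least as far from the EV (10) estimates the two-sided p-value.

```r
set.seed(1)
box = c(0, 1)

# Simulate 100 samples of size 20 under Ho: p = 0.5
totals = replicate(100, sum(sample(box, 20, replace = TRUE)))

obs = 15  # hypothetical observed number of 1's

# Estimated two-sided p-value: proportion of simulated totals as extreme as obs
mean(abs(totals - 10) >= abs(obs - 10))
```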


10.2 Unbalanced box

  • We want to test \(H_{0}\): \(p = 0.2\).

  • We can produce a picture of the box model modelling \(H_{0}\) (Ext).

library("DiagrammeR")
  
DiagrammeR::grViz(" 
digraph rmarkdown {
  
graph [fontsize = 16, fontname = Arial, nodesep = .1, ranksep = .8]
node [fontsize = 16, fontname = Arial, fontcolor = White]
edge [fontsize = 12, fontname = Arial, width = 2]

Box [shape=oval,style=filled, color=SteelBlue3,width=5, label='100p x 1    100(1-p) x 0']

Sample [shape=oval, style=filled, color=SteelBlue2, label='']

Box -> Sample [label='   n draws']

}
")
detach(package:DiagrammeR)
  • Now simulate draws from the box, and compare to your sample. Here, suppose that \(n=30\), and choose a simulation size of 1000
set.seed(1)

# Define box (modelling Ho)
box=c(0,1)

# Simulate 1000 samples of size 30 from the box
totals = replicate(1000, sum(sample(box, 30, prob=c(0.8,0.2), rep = T)))
table(totals)
hist(totals)
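Equivalently (a sketch, not part of the guide), since each sample sum under \(H_{0}\): \(p = 0.2\) is Binomial(30, 0.2), the totals can be simulated directly with rbinom.

```r
set.seed(1)

# 1000 simulated sample sums, each Binomial(n = 30, p = 0.2)
totals = rbinom(1000, size = 30, prob = 0.2)

table(totals)
hist(totals)

mean(totals)  # close to the EV: 30 * 0.2 = 6
```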



11 Tests for a mean

  • Suppose we have a sample of size \(n\) from a population with unknown \(\mu\).

  • We want to test a null hypothesis for the value of \(\mu\), by modelling \(H_{0}\) by a box model with \(n\) draws.



11.1 1 sample t-test

In 2013, the average miles per gallon (mpg) for cars sold in the US was 23.6. We want to see if this is a significant difference from the mtcars data.


11.1.1 H: Hypotheses

  • Formally, we write \(H_0: \mu=23.6\) vs \(H_1: \mu \neq 23.6\).

  • Have a quick look to see if \(H_0: \mu=23.6\) seems to fit with the data (eg right units and size etc).

mean(mtcars$mpg)  # sample mean
length(mtcars$mpg) # size of sample 
  • Now we model \(H_0: \mu=23.6\) by a box with population mean 23.6, giving the EV of the sample mean as 23.6. We compare this to the observed sample mean of 20.1.


11.1.2 A: Check the Assumptions

For a \(t\) test (or \(Z\) test), we need to check the assumption of normality.

  • First, look at the shape of the data.
hist(mtcars$mpg)
boxplot(mtcars$mpg)

Notice the boxplot looks symmetric, which is consistent with normality, while the histogram shows some slight right skew.


  • Next, try some more formal diagnostics.
qqnorm(mtcars$mpg)
shapiro.test(mtcars$mpg)

Check to see if the Q-Q Plot looks linear, as this indicates normality. The Shapiro-Wilk test tests the null hypothesis that the sample comes from a Normal distribution. Here the p-value is quite big (0.1229), hence we retain \(H_0\), which is consistent with normality.


  • Note: you can combine the 3 graphics in a 2x2 window, for easy comparison.
par(mfrow=c(2,2))
qqnorm(mtcars$mpg)
hist(mtcars$mpg)
boxplot(mtcars$mpg)


11.1.3 T: Calculate the Test Statistic

  • The formula for the \(t\)-test statistic is \(t_{obs} = \frac{\mbox{observed value - hypothesised value}}{\mbox{standard error}}\).

  • We use the SD of the data as an approximation of the SD of the population, so we can calculate the SE of the Sample Mean.

tobs = (mean(mtcars$mpg)-23.6)/(sd(mtcars$mpg)/sqrt(32))


  • The degrees of freedom of the \(T\) test statistic is \(\mbox{sample size} - 1\).
length(mtcars$mpg)-1


11.1.4 P: Calculate the p-value

  • The p-value is the probability of getting a value of \(t_{obs}\) or more extreme in either tail of the \(T\) distribution.
2*(1-pt(abs(tobs),31))
  • Note: the use of abs here allows for the test statistic to be in either tail.


11.1.5 C: Conclusion

  • We compare the p-value (0.002476) to the significance level (0.05), and so we reject the null hypothesis. This is the statistical conclusion.
  • We then write a context-specific conclusion on cars: the mean mpg of the older cars in mtcars appears to differ significantly from the 2013 US average of 23.6.


11.2 The speedy way!

  • We can do all this quickly in R!
t.test(mtcars$mpg, mu = 23.6)
  • Match up this output with the calculations above. Note it also gives us the 95% CI and the mean of the sample.


11.3 1 sample Z-Test (Ext)

  • The \(Z\)-Test is not in base R. This is because the \(Z\)-Test requires us to know the population variance \(\sigma^2\). Hence the \(t\)-Test is much more common.
  • However, we could create a function called z.test ourselves.
  • Assume the population variance of mtcars$mpg is the same as the sample variance (this of course is only a guess).
v = var(mtcars$mpg)
# This creates a function called z.test, with data, mu and var as the inputs, and zobs as the output.
z.test = function(data, mu, var){
   zobs = (mean(data) - mu) / (sqrt(var / length(data)))
   return(zobs)
}
# Run the function with inputs
z.test(mtcars$mpg,23.6,v)
  • Hence, the observed value of the \(Z\) test statistic is -3.293877: it is more than 3 standard deviations below the mean, so it will give a tiny p-value.
2*(1-pnorm(abs(-3.293877)))
  • Hence again we conclude that the mpg of current cars appears to have changed significantly from the US older cars in mtcars.



12 Tests for relationships


12.1 More t-tests

The choice of the t-test depends both on the design of the experiment and the nature of the data.

| Type | Context | Data |
|---|---|---|
| 1 sample t-test | Consider 1 population with unknown mean | 1 sample |
| 1 sample t-test (paired data) | Compare 2 dependent samples (eg before and after treatment) | 1 sample of differences |
| 2 sample t-test | Compare 2 populations with unknown means | 2 independent samples |
  • The t.test command in R allows you to perform different 1 and 2 sample t-tests, by specifying different arguments.
    • paired
    • var.equal
    • alternative - We use this to specify if we are doing a 1 sided or 2 sided test (based on hypotheses). We will mostly focus on the 2 sided alternative.
help(t.test)
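For example (a sketch, not from the original), a one-sided version of the earlier 1 sample t-test on mtcars uses the alternative argument:

```r
# Two-sided (default): H1: mu != 23.6
t.test(mtcars$mpg, mu = 23.6)

# One-sided: H1: mu < 23.6
t.test(mtcars$mpg, mu = 23.6, alternative = "less")
```

Here the one-sided p-value is half the two-sided one, since the sample mean lies below 23.6.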


12.1.1 Sleep data zzzzzzz

  • Let’s consider the sleep data provided in R.
help(sleep)
dim(sleep)
class(sleep)
head(sleep)
  • The dataset sleep represents the effect of 2 soporific drugs (“group”) on sleep time compared to a control (drug - no drug).


12.1.2 2 sample t-test

  • Initially we are going to ignore the ID column (which shows patients trialled both drugs) and assume that the 2 groups are independent, ie that drug1 was administered to 10 patients and drug2 was administered to 10 different patients.

  • H: Hypotheses

    • Let \(\mu_1\) = mean net sleep of study participants taking drug1, and \(\mu_2\) = mean net sleep of study participants taking drug2.

    • \(H_{0}\) : There is no difference between the 2 drugs: \(\mu_1=\mu_2\).

    • \(H_{1}\) : There is a difference between the 2 drugs: \(\mu_1 \neq \mu_2\).

  • Being a medical study we are going to consider a more conservative significance level of \(\alpha = 0.01\).

  • A: Assumptions We assume the 2 samples are independent, and the 2 populations are both Normally distributed with the same SD.

  • T & P: Test statistic and P-value

t.test(extra~group, data = sleep, var.equal = TRUE, conf.level = 0.99)
  • C: Conclusion
    • Statistical conclusion: The p-value (0.07919) is greater than 0.01. Furthermore, 0 is contained within the confidence interval. Therefore we retain our null hypothesis.

    • Scientific conclusion: The data seems consistent with no difference in sleep time between patients taking drug1 and drug2.


  • Note: If we had found the spreads of the 2 samples were significantly different from each other, we would have performed a 2 sample t-test for unequal variances (Welch 2 sample t-test), in which case we set the argument var.equal = FALSE.
t.test(extra~group, data = sleep, var.equal = FALSE, conf.level = 0.99)
  • In this case, we see that the result is similar to the equal-variance test. In fact many statisticians argue that you can almost always use Welch’s test, except when the sample size is really small.


12.1.3 1 sample t-test (paired data)

  • Looking at the description of the sleep dataset we can see that the data is actually NOT independent.

  • Sleep was measured in 10 patients who were given both drug1 and drug2 at different times. A crossover study is a typical experimental design that would generate this type of data. When we only have 2 treatments, we call this paired data.

  • Therefore we need to perform a 1 sample t-test for paired data: a “paired t-test”. This is the same as a 1 sample t-test on the difference between the two treatments (i.e. drug1 - drug2). Therefore the assumptions for this test are equivalent to those for a 1 sample t-test (ie the data must be approximately normally distributed).

sleep_diff <- sleep$extra[sleep$group == 1]-sleep$extra[sleep$group == 2]
qqnorm(sleep_diff)
qqline(sleep_diff)
boxplot(sleep_diff)
  • The normal quantile-quantile plot and the boxplot show that there may be an outlier in the dataset.

  • However, for the purpose of this analysis and keeping in mind that these tests are reasonably robust to non-normal data, we will go ahead and test the data.

t.test(extra~group, data = sleep, paired = T)
  • The p-value is <0.05 and zero is not contained within the confidence intervals. Therefore we reject the null hypothesis and conclude that there is a significant difference in extra sleep hours between the 2 drugs. Looking at the earlier boxplot, we see that drug2 is more effective in inducing extra sleep than drug1.

  • Note: this demonstrates how important it is to use the correct analysis as this conclusion is different to when the 2 samples were assumed to be independent.
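As a consistency check (not in the original), the paired test is identical to a 1 sample t-test on the differences computed earlier:

```r
sleep_diff <- sleep$extra[sleep$group == 1] - sleep$extra[sleep$group == 2]

# 1 sample t-test of Ho: mean difference = 0
t.test(sleep_diff, mu = 0)

# Same t statistic and p-value as the paired version:
t.test(extra ~ group, data = sleep, paired = TRUE)
```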


12.2 Chi-squared tests *

  • Chi-squared tests are used for qualitative data.

  • Qualitative data is divided into groups, like hair colour or gender. Even height and income can be analysed as qualitative data, if the quantitative data (eg 165cm, 170cm) is grouped into intervals (eg 150-160cm, 160-170cm, 170-180cm).

  • The Pearson’s chi-squared test is a very versatile test which compares the observed frequencies over the categories with the expected frequencies over the categories, under some model (null hypothesis).

    • Test the goodness of fit of a model for the categories of 1 variable. (This plays a similar role to the Shapiro-Wilk test for normality.)

    • Test for independence of 2 variables

    • Test for homogeneity of 2 variables

  • The test is done quickly in R using chisq.test.


12.2.1 Goodness of Fit Test (1 variable)

  • Two tomato varieties (dominant tall cut-leaf tomatoes and dwarf potato-leaf tomatoes) were genetically crossed, yielding 1611 offspring.

  • According to genetic theory, the 4 phenotypes should occur with an approximate ratio of 9:3:3:1.

  • Therefore we would expect:
    • 9/16 \(\times\) 1611 Tall cut-leaf plants = 906.1875
    • 3/16 \(\times\) 1611 Tall potato-leaf = 302.0625
    • 3/16 \(\times\) 1611 Dwarf cut-leaf = 302.0625
    • 1/16 \(\times\) 1611 Dwarf potato-leaf = 100.6875
  • H: Hypotheses
    • \(H_0\): \(p_1=9/16\); \(p_2=3/16\); \(p_3=3/16\); \(p_4=1/16\), where \(p_{i}\) = expected proportion in category \(i\).
    • \(H_1\): at least one proportion is not equal to the expected proportion.
  • Data

| Phenotype | Observed frequency | Expected frequency | O-E | (O-E)^2/E |
|---|---|---|---|---|
| Tall cut-leaf | 926 | 906.1875 | 19.8125 | 0.4332 |
| Tall potato-leaf | 288 | 302.0625 | -14.0625 | 0.6547 |
| Dwarf cut-leaf | 293 | 302.0625 | -9.0625 | 0.2719 |
| Dwarf potato-leaf | 104 | 100.6875 | 3.3125 | 0.1090 |
| Total | 1611 | 1611 | 0 | 1.4688 |
  • T,P&C

    • The test statistic is \(\chi^2\)=1.4688 with degrees of freedom 4-1 = 3. The p-value is 0.69.
pchisq(1.4688,3,lower.tail=F)

Note: we are only interested in the upper tail probability. As the p-value is 0.6895, we retain the null hypothesis and conclude that the model proposed by genetic theory fits the genetic outcome well.
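The table's calculation can be reproduced by hand in R (a sketch matching the columns above):

```r
observed = c(926, 288, 293, 104)        # observed frequencies
expected = 1611 * c(9, 3, 3, 1) / 16    # expected frequencies under Ho

# Chi-squared statistic: sum of (O - E)^2 / E
chisq = sum((observed - expected)^2 / expected)
chisq  # about 1.4688

# Upper tail p-value with 4 - 1 = 3 degrees of freedom
pchisq(chisq, df = 3, lower.tail = FALSE)  # about 0.69
```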

  • The speedy way
observedfreq = c(926,288,293,104)  # observed frequencies
expectedprop = c(9/16,3/16,3/16,1/16)  # expected proportions

chisq.test(x = observedfreq,
           p = expectedprop)


12.2.2 Test for independence (2 variables; 1 sample)

  • Is age independent of the desire to ride a bicycle?
  • A random sample of 395 people were surveyed. Each person was asked their interest in riding a bicycle (yes/no) and their age (age group). The data that resulted from the survey is summarized in the following table:
| Observed | 18-24 | 25-34 | 35-49 | 50-64 |
|---|---|---|---|---|
| Yes | 60 | 54 | 46 | 41 |
| No | 40 | 44 | 53 | 57 |
| Total | 100 | 98 | 99 | 98 |
  • The research question is: Is the desire to ride a bicycle dependent on the age group?

  • H: Hypotheses

    • \(H_0\) = Age is independent of desire to ride a bicycle.

    • \(H_1\) = Desire to ride a bicycle is dependent on age.

  • Data

Bicycle <- matrix(c(60,54,46,41,40,44,53,57), nrow = 2, ncol = 4, byrow = TRUE, dimnames = list(c("Yes", "No"), c("18-24", "25-34", "35-49", "50-64")))
Bicycle
chisq.test(Bicycle)
  • T&P
    • From the output, the test statistic is 8.0061.

    • The p-value is 0.04589, just less than 0.05, so there is evidence against \(H_0\): it appears the desire to ride a bicycle is dependent on age group.

  • Interpret further

    • We can look at the mosaic plot with shading to help interpret this.
mosaicplot(Bicycle, shade = TRUE, las=1)

Note: There is no shading (no standardised residuals exceed ±2), but we see that the boxes are slightly bigger than expected for the “yes” answer in the youngest group and slightly bigger than expected for the “no” answer in the oldest group. This suggests that cycling is more popular with younger people.


12.2.3 Test for Homogeneity (2 variables; 2 samples)

  • The test for homogeneity looks the same as the test for independence, but instead of testing whether 2 qualitative variables measured on 1 sample are independent, we test whether 2 samples have the same proportions across the categories, and so could come from the same population.
  • The head surgeon at the University of Sydney medical center was concerned that trainee (resident) physicians were applying blood transfusions at a different rate than attending physicians.

  • Therefore, she ordered a study of the 49 attending physicians and 71 residents. For each of the 120 physicians, the number of blood transfusions prescribed unnecessarily in a one-year period was recorded. The number of unnecessary blood transfusions was grouped into four groups: Frequently, Occasionally, Rarely, or Never.

  • The contingency table is given below:

| Physician | Frequent | Occasionally | Rarely | Never | Total |
|---|---|---|---|---|---|
| Attending | 2 | 3 | 31 | 13 | 49 |
| Resident | 15 | 28 | 23 | 5 | 71 |
| Total | 17 | 31 | 54 | 18 | 120 |
  • The research question is: Is the number of unnecessary blood transfusions distributed equally amongst attending physicians and resident physicians?

  • H
    • \(H_0\) = \(p_{resFreq}=p_{attFreq}\) and \(p_{resOcc}=p_{attOcc}\) and \(p_{resRar}=p_{attRar}\) and \(p_{resNev}=p_{attNev}\)

    • \(H_1\) = at least one of the proportions is not equal

  • Data,T,P&C

BloodT <- matrix(c(2,3,31,13,15,28,23,5), nrow = 2, ncol = 4, byrow = TRUE, dimnames = list(c("Attending", "Resident"), c("Frequent", "Occasionally", "Rarely", "Never")))
BloodT
chisq.test(BloodT)

Note: The p-value is very small, therefore we reject the null hypothesis and conclude that at least one of the proportions is not equal.

  • Interpret further

    • To get a “rough” idea of which of the proportions are not equal we can take a look at the mosaic plot.
    mosaicplot(BloodT, shade = TRUE, las=1)
    • Note: the mosaic plot also shows the standardised residuals which are calculated as follows: \(\frac{\mbox{Observed Frequency - Expected Frequency}}{\sqrt{\mbox{Expected Frequency}}}\)

    • Note: The Attending physicians have fewer counts than expected in the “Occasionally” group and the Resident (trainee) physicians have more than expected in the “Occasionally” group. Furthermore, the Attending physicians have more counts than expected in the “Never” group.


Mosaic Plot: Residuals

As a general rule of thumb:

  • If the residual is less than -2: the cell’s observed frequency is less than the expected frequency.

  • If the residual is greater than 2: the observed frequency is greater than the expected frequency.

  • The shade = TRUE argument demonstrates our rule of thumb using colours on the mosaic plot. The blue indicates more counts than expected (residual > 2) and the red indicates fewer counts than expected (residual < -2).

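The expected counts and standardised residuals behind the mosaic plot can be extracted from the chisq.test output (a sketch, not in the original):

```r
BloodT <- matrix(c(2, 3, 31, 13, 15, 28, 23, 5), nrow = 2, byrow = TRUE,
                 dimnames = list(c("Attending", "Resident"),
                                 c("Frequent", "Occasionally", "Rarely", "Never")))

res = chisq.test(BloodT)
res$expected   # expected frequencies under Ho
res$residuals  # standardised residuals: (O - E) / sqrt(E)
```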

12.3 Regression test *

  • Earlier, we considered a linear model.
lm(mtcars$wt~mtcars$mpg)
  • Now we use this output to perform a hypothesis test.

  • H
    • \(H_0: \beta_1=0\) [Or: There is not a linear trend.]
    • \(H_1: \beta_1 \neq 0\) [Or: There is a linear trend.]
  • A

  1. The residuals should be independent, normal, with constant variance (homoscedasticity). Check: Residual Plot
L = lm(mtcars$wt~mtcars$mpg)
plot(mtcars$mpg,L$residuals, xlab = "mpg", ylab = "Residuals")
abline(h = 0, col = "blue")
  2. The relationship between the dependent and independent variable should look linear. Check: Scatter Plot
plot(mtcars$mpg, mtcars$wt)
  • T,P&C
summary(L)

The test statistic for the slope is -9.559 with p-value 1.29e-10. Hence we strongly reject \(H_{0}\) and conclude that there is a strong linear trend of weight on mpg.



13 More tests for relationships (Diagnostics)

We can use hypothesis tests to test the assumptions of other hypothesis tests!


13.1 Test normality for 1+ populations: histogram, QQ-plot, Shapiro-Wilk Test

Here we illustrate using the car weights.

  1. Check the histogram.
hist(mtcars$wt, main = "Histogram of Car Weights", freq=F)
  • Note: The histogram is roughly normal (roughly balanced around one mode towards the centre).


  2. Check the QQ-plot.
  • A QQ-plot is another summary for quantitative variables; see help(qqnorm) for further information.
help(qqnorm)
qqnorm(mtcars$wt, main = "QQ-plot for Car weights")
qqline(mtcars$wt, col = "blue")
  • Note: The data is reasonably evenly scattered around the theoretical quantile-quantile line which suggests that the car weights are reasonably normally distributed.


  3. Test using the Shapiro-Wilk Test
  • Warning! The Shapiro-Wilk normality test is sensitive to sample size: small samples will nearly always pass normality tests. Therefore, it is recommended to use both visual inspection and a significance test to decide whether a dataset is normally distributed.

  • H:
    • \(H_{0}\) : The data is normally distributed.
    • \(H_{1}\) : The data is not normally distributed.
    • We hope to retain \(H_{0}\)!
  • T,P&C:

shapiro.test(mtcars$wt)

Note: The p-value is much greater than 0.05, so we retain \(H_{0}\): there is no evidence against normality.


13.2 Test equal variance of 2 populations (Boxplots; F-test; variance ratio)

  1. Compare the 2 boxplots.
boxplot(extra~group, data = sleep)
  • Note: It is clear from the boxplots that the variation in group 1 is less than in group 2.

  • Boxplots are a very useful tool as they provide an indication of whether the sample means are similar or different, whether the samples are normally distributed, and whether the variances are similar between the 2 samples.

  2. Test using the Variance Test (F-test)
  • H:

    • Let \(\sigma_1^2\) = variance of the group 1 drug.

    • Let \(\sigma_2^2\) = variance of the group 2 drug.

    • \(H_{0}\) : There is no difference: \(\sigma_1^2=\sigma_2^2\)

    • \(H_{1}\) : There is a difference: \(\sigma_1^2 \neq \sigma_2^2\)

    • We hope to retain \(H_{0}\)!

  • T,P&C

var.test(extra~group, data = sleep)

Note: From the variance test, it is clear that the p-value is much greater than 0.05. Therefore we retain the null hypothesis and conclude that the variances seem to be equal.

14 FAQs

14.1 Why am I getting weird variable names?

Unfortunately this can happen on Windows computers. You can fix it with the following code:

names(smoking)[1] = "Age"

This code looks at the column names, picks the first one, and reassigns it to "Age".

14.2 How do I move the legend?
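In base R graphics, the first argument to legend() sets its position. A sketch (the plot and legend labels are just placeholders):

```r
# Draw a plot, then place the legend with a position keyword
boxplot(mpg ~ am, data = mtcars, col = c("lightblue", "salmon"))
legend("topright", legend = c("automatic", "manual"),
       fill = c("lightblue", "salmon"))

# Other keywords: "topleft", "bottomright", "bottom", "center", ...
# or give x/y coordinates directly as the first two arguments
```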

14.3 How can I drop columns in R?

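A sketch using mtcars (the column choices are just for illustration): drop columns by negative index, by name, or with subset().

```r
# Drop the 2nd column by position
smaller = mtcars[ , -2]

# Drop columns by name
smaller = mtcars[ , !(names(mtcars) %in% c("cyl", "gear"))]

# Drop a single column with subset()
smaller = subset(mtcars, select = -gear)
```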

14.4 How can I read in 1 sheet from a multiple sheet Excel file?

library(gdata)  # read.xls comes from the gdata package
data = read.xls( "filename" , sheet = 3, header = TRUE)

14.5 How do I produce a residual plot with missing values in data?

library(ggplot2)
help(msleep)
plot(msleep$bodywt, msleep$brainwt)
L = lm(msleep$brainwt ~ msleep$bodywt)
summary(L)
abline(L)

# Correlation (with missing values)
length(msleep$bodywt)
length(msleep$brainwt)
cor(msleep$bodywt, msleep$brainwt, use = "complete.obs")


# Residual plot (with missing values)
length(msleep$bodywt)
length(L$residuals)  # shorter, since rows with NA are dropped
residuals1 = resid(lm(msleep$brainwt ~ msleep$bodywt, na.action=na.exclude))
length(residuals1)   # padded with NA to match the data length
plot(msleep$bodywt, residuals1)