R is an incredibly powerful open-source language for statistical analysis, which we will run through RStudio. If you have never coded before, it may seem hard at first, but it will soon become easier!
RStudio is available in all university labs, as a workspace on Ed Discussion Board, and on BYOdevice.
However, we also recommend that you install both R and RStudio onto your own computer.
R and RStudio are separate packages. First install R, and then install RStudio as the user interface.
Go to R Project.
Download the relevant version: Linux, Mac or Windows.
When you are finished, you should see an icon on your desktop with a large capital ‘R’.
Go to RStudio.
Download the installer for your platform.
When you are finished, you should see an RStudio icon on your desktop.
Click on the large RStudio icon.
Go to the console (bottom left, where you see the cursor >).
1+ exp(3) + sin(0.5)
x=c(1,2,3)
x^2
sum(x)
sum(x^3)
Run these commands in the console. Note that the output appears in the bottom Console.

Many data sets are already loaded into R. They are good to experiment with, and are used in the Quizzes.
One example is mtcars. To view the data, simply type its name.

mtcars

View the start or end of the data with the head or tail functions.

head(mtcars)
tail(mtcars, n=3)

Find out about the data using help().

help(mtcars)
It’s very important to set up a neat file management system for the semester. This is best practice, and will make your life easier!
Create a folder for this course called DATA1001files.

RMarkdown is a clever document/file which can save all your R code and comments in one place for easy editing.
You have been given the file LabDemo.Rmd
on your Labs page on Canvas.
Download the file LabDemo.Rmd, and store it in your new course folder DATA1001files.
The information at the top is called the YAML, which you can edit: eg in author: "xxx", replace xxx with your name.
Now press Knit.
This will create LabDemo.html
, which you can open in a browser. This becomes your final report.
You can create new RMarkdown files by duplicating LabDemo.Rmd
and renaming it Project1.Rmd
etc.
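If you prefer, this duplication can be done from the R console rather than the file browser; the file names below are just the ones used above, and the copy assumes LabDemo.Rmd sits in your current working directory.

```r
# Make a copy of LabDemo.Rmd under a new name
# (assumes LabDemo.Rmd is in the current working directory)
file.copy("LabDemo.Rmd", "Project1.Rmd")
```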
To customise your RMarkdown file, see the RMarkdown Cheat Sheet.

You can set RStudio so that the knitted output of a .Rmd opens next to your input in the RStudio console. This makes it very easy to edit and see your results. In RStudio, go to Tools > Global Options:

Select ‘R Markdown’
Select ‘Viewer Pane’
Click ‘Apply’

Now Knit your LabDemo.Rmd file and see what happens - neat!

There are many ways to import data into R.
You can import data directly from the web.
For example, import the road fatalities data:

road = read.csv("http://www.maths.usyd.edu.au/u/UG/JM/DATA1001/r/2018S1/data/AllFatalities.csv")
Download a data set from the Canvas Lab page, for example AllFatalities.csv
from Lab3.
Put the data file into the DATA1001files folder where your .Rmd file is located.
# Given file.csv
road = read.csv("AllFatalities.csv")
Alternatively, create a folder called data within your DATA1001files folder. Put the data file into data and then read it straight into R.

# Given file.csv
road = read.csv("data/AllFatalities.csv")
file.choose() # choose your file, and then R gives you the pathway to that file
getwd() # this gives the current working directory
setwd("...") # set the working directory to the folder where your file is saved.
data = read.csv("2016Fatalities.csv") # read the data straight into R from that folder
If your data is an Excel file, for example PBS2015.xlsx, then you need the package readxl to first be installed. This data is found here.

# Given file.xlsx
install.packages("readxl") # You only need to do this once in RStudio
library(readxl) # You need to do this each time
data = read_excel("data/PBS2015.xlsx")
What to do if a package is missing?
Error in library(somepackage) : there is no package called ‘somepackage’
No need to fret! All we need to do is install the package somepackage
from CRAN, which is a repository for R packages.
What is an R package? Good question. You can think of an R package like an app on the app store: it contains sets of functions that give R extra functionality. For example, R was originally designed to be used very much like a calculator (very basic!). To be able to knit reports, we need new packages to extend R's capabilities beyond the basic packages downloaded with R.
CRAN is very much like the App store - but for R!
So how do we access CRAN? Right through the command line!
In the console in R, you simply need to type
install.packages("somepackage")
(Note, of course, somepackage is being used as a placeholder here! Replace it with the name of the actual package that is missing.)
In summary:
install.packages("somepackage")
Then, in each session where you need it, load the package with library.

library("somepackage")
Data preparation is an essential part of statistical analysis and can be very time-consuming. It can involve cleaning or tidying, cleansing, scrubbing, reshaping, reforming, splitting and combining. It must be performed carefully and transparently.
The aim is to change Messy (or Raw or Dirty) Data into Clean Data, which is correctly and consistently formatted and ready for analysis. This can involve removing redundant or useless information, standardising information (like calendar dates), separating or combining columns, and dealing with warnings.
Simple data can be cleaned in Excel. Get rid of any extra formatting, so that the data looks like:
ID (if applicable) | Variable1 | Variable2 | Variable3 |
---|---|---|---|
1 | 14 | 25 | 34.4 |
2 | 15 | 23 | 19.7 |
More complex data cleaning can be done in R, for example with the package tidyverse. Install the package, as a one-off command.

install.packages("tidyverse")
library("tidyverse")
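As a small sketch of the kind of cleaning described above, here is a made-up data frame whose date column is split into separate columns using tidyr's separate function (part of the tidyverse). The data values are invented purely for illustration.

```r
library(tidyverse)

# Toy messy data (made up for illustration)
messy <- tibble(id = 1:2, date = c("2018/03/01", "2018/03/15"))

# Separate one column into several - a common cleaning step
clean <- separate(messy, date, into = c("year", "month", "day"), sep = "/")
clean
```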
We will use mtcars
to illustrate the structure of data.
Recall there are 2 main types of variables: qualitative and quantitative, which R calls Factor and num.
View the data.
mtcars
head(mtcars)
tail(mtcars, n=3)
help(mtcars)
Find the dimensions of the data set.

dim(mtcars)
This means that there are 32 rows (the types of cars) and 11 variables (properties of the cars).
Find the names of the variables.

names(mtcars)
Find the structure of the data.

str(mtcars)
where ‘num’ is a quantitative (numerical) variable and ‘Factor’ is a qualitative (categorical) variable.
Select 1 variable using DataName$VariableName, and store the result in a vector.

mpg = mtcars$mpg
Note that RStudio has code completion, so it will auto-predict your commands. When you type mtcars$, the names of all the variables will come up.
Find the classification of 1 variable.

class(mpg)
str(mpg)
Find the length of 1 variable.

length(mpg)
Find the sum of a (quantitative) variable.

sum(mpg)
Sort the data in increasing order.

sort(mpg)
sort(mpg, decreasing = T)
sum(sort(mpg)[1:5])
mpg
mpg[1]
mpg[5]
mpg[c(1,5)]
mtcars$mpg[c(1,5)]
mtcars[1,1]
mtcars[5,1] #mpg is 1st column.
str(mtcars)
Notice that carb is classified as num. Reclassify carb as a factor.

class(mtcars$carb)
carbF = factor(mtcars$carb)
class(carbF)
Take care when converting a factor to a num:

ageCanVote = factor(setNames(c(16, 18, 18, "Unknown"), c("Austria", "Australia", "Afghanistan", "Zambia")))
as.numeric(ageCanVote) # This is a mistake, as it converts to the rank of the factor level
as.numeric(as.character(ageCanVote)) # This converts properly
## Warning: NAs introduced by coercion
The graphical summary must match up with the type of variable(s).
Variable | Type of summary |
---|---|
1 Qualitative | (Single) Barplot |
1 Quantitative | Histogram or (Single) Boxplot |
2 Qualitative | Double Barplot, Mosaicplot |
2 Quantitative | Scatterplot |
1 Quantitative, 1 Qualitative | Double Boxplot |
A barplot is used for qualitative variables.
Which variables are qualitative in mtcars
?
Produce a single barplot
of the gears.
barplot(mtcars$gear)
Notice this is not useful! This is because R has classified gear
as a quantitative variable.
First summarise the gears into a table as follows.

# Produce frequency table
table(mtcars$gear)
##
## 3 4 5
## 15 12 5
# produce barplot (from frequency table)
counts = table(mtcars$gear)
barplot(counts)
help(barplot)
barplot(counts, names.arg=c("3 Gears","4 Gears","5 Gears"),col="lightblue")
par(las=2)
barplot(counts, names.arg=c("3 Gears","4 Gears","5 Gears"),col="lightblue")
Now consider the 2 qualitative variables gear and cyl. Produce a double barplot by faceting or filtering the barplot of gear by cyl.

counts1 = table(mtcars$cyl, mtcars$gear)
barplot(counts1,names.arg=c("3 Gears","4 Gears","5 Gears"),col=c("lightblue","lightgreen","lightyellow"),legend = rownames(counts1))
barplot(counts1,names.arg=c("3 Gears","4 Gears","5 Gears"),col=c("lightblue","lightgreen","lightyellow"),legend = c("4 cyl","6 cyl","8 cyl"))
barplot(counts1, names.arg = c("3 Gears", "4 Gears", "5 Gears"), col = c("lightblue",
"lightgreen", "lightyellow"), legend = c("4 cyl", "6 cyl", "8 cyl"), beside = TRUE)
What do you learn?
A histogram is used for quantitative variables.
Which variables are quantitative in mtcars
?
Produce a histogram of the weights.
hist(mtcars$wt)
Now produce a probability histogram of the weights. What is the difference? Why do the 2 histograms have an identical shape here?

hist(mtcars$wt, freq=F)
In this course, we will consider the probability histogram (2nd one) which means that the total area of the histogram is 1.
What does the histogram tell us about weights of the cars?
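You can verify the "total area 1" claim directly, since hist returns the bar heights and breakpoints when called with plot = FALSE:

```r
# Total area of the probability histogram: sum of (bar height x bar width)
h <- hist(mtcars$wt, plot = FALSE)
sum(h$density * diff(h$breaks))  # equals 1
```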
To see what customisations for hist
are available, use help
.
help(hist)
Produce a customised histogram of the weights.

hist(mtcars$wt, br=seq(0,6,by=0.5), freq=F, col="lightgreen", xlab="weight of cars (1000 lbs)", main="Histogram of Weights of Cars US 1973-74")
Experiment with the customisations to see how they work.
Try a histogram of the gross horsepower. What do you learn?
Produce a histogram of mpg. What do you learn?
A boxplot is another summary for quantitative variables.
boxplot(mtcars$wt)
boxplot(mtcars$wt, horizontal = T)
Which orientation do you prefer?
Compare to the histogram of weights above: what different features are highlighted by a boxplot?
help(boxplot)
Produce a double boxplot of wt by cyl.

boxplot(mtcars$wt~mtcars$cyl)
boxplot(mtcars$wt~mtcars$cyl, names=c("4 cyl", "6 cyl","8 cyl"),ylab="Weight of cars (1000 lbs)")
What do you learn about the weights of cars? Car Weight - see page 6
A mosaic plot visualises 2 qualitative variables.
counts2 = table(mtcars$gear, mtcars$am) # Produces contingency table
plot(counts2) # Produces mosaic plot from contingency table
A scatter plot considers the relationship between 2 quantitative variables.
plot(mtcars$wt,mtcars$mpg)
plot(mtcars$wt,mtcars$mpg, xlab="Car Weight", ylab="Miles per Gallon",col="darkred",pch=19)
abline(lm(mtcars$mpg~mtcars$wt))
We’ll explore this further in Section 6.
Plot many pairs of variables at once, using plot or pairs.

plot(mtcars)
pairs(~mpg+disp+drat+wt,data=mtcars)
pairs(mtcars)
Which variables seem to be related?
All the previous (base R) plots can be done in ggplot
, allowing much greater customisation.
First install ggplot into RStudio. This is a one-off command.

install.packages("ggplot2")

Then load the ggplot2 package.

library(ggplot2)
# mtcars data
p = ggplot(mtcars, aes(x=factor(cyl))) # Select the mtcars data, and focus on cyl as factor (qualitative) on x axis
p + geom_bar() # Produce a barplot
# mpg data
p1 = ggplot(mpg, aes(class)) # Select the mpg data, and focus on class as x axis
p1 + geom_bar() # (1) Produce a barplot
p1 + geom_bar(aes(weight = displ)) # (2) Produce a barplot with counts weighted by the displacement variable
p1 + geom_bar(aes(fill = drv)) # (3) Produce a (double) barplot divided by the drive variable
p1 +
geom_bar(aes(fill = drv), position = position_stack(reverse = TRUE)) +
coord_flip() +
theme(legend.position = "top") # (4) Customising (3)
Produce a histogram of the weights.
p = ggplot(data=mtcars, aes(x=wt)) # Select the mtcars data, and focus on wt (quantitative) on x axis
p + geom_histogram(aes(y=..density..), binwidth=0.5) +
  xlab('Weight') + ylab('Density') # Produce a histogram with x and y axis labels
Using aes(y=..density..)
turns a raw histogram into a probability histogram.
p = ggplot(data=mtcars, aes(x="", y=wt)) # Select the mtcars data, and focus on wt (quantitative) on y axis (with no filtering on x axis)
p + geom_boxplot() # Produce a boxplot
p = ggplot(data=mtcars, aes(x=factor(cyl),y=wt)) # Select the mtcars data, and focus on wt (quantitative) on y axis and cyl (qualitative) on x axis
p + geom_boxplot() # Produce a boxplot, of wt filtered by cyl
geom_jitter plots the points with a small amount of random noise. We use it to investigate over-plotting in small data sets.

p = ggplot(data=mtcars, aes(x=factor(cyl),y=wt))
p + geom_boxplot() + geom_jitter()
What do the following customisations do?
p = ggplot(data=mtcars, aes(x=factor(cyl),y=wt))
p + geom_boxplot() + coord_flip()
p + geom_boxplot(notch = TRUE)
p + geom_boxplot(outlier.colour = "green", outlier.size = 3)
Now consider dividing the weights by another qualitative variable.
p + geom_boxplot(aes(fill = factor(cyl)))
p + geom_boxplot(aes(fill = factor(am)))
p = ggplot(mtcars, aes(wt, mpg)) # Select the mtcars data, and focus on wt (quantitative) on x axis and mpg (quantiative) on y axis
p + geom_point() # Produce a scatterplot of mpg vs wt
p + geom_point(aes(colour = factor(cyl))) # Colour the points by cyl (qualitative)
p + geom_point(aes(shape = factor(cyl))) # Shape the points by cyl (qualitative)
p + geom_point(aes(shape = factor(cyl))) + scale_shape(solid = FALSE)
p + geom_point(aes(size = qsec)) # Size the points by qsec (quantitative)
p + geom_point(aes(colour = cyl)) + scale_colour_gradient(low = "blue") # Colour the points by cyl (quantitative)
The package plotly (or plot.ly) works together with ggplot2 and allows you to create offline interactive plots.
library("plotly")
p1 = plot_ly(mtcars, x = ~mpg, y = ~wt, type="scatter")
p1
p2 = plot_ly(mtcars, x = ~mpg, y = ~wt, symbol = ~cyl, type="scatter")
p2
p3 = plot_ly(mtcars, x = ~mpg, y = ~wt, color = ~cyl, type="scatter")
p3
p4 = plot_ly(mtcars, x = ~mpg, y = ~wt, color = ~factor(cyl), type="scatter")
p4
library("plotly")
p5 = plot_ly(mtcars, x = ~mpg, y = ~factor(cyl), type='box')
The numerical summary must match up with the type of variable(s).
Variable | Type of summary |
---|---|
1 Qualitative | frequency table, most common category |
1 Quantitative | mean, median, SD, IQR etc |
2 Qualitative | contingency table |
2 Quantitative | correlation, linear model |
1 Quantitative, 1 Qualitative | mean, median, SD, IQR etc across categories |
We’ll keep working with the mtcars
dataset. So again remind yourself what it is like.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
dim(mtcars)
## [1] 32 11
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
help(mtcars)
table(mtcars$gear)
table(mtcars$gear, mtcars$am)
mean(mtcars$gear)
median(mtcars$gear)
The standard deviation measures spread for quantitative variables.
The sd
command calculates the sample standard deviation. The squared SD is the variance.
sd(mtcars$gear)
sd(mtcars$gear)^2
var(mtcars$gear)
The popsd command calculates the population standard deviation, but requires the multicon package.

#install.packages("multicon") # a package only needs to be installed once
library(multicon)
popsd(mtcars$gear)
# Longer way
N = length(mtcars$gear)
sd(mtcars$gear)*sqrt((N-1)/N)
Calculate the IQR.

IQR(mtcars$gear)
Find the quartiles with the quantile command, and then work out the IQR.

quantile(mtcars$gear)
quantile(mtcars$gear)[4]-quantile(mtcars$gear)[2]
What is the 50% quantile equivalent to?
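As a check on the previous question, you can compare the 50% quantile against median directly:

```r
# The 50% quantile of the gear variable
quantile(mtcars$gear)["50%"]

# Compare with the median
median(mtcars$gear)
```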
Use summary, which is an expanded version of the 5 number summary. Sometimes these values will vary from those given by quantile, as there are different conventions for calculating quartiles.

summary(mtcars$mpg) # 1 variable
summary(mtcars) # All variables
summary(mtcars$mpg[mtcars$wt>=3])
Here we take all the data from the mtcars dataset for a specific cylinder, e.g. 6.
mtcars[ which(mtcars$cyl==6), ]
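To get numerical summaries across categories (the last row of the summary table above), base R's aggregate or tapply can be used; this is a sketch on the mtcars data:

```r
# Mean weight for each number of cylinders
aggregate(wt ~ cyl, data = mtcars, FUN = mean)

# Median weight for each number of cylinders
tapply(mtcars$wt, mtcars$cyl, median)
```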
cor(mtcars)
par(mfrow=c(1,2)) # plots 2 plots on same row
curve(dnorm(x),-1,1)
curve(dnorm(x),-5,5)
When does the Normal curve start to disappear?
Shade an area under the curve using polygon.

curve(dnorm(x), xlim = c(-2, 2),ylab="",axes=FALSE)
abline(h = 0)
sequence = seq(-2, 0.8, 0.1)
polygon(x = c(sequence,0.8,-2),
y = c(dnorm(c(sequence)),0,0),
col = "purple")
axis(1, at=c(-2,-1,0,0.8,1,2), pos=0)
hist(mtcars$wt,prob=T)
m=mean(mtcars$wt)
s=sd(mtcars$wt)
curve(dnorm(x, mean=m, sd=s),
col="darkgreen", lwd=2, add=TRUE)
To calculate P(Z < 0.8) use pnorm
.
pnorm(0.8)
To calculate P(Z > 0.8):
pnorm(0.8,lower.tail=F)
1-pnorm(0.8)
To calculate P(0.3 < Z < 0.7):
pnorm(0.7)-pnorm(0.3)
If \(X \sim N(5, 4^2)\), then to calculate P(X < 8), we use pnorm(8,mean,sd)
.
pnorm(8,mean=5,sd=4)
pnorm(8,5,4)
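Equivalently, you can standardise first, since P(X < 8) = P(Z < (8-5)/4):

```r
# Standardise, then use the standard Normal
pnorm((8 - 5) / 4)   # same as pnorm(8, mean = 5, sd = 4)
```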
Produce a scatter plot using plot(<x variable>, <y variable>).

plot(mtcars$mpg,mtcars$wt)
plot(mtcars$mpg,mtcars$wt, xlab="mpg", ylab="weight", main="Scatter plot of weight vs mpg")
Calculate the correlation using cor(<x variable>, <y variable>).

cor(mtcars$mpg,mtcars$wt)
Fit a linear model using lm(<y variable>~<x variable>).

lm(mtcars$wt~mtcars$mpg)
#Or
L=lm(mtcars$wt~mtcars$mpg)
L$coeff
#Or
lm(wt~mpg,data=mtcars)
plot(mtcars$mpg,mtcars$wt)
abline(L)
The residual plot helps to detect any pattern not captured by the linear model.
To produce a residual plot:
plot(mtcars$mpg,L$residuals, xlab = "mpg", ylab = "Residuals")
abline(h = 0, col = "blue")
If a linear model is not appropriate, as evidenced by either the scatter plot (non-linear) or residual plot (non-random), we can either try a transformation or directly fit a more complex model.
x = mtcars$wt
y = mtcars$mpg
plot(x,y)
cor(x,y)
plot(log(x),log(y)) # scatterplot of log of data
cor(log(x),log(y))
plot(x,y)
# Set seed for fixing 1 simulation
set.seed(1)
# Simulate the x values
simx = seq(min(x), max(x), length.out = length(x))
# Model the y values as y = 12 + k*e^(-x), with a starting value for k
model = nls(y ~ 12 + k * exp(-x), start = list(k = 500))
summary(model)
# Add the fitted model to the scatter plot
lines(simx, predict(model, newdata = data.frame(x = simx)), lty = 2, col = "green", lwd = 3)
set.seed(1) # ensures the same simulation, "1" can be whatever you choose
sample(c("H", "T"), 1, replace = T)
set.seed(1)
sample(c("H", "T"), 1, replace = T,prob = c(0.9, 0.1))
set.seed(1)
sample1 = sample(c("H", "T"), 10,replace=T)
table(sample1)/10
set.seed(1)
replicate(10, sample(c("H", "T"),1,replace=T))
set.seed(1)
rbinom(n = 10, 1, 0.5)
set.seed(1)
sample(c(1:6), 1, replace = T)
set.seed(1)
sample(c(1:6), 1, replace = T,prob = c(0.1,0.2,0.3,0.2,0.1,0.1))
set.seed(1)
sample(c(1:6), 10, replace = T)
table(sample(c(1:6), 10, replace = T))/10
barplot(table(sample(c(1:6), 10, replace = T))/10, xlab="dice face", ylab="chance")
set.seed(1)
table(replicate(10, sample(c(1:6), 10, replace = T)))
To toss a die 10 times, and find the chance of getting a 6:
tossdie = function() {
rolls = sample(1:6, size = 10, replace = TRUE)
condition = sum(rolls == 6) > 0
return(condition)
}
# Simulate 10000 times
sim = replicate(10000, tossdie())
sum(sim)/length(sim)
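The simulated proportion can be compared with the exact answer, 1 - (5/6)^10, which is the complement of getting no sixes in 10 rolls:

```r
# Exact chance of at least one six in 10 rolls
1 - (5/6)^10   # approximately 0.84
```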
Suppose we have 10 trials with P(event) = 0.3.
The chance of getting 4 events:
dbinom(4,10,0.3)
The chance of getting less than or equal to 5 events:
pbinom(5,10,0.3)
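Note that pbinom is just the cumulative sum of dbinom, which you can check:

```r
# P(X <= 5) as a sum of individual binomial probabilities
sum(dbinom(0:5, 10, 0.3))   # same as pbinom(5, 10, 0.3)
```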
Suppose we have a “box” (or population) containing 0 and 1, from which we draw 10 times, creating a sample. Formal theory gives the expected value (EV) and standard error (SE) of both the Sample Sum (sum of the sample values) and the Sample Mean (mean of the sample values).
Focus | EV | SE |
---|---|---|
Sample Sum | number of draws * mean of box | \(\sqrt{\mbox{number of draws}}\) * SD of box |
Sample Mean | mean of box | \(\frac{SD}{\sqrt{\mbox{number of draws}}}\) |
What matters most is being able to draw a box model by hand. But below is some code for those who are interested.
Install the package DiagrammeR. If you have trouble installing it, you can just use it in the uni labs.
Draw a box model in R.
library("DiagrammeR")
grViz("
digraph CFA {
# All
node [fontname = Helvetica, fontcolor = White]
# Box
node [shape = box, style=filled, color=SteelBlue4, width=2 label='0 1'][fillcolor = SteelBlue4]
a ;
# Sample
node [shape = circle, style=filled, color=IndianRed, width=0.5, label='-'][fillcolor = IndianRed]
b ;
# Draws
a -> b [fontname = Helvetica,label = '10 draws',fontsize=8]
b -> a [fontname = Helvetica,color=grey,arrowsize = 0.5]
}
")
box=c(0,1) # Define the box
n = 10 # Define the sample size (number of draws)
# mean & SD of box
meanbox=mean(box)
library(multicon)
sdbox=popsd(box)
# Other calculations of the population SD.
# (1) RMS formula
sqrt(mean((box - mean(box))^2))
# (2) short cut for binary box
(max(box) - min(box)) * sqrt(length(box[box == max(box)]) * length(box[box ==
min(box)])/length(box)^2)
# EV & SE of Sample Sum
n * meanbox # EV
sqrt(n) * sdbox # SE
set.seed(1)
totals = replicate(20, sum(sample(box, 10, rep = T)))
table(totals)
mean(totals)
sd(totals)
hist(totals)
How do the simulations compare to the exact results?
What is the chance that the sample sum is between 4 and 5?
Draw the curve (Ext).
m = n * meanbox
s = sqrt(n) * sdbox
threshold1 = 4
threshold2 = 5
xlimits1=seq(round(-4*s+m, digit=-1),round(4*s+m, digit=-1),1)
curve(dnorm(x, mean=m, sd=s), m-4*s,m+4*s, col="indianred", lwd=2, xlab="", ylab="", main ="P(sum is between 4 and 5)",axes=F)
sequence <- seq(threshold1, threshold2, 0.1)
polygon(x = c(sequence,threshold2,threshold1),
y = c(dnorm(c(sequence),m,s),0,0),
col = "indianred")
axis(1,xlimits1,line=1,col="black",col.ticks="black",col.axis="black")
mtext("sum of box",1,line=1,at=-2,col="black")
axis(1,xlimits1,labels=round((xlimits1-m)/s,2),line=3.3,col="indianred1",col.ticks="indianred1",col.axis="indianred1")
mtext("Z score",1,line=3.3,at=-2,col="indianred1")
m = n * meanbox # Use EV as the mean of the Normal
s = sqrt(n) * sdbox # Use SE as the SD of the Normal
pnorm(5, m, s)-pnorm(4, m, s)
# EV & SE of Sample Mean
meanbox # EV
sdbox/sqrt(n) # SE
What is the chance that the sample mean is between 0.4 and 0.6?
Find the chance.
m = meanbox # Use EV as the mean of the Normal
s = sdbox/sqrt(n) # Use SE as the SD of the Normal
pnorm(0.6, m, s)-pnorm(0.4, m, s)
Simulate a survey of 25 people taken from a population of size 200 with 2 types of people in the proportion 1:3.
This is equivalent to taking 25 draws from a box containing 50 1’s and 150 0’s.
Note: A survey is technically without replacement, but here we will approximate by the box model which assumes replacement.
library("DiagrammeR")
grViz("
digraph CFA {
# All
node [fontname = Helvetica, fontcolor = White]
# Box
node [shape = box, style=filled, color=SteelBlue4, width=2 label='50x1s 150x0s'][fillcolor = SteelBlue4]
a ;
# Sample
node [shape = circle, style=filled, color=IndianRed, width=0.5, label='-'][fillcolor = IndianRed]
b ;
# Draws
a -> b [fontname = Helvetica,label = '25 draws',fontsize=8]
b -> a [fontname = Helvetica,color=grey,arrowsize = 0.5]
}
")
box=c(rep(1,50),rep(0,150)) # Define the box/population
n = 25 # Number of draws (sample size)
# mean & SD of box
meanbox = mean(box)
sdbox = popsd(box)
n*meanbox # EV
sqrt(n)*sdbox # SE
meanbox # EV
sdbox/sqrt(n) # SE
# Bootstrap box model
boxest0 = 200*4/10
boxest1 = 200*6/10
box = c(rep(1, boxest1), rep(0, boxest0))
n = 10 # Size of sample (= number of draws)
# Mean & SD of box
meanbox = mean(box) # population proportion of 1's
sdbox = popsd(box)
# EV & SE of Sample Mean (= sample proportion of 1's)
meanbox
sdbox/sqrt(n)
Note the expected value of the sample proportion of 1’s (0.6) is equal to the population proportion of 1’s (0.6), as it should be.
Hence, we expect 60% 1s in the sample with a SE of around 15 percentage points.
A survey of 30 people is taken from a population of size 600 with 2 types of people in the proportion 1:2.
This is equivalent to taking 30 draws from a box containing 200 1’s and 400 0’s (assuming replacement).
Define the box (mean, SD)
box=c(rep(1,200),rep(0,400)) # Define the box/population
n = 30 # Number of draws (sample size)
# mean & SD of box
meanbox = mean(box)
sdbox = popsd(box)
# EV & SE of sample proportion of 1's
ev = meanbox
se = sdbox/sqrt(n)
# 68% CI
c(ev - 1*se,ev + 1*se)
# 95% CI
c(ev - 2*se,ev + 2*se)
# 99.7% CI
c(ev - 3*se,ev + 3*se)
Suppose we have a sample of size \(n\) from a population with proportion \(p\) of a certain trait.
We simulate samples from the box model, and then compare our actual sample to these results - ie how common is our sample?
We want to test \(H_{0}\): \(p = 0.5\).
We can produce a picture of the box model modelling \(H_{0}\) (Ext).
library("DiagrammeR")
DiagrammeR::grViz("
digraph rmarkdown {
graph [fontsize = 16, fontname = Arial, nodesep = .1, ranksep = .8]
node [fontsize = 16, fontname = Arial, fontcolor = White]
edge [fontsize = 12, fontname = Arial, width = 2]
Box [shape=oval,style=filled, color=SteelBlue3,width=5, label='1 0']
Sample [shape=oval, style=filled, color=SteelBlue2, label='']
Box -> Sample [label=' n draws']
}
")
detach(package:DiagrammeR)
set.seed(1)
# Define box (modelling Ho)
box=c(0,1)
# Simulate 100 samples of size 20 from the box
totals = replicate(100, sum(sample(box, 20, rep = T)))
table(totals)
hist(totals)
We want to test \(H_{0}\): \(p = 0.2\).
We can produce a picture of the box model modelling \(H_{0}\) (Ext).
library("DiagrammeR")
DiagrammeR::grViz("
digraph rmarkdown {
graph [fontsize = 16, fontname = Arial, nodesep = .1, ranksep = .8]
node [fontsize = 16, fontname = Arial, fontcolor = White]
edge [fontsize = 12, fontname = Arial, width = 2]
Box [shape=oval,style=filled, color=SteelBlue3,width=5, label='100p x 1 100(1-p) x 0']
Sample [shape=oval, style=filled, color=SteelBlue2, label='']
Box -> Sample [label=' n draws']
}
")
detach(package:DiagrammeR)
set.seed(1)
# Define box (modelling Ho)
box=c(0,1)
# Simulate 1000 samples of size 30 from the box
totals = replicate(1000, sum(sample(box, 30, prob=c(0.8,0.2), rep = T)))
table(totals)
hist(totals)
Suppose we have a sample of size \(n\) from a population with unknown \(\mu\).
We want to test a null hypothesis for the value of \(\mu\), by modelling \(H_{0}\) by a box model with \(n\) draws.
In 2013, the average miles per gallon (mpg) for cars sold in the US was 23.6. We want to see if this is a significant difference from the mtcars
data.
Formally, we write \(H_0: \mu=23.6\) vs \(H_1: \mu \neq 23.6\).
Have a quick look to see if \(H_0: \mu=23.6\) seems to fit with the data (eg right units and size etc).
mean(mtcars$mpg) # sample mean
length(mtcars$mpg) # size of sample
For a \(t\) test (or \(Z\) test), we need to check the assumption of normality.
hist(mtcars$mpg)
boxplot(mtcars$mpg)
Notice the boxplot looks symmetric, which is consistent with normality. Note the histogram shows some light right skewing.
qqnorm(mtcars$mpg)
shapiro.test(mtcars$mpg)
Check to see if the Q-Q Plot looks linear, as this indicates normality. The Shapiro test tests the null hypothesis that the sample comes from a Normal distribution. Here the p-value is quite big (0.1229) hence we would retain \(H_0\) which suggests normality.
par(mfrow=c(2,2))
qqnorm(mtcars$mpg)
shapiro.test(mtcars$mpg)
hist(mtcars$mpg)
boxplot(mtcars$mpg)
The formula for the \(t\)-test statistic is \(t_{obs} = \frac{\mbox{observed value - hypothesised value}}{\mbox{standard error}}\).
We use the SD of the data as an approximation of the SD of the population, so we can calculate the SE of the Sample Mean.
tobs = (mean(mtcars$mpg)-23.6)/(sd(mtcars$mpg)/sqrt(32))
length(mtcars$mpg)-1 # degrees of freedom
2*(1-pt(abs(tobs),31))
Using abs here allows for the test statistic to be in either tail.

Now check using the t.test command on mtcars.

t.test(mtcars$mpg, mu = 23.6)
There is no z.test command in base R, so we can write a z.test function ourselves. Here we assume that the population variance of mtcars$mpg is the same as the sample variance (this of course is only a guess).

v = var(mtcars$mpg)
# This creates a function called z.test, with data, mu and var as the inputs, and zobs as the output.
z.test = function(data, mu, var){
zobs = (mean(data) - mu) / (sqrt(var / length(data)))
return(zobs)
}
# Run the function with inputs
z.test(mtcars$mpg,23.6,v)
2*(1-pnorm(abs(-3.293877)))
The choice of t-test depends both on the design of the experiment and the nature of the data.
Type | Context | Data |
---|---|---|
1 sample t-test | Consider 1 population with unknown mean | 1 sample |
1 sample t-test (paired data) | Compare 2 dependent samples (eg Before and After treatment) | 1 sample of differences |
2 sample t-test | Compare 2 populations with unknown means | 2 independent samples |
t.test
command in R allows you to perform different 1 and 2 sample t-tests, by specifying different arguments.
help(t.test)
Consider the sleep data provided in R.

help(sleep)
dim(sleep)
class(sleep)
head(sleep)
The sleep data represents the effect of 2 soporific drugs (“group”) on sleep time compared to a control (drug - no drug).

Initially we are going to ignore the ID column (which shows patients trialled both drugs) and assume that the 2 groups are independent, ie that drug1 was administered to 10 patients and drug2 was administered to 10 different patients.
H: Hypotheses
Let \(\mu_1\) = mean net sleep of study participants taking drug1, and \(\mu_2\) = mean net sleep of study participants taking drug2.
\(H_{0}\) : There is no difference between the 2 drugs: \(\mu_1=\mu_2\).
\(H_{1}\) : There is a difference between the 2 drugs: \(\mu_1 \neq \mu_2\).
Being a medical study we are going to consider a more conservative significance level of \(\alpha = 0.01\).
A: Assumptions We assume the 2 samples are independent, and the 2 populations are both Normally distributed with the same SD.
T & P: Test statistic and P-value
t.test(extra~group, data = sleep, var.equal = TRUE, conf.level = 0.99)
Statistical conclusion: The p-value is 0.07919 > 0.01. Furthermore, 0 is contained within the confidence interval. Therefore we retain the null hypothesis.
Scientific conclusion: The data seems consistent with no difference in sleep time between patients taking drug1 and drug2.
t.test(extra~group, data = sleep, var.equal = FALSE, conf.level = 0.99)
Looking at the description of the sleep
dataset we can see that the data is actually NOT independent.
Sleep was measured in 10 patients who were given both drug1 and drug2 at different times. A crossover study is a typical experimental design that would generate this type of data. When we only have 2 treatments, we call this paired data.
Therefore we need to perform a 1 sample t-test for paired data: a “paired t-test”. This is the same as a 1 sample t-test on the difference between the two treatments (i.e. drug1 - drug2). Therefore the assumptions for this test are equivalent to those for a 1 sample t-test (ie the data must be approximately normally distributed).
sleep_diff <- sleep$extra[sleep$group == 1]-sleep$extra[sleep$group == 2]
qqnorm(sleep_diff)
qqline(sleep_diff)
boxplot(sleep_diff)
The normal quantile-quantile plot and the boxplot show that there may be an outlier in the dataset.
However, for the purpose of this analysis and keeping in mind that these tests are reasonably robust to non-normal data, we will go ahead and test the data.
t.test(extra~group, data = sleep, paired = T)
The p-value is < 0.05 and zero is not contained within the confidence interval. Therefore we reject the null hypothesis and conclude that there is a significant difference in extra sleep hours between the 2 drugs. Looking at the earlier boxplot, we see that drug2 is more effective in inducing extra sleep than drug1.
Note: this demonstrates how important it is to use the correct analysis as this conclusion is different to when the 2 samples were assumed to be independent.
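As a check of the equivalence noted above, a 1 sample t-test on the differences gives the same result as the paired t-test:

```r
# Differences (drug1 - drug2), as computed earlier
sleep_diff <- sleep$extra[sleep$group == 1] - sleep$extra[sleep$group == 2]

# 1 sample t-test on the differences: identical to the paired t-test
t.test(sleep_diff, mu = 0)
```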
Chi-squared tests are used for qualitative data.
Qualitative data is divided into groups, like hair colour or gender. Even height and income can be analysed as qualitative data, if the quantitative data (eg 165cm, 170cm) is grouped into intervals (eg 150-160cm, 160-170cm, 170-180cm).
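For example, cut groups a quantitative variable into intervals (the heights below are made up for illustration):

```r
# Made-up heights in cm, grouped into 10 cm intervals
heights <- c(152, 165, 170, 178, 158)
groups <- cut(heights, breaks = c(150, 160, 170, 180))
table(groups)
```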
Pearson’s chi-squared test is a very versatile test which compares the observed frequencies over the categories with the expected frequencies under some model (null hypothesis).
Test the goodness of fit of a model for the categories of 1 variable. (Eg: this is similar in spirit to the Shapiro-Wilk test for normality).
Test for independence of 2 variables
Test for homogeneity of 2 variables
The test is done quickly in R using chisq.test.
Two tomato varieties (dominant tall cut-leaf tomatoes and dwarf potato-leaf tomatoes) were genetically crossed, yielding 1611 offspring.
According to genetic theory, the 4 phenotypes should occur with an approximate ratio of 9:3:3:1.
Data
Phenotype | Observed frequency | Expected frequency | O-E | (O-E)^2/E |
---|---|---|---|---|
Tall cut-leaf | 926 | 906.1875 | 19.8125 | 0.4332 |
Tall potato-leaf | 288 | 302.0625 | -14.0625 | 0.6547 |
Dwarf cut-leaf | 293 | 302.0625 | -9.0625 | 0.2719 |
Dwarf potato-leaf | 104 | 100.6875 | 3.3125 | 0.1090 |
Total | 1611 | 1611 | 0 | 1.4688 |
T,P&C
pchisq(1.4688,3,lower.tail=F)
Note: we are only interested in the upper tail probability. As the p-value is 0.6895, we retain the null hypothesis and conclude that the model proposed by genetic theory fits the genetic outcome well.
observedfreq = c(926,288,293,104) # observed frequencies
expectedprop = c(9/16,3/16,3/16,1/16) # expected proportions
chisq.test(x = observedfreq,
p = expectedprop)
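The chisq.test call above reproduces the hand calculation in the table; as a sketch, the statistic can also be built up term by term:

```r
O <- c(926, 288, 293, 104)        # observed phenotype counts
E <- sum(O) * c(9, 3, 3, 1) / 16  # expected counts under the 9:3:3:1 model
stat <- sum((O - E)^2 / E)        # approx 1.469, matching the table
pchisq(stat, df = length(O) - 1, lower.tail = FALSE)  # upper-tail p-value
```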
Observed | 18-24 | 25-34 | 35-49 | 50-64 |
---|---|---|---|---|
Yes | 60 | 54 | 46 | 41 |
No | 40 | 44 | 53 | 57 |
Total | 100 | 98 | 99 | 98 |
The research question is: Is the desire to ride a bicycle dependent on the age group?
H: Hypotheses
\(H_0\): Age is independent of desire to ride a bicycle.
\(H_1\): Desire to ride a bicycle is dependent on age.
Data
Bicycle <- matrix(c(60,54,46,41,40,44,53,57), nrow = 2, ncol = 4, byrow = TRUE, dimnames = list(c("Yes", "No"), c("18-24", "25-34", "35-49", "50-64")))
Bicycle
chisq.test(Bicycle)
From the output, the test statistic is 8.0061.
The p-value is 0.04589 (just less than 0.05), so there is evidence against \(H_0\): it appears the desire to ride a bicycle is dependent on age group.
Interpret further
mosaicplot(Bicycle, shade = TRUE, las=1)
Note: There is no shading, but we see that the boxes are slightly bigger than expected for the “yes” answer in the youngest group and slightly bigger than expected for the “no” answer in the oldest group. This suggests that cycling is more popular with younger people.
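To see which cells drive the result, the expected counts and standardised residuals can be pulled straight out of the test object (the table is re-created here so the sketch is self-contained):

```r
# Re-create the bicycle contingency table
Bicycle <- matrix(c(60, 54, 46, 41, 40, 44, 53, 57), nrow = 2, byrow = TRUE,
                  dimnames = list(c("Yes", "No"),
                                  c("18-24", "25-34", "35-49", "50-64")))
res <- chisq.test(Bicycle)
res$expected   # counts expected under independence
res$residuals  # (O - E)/sqrt(E), the standardised residuals
```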
The head surgeon at the University of Sydney medical center was concerned that trainee (resident) physicians were applying blood transfusions at a different rate than attending physicians.
Therefore, she ordered a study of the 49 attending physicians and 71 residents. For each of the 120 physicians, the number of blood transfusions prescribed unnecessarily in a one-year period was recorded. The number of unnecessary blood transfusions was grouped into four groups: Frequently, Occasionally, Rarely, or Never.
The contingency table is given below:
Physician | Frequent | Occasionally | Rarely | Never | Total |
---|---|---|---|---|---|
Attending | 2 | 3 | 31 | 13 | 49 |
Resident | 15 | 28 | 23 | 5 | 71 |
Total | 17 | 31 | 54 | 18 | 120 |
The research question is: Are unnecessary blood transfusions distributed equally amongst attending physicians and resident physicians?
\(H_0\): \(p_{resFreq}=p_{attFreq}\) and \(p_{resOcc}=p_{attOcc}\) and \(p_{resRar}=p_{attRar}\) and \(p_{resNev}=p_{attNev}\)
\(H_1\): at least one of the proportions is not equal
Data,T,P&C
BloodT <- matrix(c(2,3,31,13,15,28,23,5), nrow = 2, ncol = 4, byrow = TRUE, dimnames = list(c("Attending", "Resident"), c("Frequent", "Occasionally", "Rarely", "Never")))
BloodT
chisq.test(BloodT)
Note: The p-value is very small, therefore we reject the null hypothesis and conclude that at least one of the proportions is not equal.
Interpret further
mosaicplot(BloodT, shade = TRUE, las=1)
Note: the mosaic plot also shows the standardised residuals which are calculated as follows: \(\frac{\mbox{Observed Frequency - Expected Frequency}}{\sqrt{\mbox{Expected Frequency}}}\)
Note: The Attending physicians have fewer counts than expected in the “Occasional” group and the Resident (training) physicians have more than expected in the “Occasional” group. Furthermore, the Attending physicians have more counts than expected in the “Never” group.
Mosaic Plot: Residuals
As a general rule of thumb:
If the residual is less than -2: the cell’s observed frequency is less than the expected frequency.
If the residual is greater than 2: the observed frequency is greater than the expected frequency.
The shade = TRUE argument demonstrates our rule of thumb using colours on the mosaic plot: blue indicates more counts than expected (residual > 2) and red indicates fewer counts than expected (residual < -2).
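Those standardised residuals (and hence the colouring) can be checked numerically; a self-contained sketch using the transfusion table:

```r
# Re-create the transfusion contingency table
BloodT <- matrix(c(2, 3, 31, 13, 15, 28, 23, 5), nrow = 2, byrow = TRUE,
                 dimnames = list(c("Attending", "Resident"),
                                 c("Frequent", "Occasionally", "Rarely", "Never")))
res <- chisq.test(BloodT)
res$residuals           # standardised residuals (O - E)/sqrt(E)
abs(res$residuals) > 2  # TRUE marks the cells that get shaded
```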
lm(mtcars$wt~mtcars$mpg)
Now we use this output to perform a hypothesis test.
A:
L = lm(mtcars$wt ~ mtcars$mpg)
plot(mtcars$mpg,L$residuals, xlab = "mpg", ylab = "Residuals")
abline(h = 0, col = "blue")
plot(mtcars$mpg, mtcars$wt)
summary(L)
The test statistic for slope is -9.559 with p-value 1.29e-10. Hence we strongly reject \(H_{0}\) and conclude that a linear model seems very appropriate for weight on mpg.
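The same conclusion can be read off a confidence interval for the slope: if the interval excludes zero, we reject \(H_{0}\). A quick sketch, refitting the model with the data argument:

```r
L <- lm(wt ~ mpg, data = mtcars)  # weight regressed on mpg
summary(L)$coefficients["mpg", ]  # estimate, SE, t value (-9.559), p-value
confint(L, "mpg")                 # 95% CI for the slope; it excludes 0
```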
We can use hypothesis tests to test the assumptions of other hypothesis tests!
Here we illustrate using the car weights.
hist(mtcars$wt, main = "Histogram of Car Weights", freq=F)
help(qqnorm)
qqnorm(mtcars$wt, main = "QQ-plot for Car weights")
qqline(mtcars$wt, col = "blue")
Warning! The Shapiro-Wilk normality test is sensitive to sample size. Small samples will nearly always pass normality tests. Therefore, it is recommended to use both visual inspection and a significance test to decide whether a dataset is normally distributed.
T,P&C:
shapiro.test(mtcars$wt)
Note: The p-value is much greater than 0.05, so there is no evidence against normality and we retain the assumption that the data is normally distributed.
boxplot(extra~group, data = sleep)
Note: It is clear from the boxplots that the variation in group 1 is less than in group 2.
Boxplots are a very useful tool as they provide an indication of whether the sample means are similar or different, whether the samples are normally distributed, and whether the variances are similar between the 2 samples.
H:
Let \(\sigma_1^2\) = variance of the group 1 drug.
Let \(\sigma_2^2\) = variance of the group 2 drug.
\(H_{0}\) : There is no difference: \(\sigma_1^2=\sigma_2^2\)
\(H_{1}\) : There is a difference: \(\sigma_1^2 \neq \sigma_2^2\)
We hope to retain \(H_{0}\)!
T,P&C
var.test(extra~group, data = sleep)
Note: From the variance test, it is clear that the p-value is much greater than 0.05. Therefore we retain the null hypothesis and conclude that the variances seem to be equal.
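Having retained equal variances, a pooled two-sample t-test (var.equal = TRUE) would be the natural next step, shown here purely for illustration (recall these data are actually paired):

```r
# Pooled two-sample t-test; justified only when the variances are equal
t.test(extra ~ group, data = sleep, var.equal = TRUE)
```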
Unfortunately this can happen on Windows computers. You can fix it with the following code:
names(smoking)[1] = "Age"
This code looks at the column names, picks the first one, and then reassigns it to Age.
data = read.xls( "filename" , sheet = 3, header = TRUE)
How do I produce a residual plot with missing values in data?
library(ggplot2)
help(msleep)
plot(msleep$bodywt, msleep$brainwt)
L = lm(msleep$brainwt ~ msleep$bodywt)
summary(L)
abline(L)
# Correlation (with missing values)
length(msleep$bodywt)
length(msleep$brainwt)
cor(msleep$bodywt, msleep$brainwt, use = "complete.obs")
# Residual plot (with missing values)
length(msleep$bodywt)
length(L$residuals)
residuals1 = resid(lm(msleep$brainwt ~ msleep$bodywt, na.action = na.exclude))
length(residuals1)
plot(msleep$bodywt, residuals1)