University of Sydney

Example: Children's TV

In a study of the television viewing habits of children, a developmental psychologist selects a random sample of 100 boys and 200 girls of preschool age. Each child is asked which of the following TV programs they like best: Sesame Street or Play School. Results are shown in the table below:

\[ \begin{array}{l|cc|c} & \mbox{Sesame Street} & \mbox{Play School} & \mbox{Row Total} \\ \hline \mbox{Boys} & 42 & 58 & 100 \\ \mbox{Girls} & 86 & 114 & 200 \\ \hline \mbox{Column Total} & 128 & 172 & 300 \\ \end{array} \]

Is there any evidence that the viewing preferences of boys and girls are different?

Test of homogeneity

  • Suppose that several samples are taken from two independent populations, each of which is categorized according to the same set of variables.

  • We want to test whether the probability distributions (proportions) of the categories are the same over the different populations.

Two-way contingency table

\[ \begin{array}{l|cc|c} & \mbox{Sesame Street} & \mbox{Play School} & \mbox{Row Total} \\ \hline \mbox{Boys} & 42 & 58 & 100 \\ \mbox{Girls} & 86 & 114 & 200 \\ \hline \mbox{Column Total} & 128 & 172 & 300 \\ \end{array} \]

  • A contingency table allows us to tabulate data from multiple categorical variables.

  • Contingency tables are heavily used in survey research, business intelligence, engineering and scientific research.

  • We call the above table a two-way contingency table, specifically a \(2 \ \times \ 2\) contingency table.

Two-way contingency table in R

c(head(Sex),tail(Sex))
##  [1] "Boys"  "Boys"  "Boys"  "Boys"  "Boys"  "Boys"  "Girls" "Girls"
##  [9] "Girls" "Girls" "Girls" "Girls"
c(head(Preferences),tail(Preferences))
##  [1] "Play School" "Play School" "Sesame St"   "Play School" "Play School"
##  [6] "Play School" "Sesame St"   "Play School" "Sesame St"   "Play School"
## [11] "Play School" "Play School"
table(Sex, Preferences)
##        Preferences
## Sex     Play School Sesame St
##   Boys           58        42
##   Girls         114        86

Test of homogeneity

Notation: a dot in a subscript denotes summation over that index, so for example \(y_{.1} = y_{11} + y_{21}\)

\[ \begin{array}{l|cc|c} & \mbox{Sesame Street} & \mbox{Play School} & \mbox{Row Total (fixed)} \\ \hline \mbox{Boys} & y_{11} & y_{12} & y_{1.} = \sum_{j=1}^2 y_{1j} = n_1\\ \mbox{Girls} & y_{21} & y_{22} & y_{2.} = \sum_{j=1}^2 y_{2j} = n_2 \\ \hline \mbox{Column Total} & y_{.1} & y_{.2} & n = n_1 + n_2\\ \end{array} \]
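The dot notation can be checked numerically. A minimal sketch in base R, using the counts from the table above (this code is illustrative and not part of the original slides):

```r
# Build the 2x2 table of counts and compute its margins
y <- matrix(c(42, 58, 86, 114), nrow = 2, byrow = TRUE,
            dimnames = list(c("Boys", "Girls"),
                            c("Sesame Street", "Play School")))
rowSums(y)  # y_1. = 100, y_2. = 200 (the fixed sample sizes n_1, n_2)
colSums(y)  # y_.1 = 128, y_.2 = 172
sum(y)      # n = 300
```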

Test of homogeneity


\[ \begin{array}{l|cc|c} & \mbox{Sesame Street} & \mbox{Play School} & \mbox{Row Total (fixed)} \\ \hline \mbox{Boys} & p_{11} & p_{12} & p_{1.} = \sum_{j=1}^2 p_{1j} = 1\\ \mbox{Girls} & p_{21} & p_{22} & p_{2.} = \sum_{j=1}^2 p_{2j} = 1 \\ \hline \mbox{Column Total} & & & \\ \end{array} \]

  • Under the null hypothesis of homogeneity we have \(p_{11} = p_{21}\) and \(p_{12} = p_{22}\), so the two samples can be pooled to estimate the common column probabilities.

Test of homogeneity


\[ \begin{array}{l|cc|c} & \mbox{Sesame Street} & \mbox{Play School} & \mbox{Row Total (fixed)} \\ \hline \mbox{Boys} & y_{11} & y_{12} & y_{1.} = \sum_{j=1}^2 y_{1j} = n_1\\ \mbox{Girls} & y_{21} & y_{22} & y_{2.} = \sum_{j=1}^2 y_{2j} = n_2 \\ \hline \mbox{Column Total} & y_{.1} & y_{.2} & n = n_1 + n_2\\ \end{array} \]

Under the null hypothesis of homogeneity we have \(p_{11} = p_{21}\) and \(p_{12} = p_{22}\), so we estimate the common cell probabilities by the pooled column proportions

\[\hat p_{ij} = y_{.j}/n.\]
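As a quick numerical illustration (not in the original slides), the pooled estimates for the children's TV data can be computed in base R:

```r
# Pooled column proportions under H0 (homogeneity): hat p_.j = y_.j / n
y <- matrix(c(42, 58, 86, 114), nrow = 2, byrow = TRUE)
n <- sum(y)              # n = 300
p.hat <- colSums(y)/n    # hat p_.1 = 128/300, hat p_.2 = 172/300
p.hat
```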

Chi-square test of homogeneity

\[ \begin{eqnarray*} T &=& \sum_{i=1}^2\sum_{j=1}^2\frac{(y_{ij}-n_i\hat p_{ij})^2}{n_i\hat p_{ij}} \\ &=&\sum_{i=1}^2\sum_{j=1}^2\frac{(y_{ij}-y_{i.}\hat p_{ij})^2}{y_{i.}\hat p_{ij}} \\ &=&\sum_{i=1}^2\sum_{j=1}^2\frac{(y_{ij}-y_{i.}y_{.j}/n)^2}{y_{i.}y_{.j}/n}. \end{eqnarray*} \]


  • Under \(H_0\), \(T \sim \chi^2_1\) approximately.

Hypothesis testing workflow

The Chi-square test of homogeneity for a \(2 \ \times \ 2\) contingency table is:

  • Hypothesis:         \(H_0:\) \(p_{11} = p_{21} \ \& \ p_{12} = p_{22}\) vs \(H_1\): \(p_{11} \ne p_{21}\) (equivalently \(p_{12} \ne p_{22}\))
  • Test statistic:       \(T = \sum_{i=1}^2\sum_{j=1}^2\frac{(y_{ij}-y_{i.}y_{.j}/n)^2}{y_{i.}y_{.j}/n}\)
  • Assumptions:     \(E_{ij}=y_{i.}y_{.j}/n \ge 5\). Under \(H_0\), \(T \sim \chi_{1}^2\) approx.
  • P-value:         \(P(T \ge t) \ = \ P(\chi_{1}^2 \ge t)\)
  • Decision:           Reject \(H_0\) if the \(p\)-value \(< \alpha\)

Example: Children's TV

In a study of the television viewing habits of children, a developmental psychologist selects a random sample of 100 boys and 200 girls of preschool age. Each child is asked which of the following TV programs they like best: Sesame Street or Play School. Results are shown in the table below:

\[ \begin{array}{l|cc|c} & \mbox{Sesame Street} & \mbox{Play School} & \mbox{Row Total} \\ \hline \mbox{Boys} & 42 & 58 & 100 \\ \mbox{Girls} & 86 & 114 & 200 \\ \hline \mbox{Column Total} & 128 & 172 & 300 \\ \end{array} \]

Is there any evidence that the viewing preferences of boys and girls are different?

Example: Children's TV

The Chi-square test of homogeneity of the viewing habits of boys and girls is:

  • Hypothesis:         \(H_0:\) \(p_{11} = p_{21} \ \& \ p_{12} = p_{22}\) vs \(H_1\): \(p_{11} \ne p_{21}\) (equivalently \(p_{12} \ne p_{22}\))
  • Test statistic:       \(T = \sum_{i=1}^2\sum_{j=1}^2\frac{(y_{ij}-y_{i.}y_{.j}/n)^2}{y_{i.}y_{.j}/n} = 0.027\)
  • Assumptions:     \(E_{ij}=y_{i.}y_{.j}/n \ge 5\). Under \(H_0\), \(T \sim \chi_{1}^2\) approx.
  • P-value:         \(P(T \ge t) \ = \ P(\chi_{1}^2 \ge 0.027) = 0.87\)
  • Decision:           Since the \(p\)-value \(= 0.87\) is large, there is not enough evidence to reject the null hypothesis.

In R

y.mat=table(Sex, Preferences)
n = sum(y.mat)
r = c = 2
y.mat
##        Preferences
## Sex     Play School Sesame St
##   Boys           58        42
##   Girls         114        86
chisq.test(y.mat,correct=FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  y.mat
## X-squared = 0.027253, df = 1, p-value = 0.8689

In R

yr=apply(y.mat,1,sum) #checking
yr
##  Boys Girls 
##   100   200
yc=apply(y.mat,2,sum) # Or try colSums()
yc
## Play School   Sesame St 
##         172         128
(yr.mat=matrix(yr,r,c,byrow=F))
##      [,1] [,2]
## [1,]  100  100
## [2,]  200  200
(yc.mat=matrix(yc,r,c,byrow=T))
##      [,1] [,2]
## [1,]  172  128
## [2,]  172  128
(ey.mat=yr.mat*yc.mat/n) # Or try ey.mat = yr%*%t(yc)/n
##           [,1]     [,2]
## [1,]  57.33333 42.66667
## [2,] 114.66667 85.33333

In R

ey.mat>=5 #test Eij>=5
##      [,1] [,2]
## [1,] TRUE TRUE
## [2,] TRUE TRUE
(chi=sum((y.mat-ey.mat)^2/ey.mat))
## [1] 0.02725291
(p.value=1-pchisq(chi,1))
## [1] 0.8688774

Tests for independence

Example: Titanic

  • The 'Titanic' dataset comes preloaded in R.

  • This data set provides information on the fate of passengers on the fatal maiden voyage of the ocean liner 'Titanic', summarized according to economic status (class), sex, age and survival.

  • Is there any evidence that being a woman increased your chance of survival on the ship?

x = as.data.frame(Titanic)
y.mat = xtabs(Freq ~ Sex + Survived,x)
y.mat
##         Survived
## Sex        No  Yes
##   Male   1364  367
##   Female  126  344

Tests for independence

  • There are situations in which a single sample is categorized according to two or more factors.

  • It is of interest to know whether the factors for the classification are independent.

Frequency table


\[ \begin{array}{l|cc|c} & \mbox{Survived} & \mbox{Did not survive} & \mbox{Row Total} \\ \hline \mbox{Male} & y_{11} & y_{12} & y_{1.}\\ \mbox{Female} & y_{21} & y_{22} & y_{2.}\\ \hline \mbox{Column Total} & y_{.1} & y_{.2} & n\\ \end{array} \]

Table of proportions

Let \(p_{ij}\) denote the probability of an observation falling in the \((i,j)^{th}\) category.

The marginal row and column probabilities are respectively:

\[ p_{i.}=\sum_{j=1}^2 p_{ij}\quad \mbox{and}\quad p_{.j}=\sum_{i=1}^2p_{ij} \]

\[ \begin{array}{l|cc|c} & \mbox{Survived} & \mbox{Did not survive} & \mbox{Row Total} \\ \hline \mbox{Male} & p_{11} & p_{12} & p_{1.}\\ \mbox{Female} & p_{21} & p_{22} & p_{2.}\\ \hline \mbox{Column Total} & p_{.1} & p_{.2} & 1\\ \end{array} \]
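The joint and marginal proportions can be computed directly in base R. A sketch using the Titanic counts introduced above (code not in the original slides):

```r
# Joint proportions p_ij = y_ij / n and their margins
y <- matrix(c(1364, 367, 126, 344), nrow = 2, byrow = TRUE,
            dimnames = list(Sex = c("Male", "Female"),
                            Survived = c("No", "Yes")))
p <- prop.table(y)  # p_ij = y_ij / n
rowSums(p)          # marginal row probabilities p_1. and p_2.
colSums(p)          # marginal column probabilities p_.1 and p_.2
sum(p)              # all cell probabilities sum to 1
```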

Independence



Statistical thinking:

  • What does independence mean?
  • Think of two things that are independent, explain why they are independent.
  • How do you know they are independent?

Independence

\(X\) and \(Y\) are said to be independent if

\[P(X=x|Y=y) \ = \ P(X=x)\]

or

\[P(X=x, Y=y) \ = \ P(X=x)P(Y=y)\]



\[ \begin{array}{l|cc|c} & \mbox{Survived} & \mbox{Did not survive} & \mbox{Row Total} \\ \hline \mbox{Male} & p_{11} & p_{12} & p_{1.}\\ \mbox{Female} & p_{21} & p_{22} & p_{2.}\\ \hline \mbox{Column Total} & p_{.1} & p_{.2} & 1\\ \end{array} \]
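Independence says each cell probability factors as \(p_{ij} = p_{i.}p_{.j}\). A quick numerical check on the Titanic proportions (illustrative code, not from the original slides) makes the departure from independence visible:

```r
# Compare observed cell proportions with the product of the margins
y <- matrix(c(1364, 367, 126, 344), nrow = 2, byrow = TRUE)
p <- prop.table(y)
round(outer(rowSums(p), colSums(p)), 3)  # p_i. * p_.j: cells under independence
round(p, 3)                              # observed p_ij -- clearly different here
```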

Expected frequency

Under \(H_0\) of independence, the expected frequency in cell \((i,j)\) is \(n\,p_{ij}=n\,p_{i.}\, p_{.j}\). Hence \[ T=\sum_{i=1}^2\sum_{j=1}^2\frac{(y_{ij}-n\,p_{i.}\, p_{.j})^2}{n\,p_{i.}\, p_{.j}} \] will be large when \(H_0\) is false.

However, \(T\) involves the unknown parameters \(p_{i.}\) and \(p_{.j}\).

Estimating expected frequency

We therefore estimate \(p_{i.}\) and \(p_{.j}\) by the sample proportions \[ \hat p_{i.}= y_{i.}/n,\qquad \hat p_{.j}= y_{.j}/n. \]

Hence, we may use the test statistic \[ T=\sum_{i=1}^2\sum_{j=1}^2\frac{(y_{ij}-n\,\hat p_{i.}\, \hat p_{.j})^2}{n\, \hat p_{i.}\, \hat p_{.j}} =\sum_{i=1}^2\sum_{j=1}^2\frac{(y_{ij}-y_{i.}y_{.j}/n)^2}{y_{i.}y_{.j}/n}. \]

Hypothesis testing workflow

The five steps of the test for independence between the two factors are:

  • Hypothesis:         \(H_0:\) \(p_{ij} = p_{i.}p_{.j}, \quad i=1,2; \ j=1,2\) vs \(H_1:\) Not all equalities hold.
  • Test statistic:       \(T = \sum_{i=1}^2\sum_{j=1}^2\frac{(y_{ij}-y_{i.}y_{.j}/n)^2}{y_{i.}y_{.j}/n}\)
  • Assumptions:     \(E_{ij}=y_{i.}y_{.j}/n \ge 5\). Under \(H_0\), \(T \sim \chi_{4-1-2}^2 = \chi_{1}^2\) approx.
  • P-value:         \(P(T \ge t) \ = \ P(\chi_{1}^2 \ge t)\)
  • Decision:           Reject \(H_0\) if the \(p\)-value \(< \alpha\)

In R

y.mat
##         Survived
## Sex        No  Yes
##   Male   1364  367
##   Female  126  344
chisq.test(y.mat,correct=FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  y.mat
## X-squared = 456.87, df = 1, p-value < 2.2e-16

In R

yr=apply(y.mat,1,sum) # Or try rowSums()
yr
##   Male Female 
##   1731    470
yc=apply(y.mat,2,sum) # Or try colSums()
yc
##   No  Yes 
## 1490  711
(yr.mat=matrix(yr,r,c,byrow=F))
##      [,1] [,2]
## [1,] 1731 1731
## [2,]  470  470
(yc.mat=matrix(yc,r,c,byrow=T))
##      [,1] [,2]
## [1,] 1490  711
## [2,] 1490  711
(ey.mat=yr.mat*yc.mat/sum(y.mat)) # Or try ey.mat = yr%*%t(yc)/n
##           [,1]     [,2]
## [1,] 1171.8264 559.1736
## [2,]  318.1736 151.8264

In R

ey.mat>=5 #test Eij>=5
##      [,1] [,2]
## [1,] TRUE TRUE
## [2,] TRUE TRUE
(chi=sum((y.mat-ey.mat)^2/ey.mat))
## [1] 456.8742
(p.value=pchisq(chi,1,lower.tail=FALSE))
## [1] 2.302151e-101