University of Sydney

Example: Children's TV

In a study of the television viewing habits of children, a developmental psychologist selects a random sample of 100 boys and 200 girls of preschool age. Each child is asked which of the following TV programs they like best: Sesame Street or Play School. Results are shown in the table below:

\[ \begin{array}{l|cc|c} & \mbox{Sesame Street} & \mbox{Play School} & \mbox{Row Total} \\ \hline \mbox{Boys} & 42 & 58 & 100 \\ \mbox{Girls} & 86 & 114 & 200 \\ \hline \mbox{Column Total} & 128 & 172 & 300 \\ \end{array} \]

Is there any evidence that the viewing preferences of boys and girls are different?

Test of homogeneity

  • Suppose that several samples are taken from two independent populations, each of which is categorized according to the same set of variables.

  • We want to test whether the probability distributions (proportions) of the categories are the same over the different populations.

Two-way contingency table

\[ \begin{array}{l|cc|c} & \mbox{Sesame Street} & \mbox{Play School} & \mbox{Row Total} \\ \hline \mbox{Boys} & 42 & 58 & 100 \\ \mbox{Girls} & 86 & 114 & 200 \\ \hline \mbox{Column Total} & 128 & 172 & 300 \\ \end{array} \]

  • A contingency table allows us to tabulate data from multiple categorical variables.

  • Contingency tables are heavily used in survey research, business intelligence, engineering and scientific research.

  • We call the above table a two-way contingency table, specifically a \(2 \ \times \ 2\) contingency table.

Two-way contingency table in R

c(head(Sex),tail(Sex))
##  [1] "Boys"  "Boys"  "Boys"  "Boys"  "Boys"  "Boys"  "Girls" "Girls"
##  [9] "Girls" "Girls" "Girls" "Girls"
c(head(Preferences),tail(Preferences))
##  [1] "Play School" "Play School" "Sesame St"   "Play School" "Play School"
##  [6] "Play School" "Sesame St"   "Play School" "Sesame St"   "Play School"
## [11] "Play School" "Play School"
table(Sex, Preferences)
##        Preferences
## Sex     Play School Sesame St
##   Boys           58        42
##   Girls         114        86

Test of homogeneity

Notation: a dot in a subscript denotes summation over that index, so for example \(y_{.1} = y_{11} + y_{21}\)

\[ \begin{array}{l|cc|c} & \mbox{Sesame Street} & \mbox{Play School} & \mbox{Row Total (fixed)} \\ \hline \mbox{Boys} & y_{11} & y_{12} & y_{1.} = \sum_{j=1}^2 y_{1j} = n_1\\ \mbox{Girls} & y_{21} & y_{22} & y_{2.} = \sum_{j=1}^2 y_{2j} = n_2 \\ \hline \mbox{Column Total} & y_{.1} & y_{.2} & n = n_1 + n_2\\ \end{array} \]
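The dot notation can be checked numerically. A minimal sketch in base R, using the counts from the table above (this code is illustrative and not part of the original slides):

```r
# Build the 2x2 table of counts and compute its margins
y <- matrix(c(42, 58, 86, 114), nrow = 2, byrow = TRUE,
            dimnames = list(c("Boys", "Girls"),
                            c("Sesame Street", "Play School")))
rowSums(y)  # y_1. = 100, y_2. = 200 (the fixed sample sizes n_1, n_2)
colSums(y)  # y_.1 = 128, y_.2 = 172
sum(y)      # n = 300
```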

Test of homogeneity


\[ \begin{array}{l|cc|c} & \mbox{Sesame Street} & \mbox{Play School} & \mbox{Row Total (fixed)} \\ \hline \mbox{Boys} & p_{11} & p_{12} & p_{1.} = \sum_{j=1}^2 p_{1j} = 1\\ \mbox{Girls} & p_{21} & p_{22} & p_{2.} = \sum_{j=1}^2 p_{2j} = 1 \\ \hline \mbox{Column Total} & & & \\ \end{array} \]

  • Under the null hypothesis of homogeneity we have \(p_{11} = p_{21}\) and \(p_{12} = p_{22}\), so the two samples can be pooled to estimate the common column probabilities.

Test of homogeneity


\[ \begin{array}{l|cc|c} & \mbox{Sesame Street} & \mbox{Play School} & \mbox{Row Total (fixed)} \\ \hline \mbox{Boys} & y_{11} & y_{12} & y_{1.} = \sum_{j=1}^2 y_{1j} = n_1\\ \mbox{Girls} & y_{21} & y_{22} & y_{2.} = \sum_{j=1}^2 y_{2j} = n_2 \\ \hline \mbox{Column Total} & y_{.1} & y_{.2} & n = n_1 + n_2\\ \end{array} \]

Under the null hypothesis of homogeneity we have \(p_{11} = p_{21}\) and \(p_{12} = p_{22}\), so we estimate the common cell probabilities by the pooled column proportions

\[\hat p_{ij} = y_{.j}/n.\]
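As a quick numerical illustration (not in the original slides), the pooled estimates for the children's TV data can be computed in base R:

```r
# Pooled column proportions under H0 (homogeneity): hat p_.j = y_.j / n
y <- matrix(c(42, 58, 86, 114), nrow = 2, byrow = TRUE)
n <- sum(y)              # n = 300
p.hat <- colSums(y)/n    # hat p_.1 = 128/300, hat p_.2 = 172/300
p.hat
```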

Chi-square test of homogeneity

\[ \begin{eqnarray*} T &=& \sum_{i=1}^2\sum_{j=1}^2\frac{(y_{ij}-n_i\hat p_{ij})^2}{n_i\hat p_{ij}} \\ &=&\sum_{i=1}^2\sum_{j=1}^2\frac{(y_{ij}-y_{i.}\hat p_{ij})^2}{y_{i.}\hat p_{ij}} \\ &=&\sum_{i=1}^2\sum_{j=1}^2\frac{(y_{ij}-y_{i.}y_{.j}/n)^2}{y_{i.}y_{.j}/n}. \end{eqnarray*} \]


  • Under \(H_0\), \(T \sim \chi^2_1\) approximately.

Hypothesis testing workflow

The Chi-square test of homogeneity for a \(2 \ \times \ 2\) contingency table is:

  • Hypothesis:         \(H_0:\) \(p_{11} = p_{21} \ \& \ p_{12} = p_{22}\) vs \(H_1\): \(p_{11} \ne p_{21}\) (equivalently \(p_{12} \ne p_{22}\))
  • Test statistic:       \(T = \sum_{i=1}^2\sum_{j=1}^2\frac{(y_{ij}-y_{i.}y_{.j}/n)^2}{y_{i.}y_{.j}/n}\)
  • Assumptions:     \(E_{ij}=y_{i.}y_{.j}/n \ge 5\). Under \(H_0\), \(T \sim \chi_{1}^2\) approx.
  • P-value:         \(P(T \ge t) \ = \ P(\chi_{1}^2 \ge t)\)
  • Decision:           Reject \(H_0\) if the \(p\)-value \(< \alpha\)

Example: Children's TV

In a study of the television viewing habits of children, a developmental psychologist selects a random sample of 100 boys and 200 girls of preschool age. Each child is asked which of the following TV programs they like best: Sesame Street or Play School. Results are shown in the table below:

\[ \begin{array}{l|cc|c} & \mbox{Sesame Street} & \mbox{Play School} & \mbox{Row Total} \\ \hline \mbox{Boys} & 42 & 58 & 100 \\ \mbox{Girls} & 86 & 114 & 200 \\ \hline \mbox{Column Total} & 128 & 172 & 300 \\ \end{array} \]

Is there any evidence that the viewing preferences of boys and girls are different?

Example: Children's TV

The Chi-square test of homogeneity of the viewing habits of boys and girls is:

  • Hypothesis:         \(H_0:\) \(p_{11} = p_{21} \ \& \ p_{12} = p_{22}\) vs \(H_1\): \(p_{11} \ne p_{21}\) (equivalently \(p_{12} \ne p_{22}\))
  • Test statistic:       \(T = \sum_{i=1}^2\sum_{j=1}^2\frac{(y_{ij}-y_{i.}y_{.j}/n)^2}{y_{i.}y_{.j}/n} = 0.027\)
  • Assumptions:     \(E_{ij}=y_{i.}y_{.j}/n \ge 5\). Under \(H_0\), \(T \sim \chi_{1}^2\) approx.
  • P-value:         \(P(T \ge t) \ = \ P(\chi_{1}^2 \ge 0.027) = 0.87\)
  • Decision:           Since the \(p\)-value \(= 0.87\) is large, there is not enough evidence to reject the null hypothesis.

In R

y.mat=table(Sex, Preferences)
n = sum(y.mat)
r = c = 2
y.mat
##        Preferences
## Sex     Play School Sesame St
##   Boys           58        42
##   Girls         114        86
chisq.test(y.mat,correct=FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  y.mat
## X-squared = 0.027253, df = 1, p-value = 0.8689

In R

yr=apply(y.mat,1,sum) #checking
yr
##  Boys Girls 
##   100   200
yc=apply(y.mat,2,sum) # Or try colSums()
yc
## Play School   Sesame St 
##         172         128
(yr.mat=matrix(yr,r,c,byrow=F))
##      [,1] [,2]
## [1,]  100  100
## [2,]  200  200
(yc.mat=matrix(yc,r,c,byrow=T))
##      [,1] [,2]
## [1,]  172  128
## [2,]  172  128
(ey.mat=yr.mat*yc.mat/n) # Or try ey.mat = yr%*%t(yc)/n
##           [,1]     [,2]
## [1,]  57.33333 42.66667
## [2,] 114.66667 85.33333

In R

ey.mat>=5 #test Eij>=5
##      [,1] [,2]
## [1,] TRUE TRUE
## [2,] TRUE TRUE
(chi=sum((y.mat-ey.mat)^2/ey.mat))
## [1] 0.02725291
(p.value=1-pchisq(chi,1))
## [1] 0.8688774

Tests for independence

Example: Titanic

  • The 'Titanic' dataset comes preloaded in R.

  • This data set provides information on the fate of passengers on the fatal maiden voyage of the ocean liner 'Titanic', summarized according to economic status (class), sex, age and survival.

  • Is there any evidence that being a woman increased your chance of survival on the ship?

x = as.data.frame(Titanic)
y.mat = xtabs(Freq ~ Sex + Survived,x)
y.mat
##         Survived
## Sex        No  Yes
##   Male   1364  367
##   Female  126  344

Tests for independence

  • There are situations in which a single sample is categorized according to two or more factors.

  • It is of interest to know whether the factors for the classification are independent.

Frequency table


\[ \begin{array}{l|cc|c} & \mbox{Survived} & \mbox{Did not survive} & \mbox{Row Total} \\ \hline \mbox{Male} & y_{11} & y_{12} & y_{1.}\\ \mbox{Female} & y_{21} & y_{22} & y_{2.}\\ \hline \mbox{Column Total} & y_{.1} & y_{.2} & n\\ \end{array} \]

Table of proportions

Let \(p_{ij}\) denote the probability of an observation falling in the \((i,j)^{th}\) category.

The marginal row and column probabilities are respectively:

\[ p_{i.}=\sum_{j=1}^2 p_{ij}\quad \mbox{and}\quad p_{.j}=\sum_{i=1}^2p_{ij} \]

\[ \begin{array}{l|cc|c} & \mbox{Survived} & \mbox{Did not survive} & \mbox{Row Total} \\ \hline \mbox{Male} & p_{11} & p_{12} & p_{1.}\\ \mbox{Female} & p_{21} & p_{22} & p_{2.}\\ \hline \mbox{Column Total} & p_{.1} & p_{.2} & 1\\ \end{array} \]
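The joint and marginal proportions can be computed directly in base R. A sketch using the Titanic counts introduced above (code not in the original slides):

```r
# Joint proportions p_ij = y_ij / n and their margins
y <- matrix(c(1364, 367, 126, 344), nrow = 2, byrow = TRUE,
            dimnames = list(Sex = c("Male", "Female"),
                            Survived = c("No", "Yes")))
p <- prop.table(y)  # p_ij = y_ij / n
rowSums(p)          # marginal row probabilities p_1. and p_2.
colSums(p)          # marginal column probabilities p_.1 and p_.2
sum(p)              # all cell probabilities sum to 1
```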

Independence



Statistical thinking:

  • What does independence mean?
  • Think of two things that are independent, explain why they are independent.
  • How do you know they are independent?

Independence

\(X\) and \(Y\) are said to be independent if

\[P(X=x|Y=y) \ = \ P(X=x)\]

or

\[P(X=x, Y=y) \ = \ P(X=x)P(Y=y)\]



\[ \begin{array}{l|cc|c} & \mbox{Survived} & \mbox{Did not survive} & \mbox{Row Total} \\ \hline \mbox{Male} & p_{11} & p_{12} & p_{1.}\\ \mbox{Female} & p_{21} & p_{22} & p_{2.}\\ \hline \mbox{Column Total} & p_{.1} & p_{.2} & 1\\ \end{array} \]
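Independence says each cell probability factors as \(p_{ij} = p_{i.}p_{.j}\). A quick numerical check on the Titanic proportions (illustrative code, not from the original slides) makes the departure from independence visible:

```r
# Compare observed cell proportions with the product of the margins
y <- matrix(c(1364, 367, 126, 344), nrow = 2, byrow = TRUE)
p <- prop.table(y)
round(outer(rowSums(p), colSums(p)), 3)  # p_i. * p_.j: cells under independence
round(p, 3)                              # observed p_ij -- clearly different here
```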

Expected frequency

Under \(H_0\) of independence, the expected frequency in cell \((i,j)\) is \(n\,p_{ij}=n\,p_{i.}\, p_{.j}\). Hence \[ T=\sum_{i=1}^2\sum_{j=1}^2\frac{(y_{ij}-n\,p_{i.}\, p_{.j})^2}{n\,p_{i.}\, p_{.j}} \] will be large when \(H_0\) is false.

However, \(T\) involves the unknown parameters \(p_{i.}\) and \(p_{.j}\).

Estimating expected frequency

We therefore estimate \(p_{i.}\) and \(p_{.j}\) by the sample proportions \[ \hat p_{i.}= y_{i.}/n,\qquad \hat p_{.j}= y_{.j}/n. \]

Hence, we may use the test statistic \[ T=\sum_{i=1}^2\sum_{j=1}^2\frac{(y_{ij}-n\,\hat p_{i.}\, \hat p_{.j})^2}{n\, \hat p_{i.}\, \hat p_{.j}} =\sum_{i=1}^2\sum_{j=1}^2\frac{(y_{ij}-y_{i.}y_{.j}/n)^2}{y_{i.}y_{.j}/n}. \]

Hypothesis testing workflow

The five steps of the test for independence between the two factors are:

  • Hypothesis:         \(H_0:\) \(p_{ij} = p_{i.}p_{.j}, \quad i=1,2; \ j=1,2\) vs \(H_1:\) Not all equalities hold.
  • Test statistic:       \(T = \sum_{i=1}^2\sum_{j=1}^2\frac{(y_{ij}-y_{i.}y_{.j}/n)^2}{y_{i.}y_{.j}/n}\)
  • Assumptions:     \(E_{ij}=y_{i.}y_{.j}/n \ge 5\). Under \(H_0\), \(T \sim \chi_{4-1-2}^2 = \chi_{1}^2\) approx.
  • P-value:         \(P(T \ge t) \ = \ P(\chi_{1}^2 \ge t)\)
  • Decision:           Reject \(H_0\) if the \(p\)-value \(< \alpha\)

In R

y.mat
##         Survived
## Sex        No  Yes
##   Male   1364  367
##   Female  126  344
chisq.test(y.mat,correct=FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  y.mat
## X-squared = 456.87, df = 1, p-value < 2.2e-16

In R

yr=apply(y.mat,1,sum) # Or try rowSums()
yr
##   Male Female 
##   1731    470
yc=apply(y.mat,2,sum) # Or try colSums()
yc
##   No  Yes 
## 1490  711
(yr.mat=matrix(yr,r,c,byrow=F))
##      [,1] [,2]
## [1,] 1731 1731
## [2,]  470  470
(yc.mat=matrix(yc,r,c,byrow=T))
##      [,1] [,2]
## [1,] 1490  711
## [2,] 1490  711
(ey.mat=yr.mat*yc.mat/sum(y.mat)) # Or try ey.mat = yr%*%t(yc)/n
##           [,1]     [,2]
## [1,] 1171.8264 559.1736
## [2,]  318.1736 151.8264

In R

ey.mat>=5 #test Eij>=5
##      [,1] [,2]
## [1,] TRUE TRUE
## [2,] TRUE TRUE
(chi=sum((y.mat-ey.mat)^2/ey.mat))
## [1] 456.8742
(p.value=pchisq(chi,1,lower.tail=FALSE))
## [1] 2.302151e-101