University of Sydney

Recap

Simple linear regression

Example: Heights


Doctors and parents are often interested in predicting the eventual heights of their children. The following is a portion of the data taken from a study of the heights of boys.


\[ \begin{array}{c|cccccccccccccc} \mbox{Height at age two }(x_i) & 39 &30 &32 &34 &35 &36 &36 &30 &33 &37 &33 &38 &32 &35 \\ \mbox{Height as an adult }(y_i) &71 &63 &63 &67 &68 &68 &70 &64 &65 &68 &66 &70 &64 &69 \end{array} \]


Does knowing a two year old’s height inform what height they will be as an adult?

Example: Heights

x = c(39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35)
y = c(71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69)
plot(x, y)

Linear regression

With repeated experiments, one would find that the values of \(Y\) often vary in a random manner. Hence we use the probabilistic model

\[ Y=\alpha+\beta\, X+\epsilon \]

where \(\epsilon\) is a random variable which follows a probability distribution with mean zero.

  • The model is generally called a (simple linear) regression model or the regression of \(Y\) on \(X\).

  • The constants \(\alpha\) and \(\beta\) are called parameters; they are the intercept and the slope (coefficient) of the linear regression model, respectively.
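As a sketch, the model can be simulated directly in R. The parameter values below (\(\alpha = 36\), \(\beta = 0.9\), error standard deviation 1, loosely matching the heights example) are illustrative only:

```r
# Simulate from Y = alpha + beta*X + epsilon with mean-zero errors.
# The values of alpha, beta and the error sd are illustrative only.
set.seed(1)
n     <- 100
alpha <- 36
beta  <- 0.9
x     <- runif(n, min = 30, max = 40)   # predictor values
eps   <- rnorm(n, mean = 0, sd = 1)     # epsilon with mean zero
y     <- alpha + beta * x + eps         # the probabilistic model
plot(x, y)                              # points scatter about the line
abline(a = alpha, b = beta)             # the true regression line
```

Rerunning without `set.seed()` gives a different scatter each time, which is exactly the random variation in \(Y\) the model describes.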

Note

In practice, the regression function \(f(x)\) may be more complex than a simple linear function. For example, we may have

\[ \begin{eqnarray*} y = f(x)&=&\alpha+\beta_1 x+\beta_2 x^2, \quad \mbox{or} \\ y = f(x)&=& \alpha+\beta x^{\frac{1}{2}}, \quad \mbox{etc.} \end{eqnarray*} \]
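For instance, a quadratic regression is still linear in the parameters, so `lm()` fits it directly. Using the heights data from the example (the quadratic term here is purely for illustration; `I(x^2)` makes `^` arithmetic inside the formula):

```r
# Heights data from the example
x <- c(39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35)
y <- c(71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69)
fit2 <- lm(y ~ x + I(x^2))   # y = alpha + beta1*x + beta2*x^2 + error
coef(fit2)                   # estimates of alpha, beta1, beta2
```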

Fitting a straight line by least squares


Note that for \(Y = \alpha + \beta X + \epsilon\) the expected value or mean of \(Y\) given \(X=x_i\) under the model is \[ E(Y\,|\,X=x_i)=\alpha+\beta\, x_i, \] which is estimated by the fitted value \[ \hat y_i= \hat \alpha+ \hat \beta\, x_i \]


and the residuals of the fitted model when \(Y_i=y_i\) are \[\epsilon_i=y_i - \hat y_i=y_i- \hat \alpha- \hat \beta x_i.\]
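As a sketch, the least-squares estimates can be computed by hand for the heights data, assuming the standard formulas \(\hat\beta = S_{xy}/S_{xx}\) and \(\hat\alpha = \bar y - \hat\beta\,\bar x\) (the \(S\) notation is used again later for \(r^2\)):

```r
x <- c(39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35)
y <- c(71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)
beta_hat  <- Sxy / Sxx                     # slope estimate
alpha_hat <- mean(y) - beta_hat * mean(x)  # intercept estimate
y_hat     <- alpha_hat + beta_hat * x      # fitted values
e         <- y - y_hat                     # residuals
round(c(alpha_hat, beta_hat), 4)           # 35.7280 and 0.9079, as lm() reports
sum(e)                                     # residuals sum to (numerically) zero
```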

Linear regression



When we say we have a linear regression for \(Y\), we mean that \(f(x)\) is a linear function of the unknown parameters \(\alpha, \beta\), etc, not necessarily a linear function of \(x\).

Example: Heights

Doctors and parents are often interested in predicting the eventual heights of their children. The following is a portion of the data taken from a study of the heights of boys.

\[ \begin{array}{c|cccccccccccccc} \mbox{Height at age two }(x_i) & 39 &30 &32 &34 &35 &36 &36 &30 &33 &37 &33 &38 &32 &35 \\ \mbox{Height as an adult }(y_i) &71 &63 &63 &67 &68 &68 &70 &64 &65 &68 &66 &70 &64 &69 \end{array} \]

Does knowing a two year old's height inform what height they will be as an adult?

In R

x = c(39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35)
y = c(71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69)
(fit = lm(y ~ x))
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##     35.7280       0.9079
summary(fit)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7819 -0.6208 -0.0517  0.4713  1.5864 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  35.7280     3.5053  10.193 2.91e-07 ***
## x             0.9079     0.1019   8.908 1.23e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.024 on 12 degrees of freedom
## Multiple R-squared:  0.8686, Adjusted R-squared:  0.8577 
## F-statistic: 79.35 on 1 and 12 DF,  p-value: 1.231e-06
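The fitted model can then answer the opening question. As a sketch, for a (hypothetical) two-year-old who is 33 inches tall:

```r
x <- c(39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35)
y <- c(71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69)
fit <- lm(y ~ x)
# predicted adult height at x = 33 (an illustrative new value)
predict(fit, newdata = data.frame(x = 33))
# the same by hand from the coefficients: 35.7280 + 0.9079 * 33
unname(coef(fit)[1] + coef(fit)[2] * 33)   # about 65.7
```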

Model assumptions


  1. Linearity of data, i.e. \(y_i=\alpha + \beta x_i+\epsilon_i\),

  2. Equality of variance, i.e. a common \(\sigma^2\) independent of \(x_i\),

  3. Independence of residuals, i.e. \(\epsilon_i\) and \(\epsilon_j\) are independent and

  4. Normality of residuals, i.e. \(\epsilon_i \sim {\mathcal N}(0,\sigma^2)\).

Model assumptions


These assumptions may be checked by using

  1. Linearity of data: a plot of \(y_i\) against \(x_i\) with the fitted line overlaid,

  2. Equality of variance: the residual plot of \(\epsilon_i=y_i-\hat y_i\) against the fitted values \(\hat y_i\),

  3. Independence: the residual plot shows random scatter with no pattern, and

  4. Normality of residuals: the normal qq-plot of the residuals \(\epsilon_i\).

Violation of assumptions on residuals:


[Figure: residual plots showing patterns that violate the model assumptions]

Verifying assumptions in R

plot(fit)
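`plot(fit)` steps through R's standard diagnostic plots. The checks listed above can also be drawn individually; a minimal sketch using the heights data:

```r
x <- c(39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35)
y <- c(71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69)
fit <- lm(y ~ x)
par(mfrow = c(1, 3))                           # three plots side by side
plot(x, y); abline(fit)                        # 1. linearity: fitted line plot
plot(fitted(fit), resid(fit)); abline(h = 0)   # 2, 3. residuals vs fitted values
qqnorm(resid(fit)); qqline(resid(fit))         # 4. normality: normal qq-plot
```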

\(R^2\)

  • \(r^2\) is called the coefficient of determination.

  • It measures the proportion of the variance in \(y\) explained by the regression on \(x\).

summary(fit)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7819 -0.6208 -0.0517  0.4713  1.5864 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  35.7280     3.5053  10.193 2.91e-07 ***
## x             0.9079     0.1019   8.908 1.23e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.024 on 12 degrees of freedom
## Multiple R-squared:  0.8686, Adjusted R-squared:  0.8577 
## F-statistic: 79.35 on 1 and 12 DF,  p-value: 1.231e-06

\(R^2\)

Note that \[ \begin{eqnarray*} r&=& \frac{S_{xy}}{\sqrt {S_{xx}\,S_{yy}}}\\ r^2&=& \frac{S_{xy}^2}{S_{xx}\,S_{yy}} = \frac{S_{xy}^2/S_{xx}}{S_{yy}} = \frac {SST}{SST_o} \end{eqnarray*} \]
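These identities can be checked numerically on the heights data; `summary(fit)$r.squared` is the Multiple R-squared reported above:

```r
x <- c(39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35)
y <- c(71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)
Syy <- sum((y - mean(y))^2)
r <- Sxy / sqrt(Sxx * Syy)           # sample correlation
r^2                                  # about 0.8686
cor(x, y)^2                          # same value
summary(lm(y ~ x))$r.squared         # the Multiple R-squared from summary()
```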

Summary