University of Sydney

Recap

Simple linear regression

Example: Heights


Doctors and parents are often interested in predicting the eventual heights of their children. The following is a portion of the data taken from a study of the heights of boys.


\[ \begin{array}{c|cccccccccccccc} \mbox{Height at age two }(x_i) & 39 &30 &32 &34 &35 &36 &36 &30 &33 &37 &33 &38 &32 &35 \\ \mbox{Height as an adult }(y_i) &71 &63 &63 &67 &68 &68 &70 &64 &65 &68 &66 &70 &64 &69 \end{array} \]


Does knowing a two year old’s height inform what height they will be as an adult?

Example: Heights

x = c(39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35)
y = c(71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69)
plot(x, y)

Linear regression

With repeated experiments, one would find that the values of \(Y\) often vary in a random manner. Hence we use the probabilistic model

\[ Y=\alpha+\beta\, X+\epsilon \]

where \(\epsilon\) is a random variable which follows a probability distribution with mean zero.

  • The model is generally called a (simple linear) regression model or the regression of \(Y\) on \(X\).

  • The constants \(\alpha\) and \(\beta\) are called parameters; they are the intercept and the slope (coefficient) of the linear regression model, respectively.
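As a sketch, the model can be simulated directly in R. The parameter values below (\(\alpha = 36\), \(\beta = 0.9\), error standard deviation 1, loosely matching the heights example) are illustrative only:

```r
# Simulate from Y = alpha + beta*X + epsilon with mean-zero errors.
# The values of alpha, beta and the error sd are illustrative only.
set.seed(1)
n     <- 100
alpha <- 36
beta  <- 0.9
x     <- runif(n, min = 30, max = 40)   # predictor values
eps   <- rnorm(n, mean = 0, sd = 1)     # epsilon with mean zero
y     <- alpha + beta * x + eps         # the probabilistic model
plot(x, y)                              # points scatter about the line
abline(a = alpha, b = beta)             # the true regression line
```

Rerunning without `set.seed()` gives a different scatter each time, which is exactly the random variation in \(Y\) the model describes.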

Note

In practice, the regression function \(f(x)\) may be more complex than a simple linear function. For example, we may have

\[ \begin{eqnarray*} y = f(x)&=&\alpha+\beta_1 x+\beta_2 x^2, \quad \mbox{or} \\ y = f(x)&=& \alpha+\beta x^{\frac{1}{2}}, \quad \mbox{etc.} \end{eqnarray*} \]
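For instance, a quadratic regression is still linear in the parameters, so `lm()` fits it directly. Using the heights data from the example (the quadratic term here is purely for illustration; `I(x^2)` makes `^` arithmetic inside the formula):

```r
# Heights data from the example
x <- c(39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35)
y <- c(71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69)
fit2 <- lm(y ~ x + I(x^2))   # y = alpha + beta1*x + beta2*x^2 + error
coef(fit2)                   # estimates of alpha, beta1, beta2
```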

Fitting a straight line by least squares


Note that for \(Y = \alpha + \beta X + \epsilon\) the expected value or mean of \(Y\) given \(X=x_i\) under the model is \[ E(Y\,|\,X=x_i)=\alpha+\beta\, x_i, \] which is estimated by the fitted value \[ \hat y_i= \hat \alpha+ \hat \beta\, x_i \]


and the residuals of the fitted model when \(Y_i=y_i\) are \[\epsilon_i=y_i - \hat y_i=y_i- \hat \alpha- \hat \beta x_i.\]
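As a sketch, the least-squares estimates can be computed by hand for the heights data, assuming the standard formulas \(\hat\beta = S_{xy}/S_{xx}\) and \(\hat\alpha = \bar y - \hat\beta\,\bar x\) (the \(S\) notation is used again later for \(r^2\)):

```r
x <- c(39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35)
y <- c(71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)
beta_hat  <- Sxy / Sxx                     # slope estimate
alpha_hat <- mean(y) - beta_hat * mean(x)  # intercept estimate
y_hat     <- alpha_hat + beta_hat * x      # fitted values
e         <- y - y_hat                     # residuals
round(c(alpha_hat, beta_hat), 4)           # 35.7280 and 0.9079, as lm() reports
sum(e)                                     # residuals sum to (numerically) zero
```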

Linear regression



When we say we have a linear regression for \(Y\), we mean that \(f(x)\) is a linear function of the unknown parameters \(\alpha, \beta\), etc, not necessarily a linear function of \(x\).

Example: Heights

Doctors and parents are often interested in predicting the eventual heights of their children. The following is a portion of the data taken from a study of the heights of boys.

\[ \begin{array}{c|cccccccccccccc} \mbox{Height at age two }(x_i) & 39 &30 &32 &34 &35 &36 &36 &30 &33 &37 &33 &38 &32 &35 \\ \mbox{Height as an adult }(y_i) &71 &63 &63 &67 &68 &68 &70 &64 &65 &68 &66 &70 &64 &69 \end{array} \]

Does knowing a two year old's height inform what height they will be as an adult?

In R

x = c(39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35)
y = c(71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69)
(fit = lm(y ~ x))
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##     35.7280       0.9079
summary(fit)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7819 -0.6208 -0.0517  0.4713  1.5864 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  35.7280     3.5053  10.193 2.91e-07 ***
## x             0.9079     0.1019   8.908 1.23e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.024 on 12 degrees of freedom
## Multiple R-squared:  0.8686, Adjusted R-squared:  0.8577 
## F-statistic: 79.35 on 1 and 12 DF,  p-value: 1.231e-06
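The fitted model can then answer the opening question. As a sketch, for a (hypothetical) two-year-old who is 33 inches tall:

```r
x <- c(39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35)
y <- c(71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69)
fit <- lm(y ~ x)
# predicted adult height at x = 33 (an illustrative new value)
predict(fit, newdata = data.frame(x = 33))
# the same by hand from the coefficients: 35.7280 + 0.9079 * 33
unname(coef(fit)[1] + coef(fit)[2] * 33)   # about 65.7
```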

Model assumptions


  1. Linearity of data, i.e. \(y_i=\alpha + \beta x_i+\epsilon_i\),

  2. Equality of variance, i.e. a common \(\sigma^2\) independent of \(x_i\),

  3. Independence of residuals, i.e. \(\epsilon_i\) and \(\epsilon_j\) are independent and

  4. Normality of residuals, i.e. \(\epsilon_i \sim {\mathcal N}(0,\sigma^2)\).

Model assumptions


These assumptions may be checked by using

  1. Linearity of data: a plot of \(y_i\) against \(x_i\) with the fitted line overlaid,

  2. Equality of variance: the residual plot of \(\epsilon_i=y_i-\hat y_i\) against the fitted values \(\hat y_i\),

  3. Independence: the residual plot shows random scatter with no pattern, and

  4. Normality of residuals: the normal qq-plot of the residuals \(\epsilon_i\).

Violation of assumptions on residuals:


[Figure: residual plots showing patterns that violate the model assumptions]

Verifying assumptions in R

plot(fit)
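`plot(fit)` steps through R's standard diagnostic plots. The checks listed above can also be drawn individually; a minimal sketch using the heights data:

```r
x <- c(39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35)
y <- c(71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69)
fit <- lm(y ~ x)
par(mfrow = c(1, 3))                           # three plots side by side
plot(x, y); abline(fit)                        # 1. linearity: fitted line plot
plot(fitted(fit), resid(fit)); abline(h = 0)   # 2, 3. residuals vs fitted values
qqnorm(resid(fit)); qqline(resid(fit))         # 4. normality: normal qq-plot
```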

\(R^2\)

  • \(r^2\) is called the coefficient of determination.

  • It measures the proportion of the variance in \(y\) explained by the regression on \(x\).

summary(fit)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7819 -0.6208 -0.0517  0.4713  1.5864 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  35.7280     3.5053  10.193 2.91e-07 ***
## x             0.9079     0.1019   8.908 1.23e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.024 on 12 degrees of freedom
## Multiple R-squared:  0.8686, Adjusted R-squared:  0.8577 
## F-statistic: 79.35 on 1 and 12 DF,  p-value: 1.231e-06

\(R^2\)

Note that \[ \begin{eqnarray*} r&=& \frac{S_{xy}}{\sqrt {S_{xx}\,S_{yy}}}\\ r^2&=& \frac{S_{xy}^2}{S_{xx}\,S_{yy}} = \frac{S_{xy}^2/S_{xx}}{S_{yy}} = \frac {SST}{SST_o} \end{eqnarray*} \]
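These identities can be checked numerically on the heights data; `summary(fit)$r.squared` is the Multiple R-squared reported above:

```r
x <- c(39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35)
y <- c(71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)
Syy <- sum((y - mean(y))^2)
r <- Sxy / sqrt(Sxx * Syy)           # sample correlation
r^2                                  # about 0.8686
cor(x, y)^2                          # same value
summary(lm(y ~ x))$r.squared         # the Multiple R-squared from summary()
```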

Summary