University of Sydney
Doctors and parents are often interested in predicting the eventual heights of their children. The following is a portion of the data taken from a study of the heights of boys.
\[ \begin{array}{c|cccccccccccccc} \mbox{Height at age two }(x_i) & 39 &30 &32 &34 &35 &36 &36 &30 &33 &37 &33 &38 &32 &35 \\ \mbox{Height as an adult }(y_i) &71 &63 &63 &67 &68 &68 &70 &64 &65 &68 &66 &70 &64 &69 \end{array} \]
Does knowing a two year old’s height inform what height they will be as an adult?
x = c(39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35)
y = c(71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69)
plot(x, y)
With repeated experiments, one would find that the values of \(Y\) vary in a random manner. Hence we adopt the probabilistic model
\[ Y=\alpha+\beta\, X+\epsilon \]
where \(\epsilon\) is a random variable which follows a probability distribution with mean zero.
The model is generally called a (simple linear) regression model or the regression of \(Y\) on \(X\).
The quantities \(\alpha\) and \(\beta\) are called parameters: they are, respectively, the intercept and the slope (coefficient) of the linear regression model.
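To see what this model says, here is a minimal sketch (not part of the study data; the parameter values and sample size are illustrative) simulating observations from \(Y=\alpha+\beta X+\epsilon\) with normal errors:

```r
# Simulate data from Y = alpha + beta*X + epsilon, epsilon ~ N(0, sigma^2).
# alpha, beta, sigma and the x-range below are illustrative choices only.
set.seed(1)                          # reproducibility
alpha <- 36; beta <- 0.9; sigma <- 1
x <- runif(50, 30, 40)               # 50 hypothetical heights at age two
eps <- rnorm(50, mean = 0, sd = sigma)
y <- alpha + beta * x + eps          # responses scatter about the true line
plot(x, y)
abline(alpha, beta)                  # the underlying regression line
```

Each simulated point deviates from the line \(\alpha+\beta x\) only through the random error \(\epsilon\).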
In practice, the regression function \(f(x)\) may be more complex than a simple linear function. For example, we may have
\[ \begin{eqnarray*} y = f(x)&=&\alpha+\beta_1 x+\beta_2 x^2, \quad \mbox{or} \\ y = f(x)&=& \alpha+\beta x^{\frac{1}{2}}, \quad \mbox{etc.} \end{eqnarray*} \]
Note that for \(Y = \alpha + \beta X + \epsilon\) the expected value, or mean, of \(Y_i\) given \(X=x_i\) under the model is \[ E(Y_i|X=x_i)= \alpha+ \beta\, x_i, \] which is estimated by the fitted value \[ \hat y_i= \hat \alpha+ \hat \beta\, x_i, \]
and the residuals of the fitted model when \(Y_i=y_i\) are \[\hat\epsilon_i=y_i - \hat y_i=y_i- \hat \alpha- \hat \beta x_i.\]
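The estimates \(\hat\alpha\) and \(\hat\beta\) can be computed directly from the heights data using the standard least-squares formulas \(\hat\beta = S_{xy}/S_{xx}\) and \(\hat\alpha = \bar y - \hat\beta\,\bar x\) (a sketch; the derivation of these formulas is not shown here):

```r
x <- c(39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35)
y <- c(71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69)

Sxy <- sum((x - mean(x)) * (y - mean(y)))   # sum of cross-deviations
Sxx <- sum((x - mean(x))^2)                 # sum of squared x-deviations

beta_hat  <- Sxy / Sxx                      # slope estimate
alpha_hat <- mean(y) - beta_hat * mean(x)   # intercept estimate

y_hat <- alpha_hat + beta_hat * x           # fitted values
e     <- y - y_hat                          # residuals
round(c(alpha_hat, beta_hat), 4)            # approx 35.728 and 0.908
```

These values agree with the `lm()` output shown later in these notes.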
When we say we have a linear regression for \(Y\), we mean that \(f(x)\) is a linear function of the unknown parameters \(\alpha, \beta\), etc., not necessarily a linear function of \(x\).
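For instance, the quadratic model \(y=\alpha+\beta_1 x+\beta_2 x^2\) above is still linear in its parameters, so `lm()` fits it directly. A sketch on the heights data (the quadratic term is for illustration only, not a claim that it improves this fit):

```r
x <- c(39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35)
y <- c(71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69)

# I(x^2) tells lm() to treat x^2 as an additional regressor;
# the model y = alpha + beta1*x + beta2*x^2 + eps is linear in
# (alpha, beta1, beta2) even though it is quadratic in x.
fit_quad <- lm(y ~ x + I(x^2))
coef(fit_quad)   # three estimated parameters
```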
Returning to the heights of boys data: does knowing a two year old's height inform what height they will be as an adult? We fit the regression of \(y\) on \(x\) in R.
x = c(39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35)
y = c(71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69)
(fit = lm(y ~ x))
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept)            x
##     35.7280       0.9079
summary(fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.7819 -0.6208 -0.0517  0.4713  1.5864
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  35.7280     3.5053  10.193 2.91e-07 ***
## x             0.9079     0.1019   8.908 1.23e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.024 on 12 degrees of freedom
## Multiple R-squared:  0.8686, Adjusted R-squared:  0.8577
## F-statistic: 79.35 on 1 and 12 DF,  p-value: 1.231e-06
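The fitted model can then be used to answer the opening question. A sketch using R's `predict()` on the fitted object (the age-two height of 33 inches is an arbitrary illustrative value):

```r
x <- c(39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35)
y <- c(71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69)
fit <- lm(y ~ x)

# Predicted adult height for a boy who is 33 inches tall at age two,
# i.e. alpha_hat + beta_hat * 33 (about 65.7 inches)
predict(fit, newdata = data.frame(x = 33))
```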
The regression model relies on four assumptions:
Linearity of data, i.e. \(y_i=\alpha + \beta x_i+\epsilon_i\),
Equality of variance, i.e. a common \(\sigma^2\) independent of \(x_i\),
Independence of residuals, i.e. \(\epsilon_i\) and \(\epsilon_j\) are independent for \(i \neq j\), and
Normality of residuals, i.e. \(\epsilon_i \sim {\mathcal N}(0,\sigma^2)\).
These assumptions may be checked by using
Linearity of data: the plot of \(y_{i}\) against \(x_i\) with the fitted line superimposed,
Equality of variance: the residual plot of \(\epsilon_i=y_i-\hat y_i\) against the fitted values \(\hat y_i\),
Independence: the residual plot shows random scatter with no pattern, and
Normality of residuals: the normal qq-plot of the residuals \(\epsilon_i\).
plot(fit)
\(r^2\) is the coefficient of determination: the proportion of the variance in \(y\) explained by \(x\).
summary(fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.7819 -0.6208 -0.0517  0.4713  1.5864
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  35.7280     3.5053  10.193 2.91e-07 ***
## x             0.9079     0.1019   8.908 1.23e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.024 on 12 degrees of freedom
## Multiple R-squared:  0.8686, Adjusted R-squared:  0.8577
## F-statistic: 79.35 on 1 and 12 DF,  p-value: 1.231e-06
Note that \[ \begin{eqnarray*} r&=& \frac{S_{xy}}{\sqrt {S_{xx}\,S_{yy}}},\\ r^2&=& \frac{S_{xy}^2}{S_{xx}\,S_{yy}} = \frac{S_{xy}^2/S_{xx}}{S_{yy}} = \frac {SST}{SST_o}. \end{eqnarray*} \]
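These identities are easy to verify numerically on the heights data. A sketch computing \(r^2\) three ways, which should all agree with the `Multiple R-squared` value of 0.8686 in the summary output:

```r
x <- c(39, 30, 32, 34, 35, 36, 36, 30, 33, 37, 33, 38, 32, 35)
y <- c(71, 63, 63, 67, 68, 68, 70, 64, 65, 68, 66, 70, 64, 69)

Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)
Syy <- sum((y - mean(y))^2)

r <- Sxy / sqrt(Sxx * Syy)             # sample correlation

# r^2 via the formula, via cor(), and via lm() -- all approx 0.8686
c(r^2, cor(x, y)^2, summary(lm(y ~ x))$r.squared)
```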