Name: Ellis Patrick
SID: 12345678
Dr P. J. Solomon and the Australian National Centre in HIV Epidemiology and Clinical Research performed a study on HIV transmission and survival. They recorded the following characteristics of all patients who entered an HIV clinic over the course of the study.
Variable | Description |
---|---|
state | Grouped state of origin: “NSW” includes ACT and “other” is WA, SA, NT and TAS. |
sex | Sex of patient. |
diag | (Julian) date of diagnosis. |
death | (Julian) date of death or end of observation. |
status | “A” (alive) or “D” (dead) at end of observation. |
T.categ | Reported transmission category. |
age | Age (years) at diagnosis. |
To load the data use the link http://www.maths.usyd.edu.au/u/ellisp/AMED3002/HIV.csv as follows:
HIV = read.csv('http://www.maths.usyd.edu.au/u/ellisp/AMED3002/HIV.csv')
Answer: There are 7 variables and 2843 observations.
head(HIV)
## state sex diag death status T.categ age
## 1 NSW M 10905 11081 D hs 35
## 2 NSW M 11029 11096 D hs 53
## 3 NSW M 9551 9983 D hs 42
## 4 NSW M 9577 9654 D haem 44
## 5 NSW M 10015 10290 D hs 39
## 6 NSW M 9971 10344 D hs 36
dim(HIV)
## [1] 2843 7
Answer: All of the variables appear to be stored appropriately in R: the categorical variables are factors, and the dates and age are integers.
sapply(HIV,class)
## state sex diag death status T.categ age
## "factor" "factor" "integer" "integer" "factor" "factor" "integer"
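The classes above reflect `read.csv` converting text columns to factors, which was the default before R 4.0.0. From R 4.0.0 onward the default is `stringsAsFactors = FALSE`, so the same call would report `"character"` for those columns. A minimal sketch of making the conversion explicit follows. It uses a small stand-in data frame (`hiv_head`, a hypothetical name) built from the six rows printed by `head(HIV)`, not the full data.

```r
# Stand-in for HIV: the six rows printed by head(HIV) above.
hiv_head <- data.frame(
  state   = rep("NSW", 6),
  sex     = rep("M", 6),
  diag    = c(10905L, 11029L, 9551L, 9577L, 10015L, 9971L),
  death   = c(11081L, 11096L, 9983L, 9654L, 10290L, 10344L),
  status  = rep("D", 6),
  T.categ = c("hs", "hs", "hs", "haem", "hs", "hs"),
  age     = c(35L, 53L, 42L, 44L, 39L, 36L),
  stringsAsFactors = FALSE
)

# Convert the categorical columns to factors explicitly.
cat_cols <- c("state", "sex", "status", "T.categ")
hiv_head[cat_cols] <- lapply(hiv_head[cat_cols], factor)

# Check the class of every column.
sapply(hiv_head, class)
```

On the full data, passing `stringsAsFactors = TRUE` to `read.csv` should have the same effect.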
Answer: No, there are no missing values.
sum(is.na(HIV))
## [1] 0
The researchers are interested in the characteristics of the people being diagnosed in each state. Specifically, they would like to know whether the numbers of men and women diagnosed differed between states.
Answer: Sex and state should both be categorical.
Answer: A chi-square test.
Answer: The null hypothesis is that there is no relationship between sex and state; the alternative is that there is a relationship.
Answer: There are a lot more men than women in the data set. NSW appears to have the most subjects.
table(HIV$state,HIV$sex)
##
## F M
## NSW 54 1726
## Other 13 236
## QLD 9 217
## VIC 13 575
chisq.test(table(HIV$state,HIV$sex))
##
## Pearson's Chi-squared test
##
## data: table(HIV$state, HIV$sex)
## X-squared = 5.8235, df = 3, p-value = 0.1205
Answer: As the p-value is greater than 0.05 I would conclude that there is not enough evidence to reject the hypothesis that there is no relationship between sex and state.
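To show where the test statistic comes from, the sketch below recomputes it by hand from the observed counts in the table above. The object names (`obs`, `expected`, `x2`) are illustrative only. No continuity correction is applied, which matches what `chisq.test` does for tables larger than 2 x 2.

```r
# Observed counts copied from table(HIV$state, HIV$sex) above.
obs <- matrix(
  c(54, 1726,
    13,  236,
     9,  217,
    13,  575),
  nrow = 4, byrow = TRUE,
  dimnames = list(state = c("NSW", "Other", "QLD", "VIC"), sex = c("F", "M"))
)

# Expected count under independence: row total * column total / grand total.
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson statistic and its degrees of freedom, (rows - 1) * (columns - 1).
x2 <- sum((obs - expected)^2 / expected)
df <- (nrow(obs) - 1) * (ncol(obs) - 1)
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

round(c(X.squared = x2, df = df, p.value = p_value), 4)
```

The rounded values should agree with the `chisq.test` output above (X-squared = 5.8235, df = 3, p-value = 0.1205).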
Answer: We assume that all the observations are independent and that the expected count for each cell is greater than five. The smallest expected count below is about 7.07, so the second assumption appears to hold.
chisq.test(table(HIV$state,HIV$sex))$expected
##
## F M
## NSW 55.722828 1724.2772
## Other 7.794935 241.2051
## QLD 7.074921 218.9251
## VIC 18.407316 569.5927
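The expected-count condition can also be checked programmatically rather than by eye. This is a small sketch that rebuilds the table from the counts printed earlier (`obs` is an illustrative name) so that it runs without downloading the data.

```r
# Observed counts copied from table(HIV$state, HIV$sex).
obs <- matrix(
  c(54, 1726,
    13,  236,
     9,  217,
    13,  575),
  nrow = 4, byrow = TRUE,
  dimnames = list(state = c("NSW", "Other", "QLD", "VIC"), sex = c("F", "M"))
)

# Expected counts under the null hypothesis of independence.
expected <- chisq.test(obs)$expected

min(expected)      # smallest expected cell count
all(expected > 5)  # TRUE when every cell satisfies the rule of thumb
```

If some cell had failed the check, one option would be `chisq.test(obs, simulate.p.value = TRUE)`, which estimates the p-value by Monte Carlo simulation instead of relying on the chi-square approximation.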
The researchers would like to know if there is some difference between the states in the outcomes for HIV patients. They decide that they would like to see if the time between diagnosis and death are different between states.
HIVD = HIV[HIV$status=='D',]
S = HIVD$death-HIVD$diag
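As a sketch of what the two lines above compute, the same steps are applied here to the six patients shown by `head(HIV)`. The names `hiv_head` and `hivd` are hypothetical. Storing the survival time as a column rather than a separate vector is a design choice made for this sketch, not something the original analysis does, and the result is assumed to be in days because the dates are Julian day counts.

```r
# Stand-in data: the relevant columns of the six rows printed by head(HIV).
hiv_head <- data.frame(
  state  = rep("NSW", 6),
  diag   = c(10905L, 11029L, 9551L, 9577L, 10015L, 9971L),
  death  = c(11081L, 11096L, 9983L, 9654L, 10290L, 10344L),
  status = rep("D", 6),
  stringsAsFactors = FALSE
)

# Keep only the patients who died during observation.
hivd <- subset(hiv_head, status == "D")

# Time from diagnosis to death, kept alongside the other variables.
hivd$S <- hivd$death - hivd$diag
hivd$S
# 176  67 432  77 275 373
```

With `S` stored in the data frame, the model on the full data could be written as `aov(S ~ state, data = HIVD)`, which gives cleaner term labels than `HIVD$state`.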
Answer: There might be a few outliers in all of the states. The variances of each state are reasonably similar. The data could be right-skewed for each state. The median of Queensland appears to be the lowest and Victoria the highest.
boxplot(S~HIVD$state)
fit = aov(S~HIVD$state)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## HIVD$state 3 1477348 492449 5.124 0.00157 **
## Residuals 1757 168870047 96113
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Answer: The p-value is less than 0.01, so we have enough evidence to reject the hypothesis that the mean time between diagnosis and death is the same in every state.
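The F value and p-value in the ANOVA table can be reproduced from the sums of squares and degrees of freedom. The sketch below copies those four numbers from the `summary(fit)` output. The variable names are illustrative, and because the printed sums of squares are rounded, the result matches only to the precision shown.

```r
# Values copied from the summary(fit) table above.
ss_state <- 1477348
df_state <- 3
ss_resid <- 168870047
df_resid <- 1757

# Mean square = sum of squares / degrees of freedom.
ms_state <- ss_state / df_state
ms_resid <- ss_resid / df_resid

# F statistic: between-state mean square over residual mean square.
f_value <- ms_state / ms_resid

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_value, df_state, df_resid, lower.tail = FALSE)

round(c(F = f_value, p = p_value), 5)
```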
Answer: We assume independence, normality and equal variances. The variances appear equal; however, in the Q-Q plot there might be evidence that the residuals are not normal, so we should be very careful making conclusions. Transforming the data so that it is not right-skewed might help.
plot(fit)
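The transformation suggested above could look something like the following. This is only a sketch on simulated right-skewed data (`sim` is made up for illustration and is not the HIV data), and the choice of a log transform is an assumption; other transformations may suit the real data better.

```r
set.seed(1)  # reproducible simulated data, NOT the HIV data

# Four groups of positive, right-skewed "survival times".
sim <- data.frame(
  state = factor(rep(c("NSW", "Other", "QLD", "VIC"), each = 50)),
  S     = rlnorm(200, meanlog = 5, sdlog = 1)
)

# Refit the one-way ANOVA on the log scale.
# log1p(x) = log(1 + x), which stays defined if any time is zero.
fit_log <- aov(log1p(S) ~ state, data = sim)
summary(fit_log)

# Normal Q-Q plot of the residuals on the transformed scale.
qqnorm(residuals(fit_log))
qqline(residuals(fit_log))
```

If the residuals still look non-normal after transforming, a rank-based alternative such as `kruskal.test(S ~ state, data = sim)` avoids the normality assumption altogether.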
Answer: We could perform a Tukey HSD test. We would conclude that the significant differences in mean are between Victoria and NSW (adjusted p = 0.002) and between Victoria and QLD (adjusted p = 0.011).
TukeyHSD(aov(S~HIVD$state))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = S ~ HIVD$state)
##
## $`HIVD$state`
## diff lwr upr p adj
## Other-NSW -1.914256 -72.94390 69.11538 0.9998806
## QLD-NSW -26.694614 -96.43529 43.04606 0.7584450
## VIC-NSW 66.875885 18.29833 115.45344 0.0023175
## QLD-Other -24.780358 -118.42858 68.86786 0.9045544
## VIC-Other 68.790141 -10.36797 147.94825 0.1143435
## VIC-QLD 93.570499 15.56692 171.57408 0.0111243
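To pick out the significant comparisons without reading the table by eye, the rows can be filtered on the adjusted p-value. The sketch below copies the Tukey output above into a data frame (`tukey` is an illustrative name) so that it runs on its own. With the fitted model available, the same matrix can be taken from `TukeyHSD(fit)[[1]]`.

```r
# Values copied from the TukeyHSD output above.
tukey <- data.frame(
  pair  = c("Other-NSW", "QLD-NSW", "VIC-NSW", "QLD-Other", "VIC-Other", "VIC-QLD"),
  diff  = c(-1.914256, -26.694614, 66.875885, -24.780358, 68.790141, 93.570499),
  lwr   = c(-72.94390, -96.43529, 18.29833, -118.42858, -10.36797, 15.56692),
  upr   = c(69.11538, 43.04606, 115.45344, 68.86786, 147.94825, 171.57408),
  p_adj = c(0.9998806, 0.7584450, 0.0023175, 0.9045544, 0.1143435, 0.0111243),
  stringsAsFactors = FALSE
)

# Comparisons significant at the 5% family-wise level.
sig <- tukey[tukey$p_adj < 0.05, ]
sig$pair
# "VIC-NSW" "VIC-QLD"

# Equivalent check: the 95% interval does not contain zero.
excludes_zero <- tukey$lwr > 0 | tukey$upr < 0
```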