Student information

Name: Ellis Patrick

SID: 12345678

Instructions

Question 1 - HIV

Dr P. J. Solomon and the Australian National Centre in HIV Epidemiology and Clinical Research performed a study on HIV transmission and survival. They recorded the following characteristics of all patients who entered an HIV clinic over the course of the study.

Variable  Description
state     Grouped state of origin: “NSW” includes ACT and “Other” is WA, SA, NT and TAS.
sex       Sex of patient.
diag      (Julian) date of diagnosis.
death     (Julian) date of death or end of observation.
status    “A” (alive) or “D” (dead) at end of observation.
T.categ   Reported transmission category.
age       Age (years) at diagnosis.

Load data

To load the data use the link http://www.maths.usyd.edu.au/u/ellisp/AMED3002/HIV.csv as follows:

HIV = read.csv('http://www.maths.usyd.edu.au/u/ellisp/AMED3002/HIV.csv')

Data properties

  1. How many variables and observations are in the dataset HIV?

Answer: There are 7 variables and 2843 observations.

head(HIV)
##   state sex  diag death status T.categ age
## 1   NSW   M 10905 11081      D      hs  35
## 2   NSW   M 11029 11096      D      hs  53
## 3   NSW   M  9551  9983      D      hs  42
## 4   NSW   M  9577  9654      D    haem  44
## 5   NSW   M 10015 10290      D      hs  39
## 6   NSW   M  9971 10344      D      hs  36
dim(HIV)
## [1] 2843    7
  1. Comment on the class of these variables and how they are stored in R.

Answer: Each variable is stored in R with a class that matches its type.

  • state is categorical stored as a factor
  • sex is categorical stored as a factor
  • diag is numeric stored as an integer
  • death is numeric stored as an integer
  • status is categorical stored as a factor
  • T.categ is categorical stored as a factor
  • age is numeric stored as an integer
sapply(HIV,class)
##     state       sex      diag     death    status   T.categ       age 
##  "factor"  "factor" "integer" "integer"  "factor"  "factor" "integer"
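
One caveat when reproducing the classes shown above: from R 4.0.0 onward, read.csv() leaves text columns as character unless told otherwise, so the factor classes assume either an older version of R or an explicit stringsAsFactors = TRUE. A minimal sketch, using the first three rows printed by head(HIV) as an inline stand-in for the real file (the names csv_text and demo are only for illustration):

```r
# Inline stand-in for HIV.csv: the first three rows shown by head(HIV).
csv_text = "state,sex,diag,death,status,T.categ,age
NSW,M,10905,11081,D,hs,35
NSW,M,11029,11096,D,hs,53
NSW,M,9551,9983,D,hs,42"

# stringsAsFactors = TRUE converts the text columns to factors,
# which matches the classes reported in the worksheet output.
demo = read.csv(text = csv_text, stringsAsFactors = TRUE)
sapply(demo, class)
```

The same argument can be passed when reading from the URL.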
  1. Is there any missing data in this dataset?

Answer: No.

sum(is.na(HIV))
## [1] 0
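
sum(is.na(HIV)) returns a single total for the whole data frame. Had it been non-zero, a per-column count would show where the gaps are. A small sketch on a few columns of the six rows printed by head(HIV), with one value blanked out on purpose purely for illustration (the name HIV6 is hypothetical, and the real data has no missing values):

```r
# Three columns of the six rows shown by head(HIV).
HIV6 = data.frame(
  diag  = c(10905, 11029, 9551, 9577, 10015, 9971),
  death = c(11081, 11096, 9983, 9654, 10290, 10344),
  age   = c(35, 53, 42, 44, 39, 36)
)

# Deliberately introduced gap for the demonstration only.
HIV6$age[2] = NA

# Number of missing values in each column.
na_by_col = colSums(is.na(HIV6))
na_by_col

# Number of rows with no missing values at all.
n_complete = sum(complete.cases(HIV6))
n_complete
```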

State vs sex

The researchers are interested in the characteristics of the people being diagnosed in each state. Specifically, they would like to know whether the numbers of men and women diagnosed differed between states.

  1. What types of variables should sex and state be?

Answer: Sex and state should both be categorical.

  1. What is an appropriate statistical test that could be used by the researchers to test this question?

Answer: A chi-square test.

  1. What is the corresponding null and alternate hypothesis?

Answer: The null hypothesis is that there is no relationship between sex and state; the alternative hypothesis is that there is a relationship.

  1. Construct a contingency table using the variables sex and state and comment on any striking features.

Answer: There are far more men than women in the dataset, and NSW has the most subjects.

table(HIV$state,HIV$sex)
##        
##            F    M
##   NSW     54 1726
##   Other   13  236
##   QLD      9  217
##   VIC     13  575
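
Because the states differ so much in size, row proportions can be easier to compare than raw counts. A minimal sketch that rebuilds the table from the counts printed above, so it runs without the data file (with HIV loaded, table(HIV$state, HIV$sex) could be used in place of tab):

```r
# Counts copied from the contingency table above (filled column by column).
tab = matrix(c(54, 13, 9, 13, 1726, 236, 217, 575), nrow = 4,
             dimnames = list(state = c("NSW", "Other", "QLD", "VIC"),
                             sex = c("F", "M")))

# margin = 1 gives proportions within each state (each row sums to 1).
row_props = prop.table(tab, margin = 1)
round(100 * row_props, 1)
```

On these counts the share of women ranges from roughly 2% in VIC to roughly 5% in Other.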
  1. Perform the appropriate test.
chisq.test(table(HIV$state,HIV$sex))
## 
##  Pearson's Chi-squared test
## 
## data:  table(HIV$state, HIV$sex)
## X-squared = 5.8235, df = 3, p-value = 0.1205
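
To see where the statistic comes from, it can be recomputed by hand from the counts. This is only a sketch of the standard Pearson calculation, using the table values printed earlier rather than the data file; the variable names are illustrative.

```r
# Observed counts copied from the contingency table (filled column by column).
tab = matrix(c(54, 13, 9, 13, 1726, 236, 217, 575), nrow = 4,
             dimnames = list(state = c("NSW", "Other", "QLD", "VIC"),
                             sex = c("F", "M")))

# Expected count under independence: row total * column total / grand total.
expected = outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson statistic: sum over cells of (observed - expected)^2 / expected.
x2 = sum((tab - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (nrow(tab) - 1) * (ncol(tab) - 1)

# Upper-tail probability of the chi-squared distribution.
p = pchisq(x2, df = df, lower.tail = FALSE)
c(X.squared = x2, df = df, p.value = p)
```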
  1. Using a significance threshold of 0.05, what would you conclude from this test?

Answer: As the p-value (0.1205) is greater than 0.05, there is not enough evidence to reject the null hypothesis that there is no relationship between sex and state.

  1. What were the assumptions for this test? Comment on them in the context of the observed data.

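Answer: We assume that all the observations are independent and that the expected count for each cell is greater than five. The expected counts are all above five (the smallest is about 7.07, for females in QLD), so that condition is met. Independence cannot be checked from the table itself and depends on how the patients were recorded.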

chisq.test(table(HIV$state,HIV$sex))$expected
##        
##                 F         M
##   NSW   55.722828 1724.2772
##   Other  7.794935  241.2051
##   QLD    7.074921  218.9251
##   VIC   18.407316  569.5927
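
If some expected counts had been small, one option would be a Monte Carlo p-value, which chisq.test() supports through simulate.p.value. A hedged sketch on the counts from the table above; the seed and the number of replicates B are arbitrary choices, not part of the original analysis.

```r
# Counts copied from the contingency table (filled column by column).
tab = matrix(c(54, 13, 9, 13, 1726, 236, 217, 575), nrow = 4,
             dimnames = list(state = c("NSW", "Other", "QLD", "VIC"),
                             sex = c("F", "M")))

set.seed(1)  # arbitrary seed so the simulation is repeatable

# p-value from B simulated tables instead of the chi-squared approximation.
res_sim = chisq.test(tab, simulate.p.value = TRUE, B = 2000)

# Rule-of-thumb check: smallest expected count should exceed five.
min(res_sim$expected)
res_sim$p.value
```

Here the rule of thumb is already satisfied, so the simulated p-value serves only as a sanity check on the approximation.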

Mortality differences between states

The researchers would like to know whether outcomes for HIV patients differ between the states. They decide to test whether the time between diagnosis and death differs between states.

  1. Create a new dataset containing only patients that died using the status variable.
HIVD = HIV[HIV$status=='D',]
  1. Create a new variable for the time that patients survived by subtracting diag from death.
S = HIVD$death-HIVD$diag
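
An alternative, sketched here on the six rows printed by head(HIV), is to store the survival time as a column so that it stays aligned with state and the other variables. The names HIV6 and HIVD6 are hypothetical stand-ins for HIV and HIVD.

```r
# Three columns of the six rows shown by head(HIV).
HIV6 = data.frame(
  diag   = c(10905, 11029, 9551, 9577, 10015, 9971),
  death  = c(11081, 11096, 9983, 9654, 10290, 10344),
  status = c("D", "D", "D", "D", "D", "D")
)

# Keep only patients who died, then add survival time in days as a column.
HIVD6 = HIV6[HIV6$status == "D", ]
HIVD6$S = HIVD6$death - HIVD6$diag
HIVD6$S
```

With the column in place, a formula such as S ~ state can be paired with a data = argument instead of indexing with $.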
  1. Visualise the time to death for the patients in each state using a boxplot. Comment on any striking features.

Answer: Every state may have a few outliers. The variances are reasonably similar across states, and the data appear to be right-skewed in each. The median for Queensland appears to be the lowest and the median for Victoria the highest.

boxplot(S~HIVD$state)

  1. Use an ANOVA to test whether the time between diagnosis and death differs between states.
fit = aov(S ~ HIVD$state)
summary(fit)
##               Df    Sum Sq Mean Sq F value  Pr(>F)   
## HIVD$state     3   1477348  492449   5.124 0.00157 **
## Residuals   1757 168870047   96113                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
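
The F value and p-value in the table follow directly from the sums of squares and degrees of freedom. A short sketch using the figures printed above (the variable names are illustrative):

```r
# Values copied from the ANOVA table.
ss_between = 1477348
df_between = 3
ss_within  = 168870047
df_within  = 1757

# Mean square = sum of squares / degrees of freedom.
ms_between = ss_between / df_between
ms_within  = ss_within / df_within

# F statistic: between-group mean square over residual mean square.
f_stat = ms_between / ms_within

# Upper-tail probability of the F distribution.
p_val = pf(f_stat, df_between, df_within, lower.tail = FALSE)
c(F = f_stat, p = p_val)
```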
  1. Using a significance threshold of 0.01, what would you conclude from this test?

Answer: The p-value (0.00157) is less than 0.01, so we have enough evidence to reject the null hypothesis that the mean time between diagnosis and death is the same in every state.

  1. What were the assumptions for this test? Comment on them in the context of the observed data and fitted model.

Answer: We assume independence, normality and equal variances. The variances appear equal; however, the Q-Q plot suggests that the residuals may not be normal, so we should be careful about drawing conclusions. Transforming the data so that it is no longer right-skewed might help.

plot(fit)
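
As a sketch of the follow-up suggested in the answer, the ANOVA could be refitted on a log scale and compared with a rank-based Kruskal-Wallis test, which does not assume normal residuals. The data below are synthetic (simulated right-skewed times with arbitrary group sizes, NOT the HIV data), and log(S + 1) is an assumed choice of transform that tolerates survival times of zero.

```r
set.seed(42)  # arbitrary seed for repeatability

# Synthetic, right-skewed survival times; not the HIV data.
sim = data.frame(
  state = factor(rep(c("NSW", "Other", "QLD", "VIC"),
                     times = c(200, 50, 50, 100))),
  S = round(rexp(400, rate = 1 / 400))
)

# ANOVA on the log scale; adding 1 avoids log(0).
fit_log = aov(log(S + 1) ~ state, data = sim)
summary(fit_log)

# Rank-based alternative that does not rely on normality.
kw = kruskal.test(S ~ state, data = sim)
kw

# plot(fit_log, which = 2)  # Q-Q plot of the residuals on the log scale
```

On the real data, sim would be replaced by the data frame of patients who died, with S stored as one of its columns.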

  1. Are there any other tests that you could perform to help the researchers interpret these results? If yes, what would you inform the researchers?

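Answer: We could perform a Tukey HSD test to see which pairs of states differ. Only two intervals exclude zero: VIC-NSW (adjusted p = 0.0023) and VIC-QLD (adjusted p = 0.0111). We would inform the researchers that the differences in mean survival time appear to be driven by Victoria, where patients survived on average about 67 days longer than in NSW and about 94 days longer than in QLD.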

TukeyHSD(aov(S~HIVD$state))
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = S ~ HIVD$state)
## 
## $`HIVD$state`
##                 diff        lwr       upr     p adj
## Other-NSW  -1.914256  -72.94390  69.11538 0.9998806
## QLD-NSW   -26.694614  -96.43529  43.04606 0.7584450
## VIC-NSW    66.875885   18.29833 115.45344 0.0023175
## QLD-Other -24.780358 -118.42858  68.86786 0.9045544
## VIC-Other  68.790141  -10.36797 147.94825 0.1143435
## VIC-QLD    93.570499   15.56692 171.57408 0.0111243
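
To pick out the pairs that differ at a given threshold, the adjusted p-values can be filtered. A minimal sketch that re-enters the table above by hand, so it runs on its own; the data frame name tukey and its column names are illustrative.

```r
# Differences and adjusted p-values copied from the Tukey HSD output.
tukey = data.frame(
  pair  = c("Other-NSW", "QLD-NSW", "VIC-NSW", "QLD-Other", "VIC-Other", "VIC-QLD"),
  diff  = c(-1.914256, -26.694614, 66.875885, -24.780358, 68.790141, 93.570499),
  p_adj = c(0.9998806, 0.7584450, 0.0023175, 0.9045544, 0.1143435, 0.0111243),
  stringsAsFactors = FALSE
)

# Pairs whose adjusted p-value falls below each threshold.
sig_05 = tukey$pair[tukey$p_adj < 0.05]
sig_01 = tukey$pair[tukey$p_adj < 0.01]
sig_05
sig_01
```

With the fitted model available, the same column can be read from TukeyHSD(fit)[[1]][, "p adj"] rather than typed in.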