MAST90044 Thinking and Reasoning with Data
Semester 1 2020
Assignment 2
Due: 9 am, Monday 18 May
• Assignments are to be submitted (uploaded) via Canvas.
• Your assignment should show all relevant and brief working and reasoning, as marks will be given for
method as well as for correct answers. Please spell check your document.
• Paste any relevant and brief R code and output into the appropriate places so that it can be seen easily
along with your other work. Graphics from R can be resized within your document; make them smaller
as necessary.
• Assignments count for 50% of the assessment in this subject. This one is worth 15%, and covers the
work done in chapters 4 to 6.
• The number of marks given for each question may be fine-tuned. The total number of marks for this
assignment is 45.
• Tutors will not help you directly with assignment questions. However, they may give some help with
R.
• Solutions to the assignment questions will be made available later.
• When constructing a panel of graphs with multiple plots, it is good to use the R command
par(mfrow = c(nrows,ncols)) where nrows is the number of rows and ncols the number of columns
in the panel. The default is c(1, 1).
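As a minimal sketch of the par(mfrow = ...) usage described above, the following places two plots side by side. The data are simulated purely for illustration and are not taken from any assignment question; the variable names x, y and old_par are hypothetical.

```r
# Illustrative data only (simulated); not from any assignment question.
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)

# Request a panel with 1 row and 2 columns.
# par() invisibly returns the previous settings, which are saved here.
old_par <- par(mfrow = c(1, 2))

hist(x, main = "Histogram of x", xlab = "x")
plot(x, y, main = "y against x", xlab = "x", ylab = "y")

# Restore the previous layout so later plots are not affected.
par(old_par)
```

Saving and restoring the old settings is optional, but it avoids later plots being drawn into an unintended panel layout.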
Q1 The following table of frequencies shows age at first pregnancy by incidence of cervical cancer diagnosed
in women aged 50–59. Reference: Graham S and Shotz W (1979), Epidemiology of cancer of the cervix
in Buffalo, J National Cancer Inst 63(1):23–27.
                          Control   Cervical cancer
Age at first     ≤ 25       203           42
pregnancy        > 25       114            7
(a) Enter the data into R and perform a chi-squared test of the association between age at first
pregnancy and incidence of cervical cancer. Is this test justified here? Briefly explain.
(b) Perform a test of the association using Fisher’s exact test and compare your conclusion here to
that from part (a). Explain briefly when the Fisher test would be preferred to the chi-squared test
of association.
[6 + 6 = 12 marks]
Q2 Ophthalmologists from Victoria and Western Australia have surveyed children in the Western Desert
in Western Australia to assess the prevalence and severity of trachoma. The data below come from two
years of a longitudinal survey. There are six stages of trachoma, of increasing severity. In this study,
children were observed to have trachoma up to the fourth stage. The data below show the stages of
trachoma including an additional level: those with no signs of trachoma.
Stage                             1993   2003
None                               124    264
Stage 1: Follicular                 88     46
Stage 2: Intense inflammatory        7      3
Stage 3: Trachomatous scarring       0      2
Stage 4: Trichiasis                  2      0
(a) Perform a suitable test to examine the association between severity of trachoma and year of survey.
What is your conclusion?
(b) Assess the validity of a politician’s claim that the prevalence (widespread presence) in 2003 was
20%.
[4 + 4 = 8 marks]
Q3 An investigator wished to determine whether epinephrine has the effect of elevating plasma cholesterol
levels in humans. Twelve adult males were selected and given both a placebo and the drug. Blood
samples were taken following injection of the placebo and again after injection of epinephrine. Analysis
of the blood samples resulted in the following data:
Cholesterol Levels (mg/100mL)
subject placebo epinephrine
1 178 184
2 240 243
3 210 210
4 184 189
5 190 200
6 181 191
7 156 150
8 220 226
9 210 220
10 165 163
11 188 192
12 214 216
These data are also available in TRD/asst03data.csv on LMS.
(a) Formulate an appropriate statistical model, defining all the terms. State the null and two-sided
alternative hypotheses which reflect the research question of interest.
(b) Enter the data into R, and calculate the means for placebo and epinephrine.
Find a 95% confidence interval for the mean difference in cholesterol levels between the placebo
and epinephrine. Use the confidence interval to test your null hypothesis.
(c) Would a 99% confidence interval contain zero? Briefly explain.
[4 + 4 + 2 = 10 marks]
Q4 Transient hypothyroxinemia is a common finding in premature infants. It is not thought to have long-term
consequences, or to require treatment. A study was performed to investigate whether it might
have long-term effects, and to this end, blood thyroxine values were obtained on routine screening in
the first week of life for a sample of infants who weighed 2000g or less at birth and were born at 33
weeks gestation or earlier. These results will later be related to motor and cognitive development.
Our aim here is to develop a model to estimate the thyroxine level for a specified gestational age. The
data are available in (TRD/asstQ4data.csv) on LMS:
g.age thyroxine
30 8.1
28 7.2
31 9.2
…
…
(a) Read the data into R and produce an appropriate graphical summary (with meaningful labels) of
the relationship between thyroxine level and gestational age.
(b) Write down an appropriate statistical model for examining the relationship, and fit the model in
R.
(c) i. Give a non-statistical interpretation of the coefficient of g.age.
ii. Find a 95% confidence interval for this coefficient.
iii. Is thyroxine level related to gestational age? Explain.
iv. What percentage of the total variation in thyroxine level is explained by gestational age?
(d) A record of a new baby became available. Find an interval within which the thyroxine level of this
premature baby of gestational age 31 weeks is likely to lie. Use 95% confidence.
(e) Examine appropriate diagnostic plots and comment on anything that is noteworthy or that may
challenge the assumptions of the model.
[2 + 4 + 4 + 2 + 3 = 15 marks]
Total marks = 45