# R programming Assignment 2020

## School of Computing, Engineering, and Mathematics

Question 1 (1 + 2 + 3 + 3 + 1 = 10)
Are the distributions the same?
Table 1 contains the fatalities due to traffic accidents within Australia during 2018. Your task is to
determine whether the distribution of fatalities (across states) is the same as the distribution of
population. The total number of fatalities during 2018 was 1,135.

| | NSW | VIC | QLD | SA | WA | TAS | NT | ACT |
|---|---|---|---|---|---|---|---|---|
| Fatalities % | 31 | 19 | 21 | 7 | 14 | 3 | 4 | 1 |
| Population % | 32 | 26 | 20 | 7 | 10 | 2 | 1 | 2 |

Table 1: Fatalities on Australian roads during 2018

(i)
Develop R code to load the above data within a single data frame. Include labels for the rows and
columns. Also briefly describe key parts of your code.
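As a starting point, one possible way to hold Table 1 in a single data frame is sketched below. The object names (`states`, `road_data`) are illustrative, and the layout (one row per state, one column per measure) is an assumption chosen because it tends to be convenient for plotting; the transposed layout of Table 1 would be equally valid.

```r
# Percentages copied from Table 1; object names are illustrative only.
states <- c("NSW", "VIC", "QLD", "SA", "WA", "TAS", "NT", "ACT")

road_data <- data.frame(
  Fatalities = c(31, 19, 21, 7, 14, 3, 4, 1),   # % of 2018 road fatalities
  Population = c(32, 26, 20, 7, 10, 2, 1, 2),   # % of population
  row.names  = states                           # row labels are the states
)

# Sanity check: each column is a percentage distribution and should total 100.
colSums(road_data)
```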
(ii)
Produce two different bar plots showing how the distributions of fatalities and population vary. The
two plots should differ in the key insights they highlight. Make the two plots worthy
of inclusion in a business report or research paper. Briefly describe what each plot shows and
predict whether a statistically significant difference exists. State briefly which plot you
consider most useful, and why.
(iii)
Using a simulation approach, determine whether the distributions are different. In answering this
step, make sure you clearly include the following:
• The Null and Alternative Hypotheses used
• Any assumptions, or important details used
• Declare the result of the hypothesis test
• Briefly interpret the meaning of the hypothesis test result
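One way the simulation could be set up is sketched below, under stated assumptions rather than as the required method: the null hypothesis is taken to be that fatalities follow the population shares, approximate counts are recovered from the rounded percentages in Table 1, and a chi-squared-style discrepancy is used as the test statistic. The seed, the number of simulations, and all object names are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

fatal_pct <- c(NSW = 31, VIC = 19, QLD = 21, SA = 7, WA = 14, TAS = 3, NT = 4, ACT = 1)
pop_pct   <- c(NSW = 32, VIC = 26, QLD = 20, SA = 7, WA = 10, TAS = 2, NT = 1, ACT = 2)
n_fatal   <- 1135

# Assumption: counts are approximated from rounded percentages.
observed <- fatal_pct / 100 * n_fatal
expected <- pop_pct / 100 * n_fatal

# Chi-squared-style discrepancy between observed and expected counts.
chi_stat <- function(o, e) sum((o - e)^2 / e)
obs_stat <- chi_stat(observed, expected)

# Simulate fatalities under H0: each fatality falls in a state with
# probability equal to that state's population share.
n_sims    <- 10000
sims      <- rmultinom(n_sims, size = n_fatal, prob = pop_pct / 100)
sim_stats <- apply(sims, 2, chi_stat, e = expected)

# Simulated p-value: share of simulated statistics at least as extreme
# as the observed one.
p_value <- mean(sim_stats >= obs_stat)
p_value
```

A small `p_value` would count as evidence against the null hypothesis at the chosen significance level.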
(iv)
Repeat step (iii) using a statistical distribution approach.
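A sketch of the distribution-based counterpart, assuming a chi-squared goodness-of-fit test is the intended approach, could use base R's `chisq.test` with the population shares as the hypothesised probabilities. The counts are again approximated by rounding the percentages from Table 1, which is an assumption and not part of the supplied data.

```r
fatal_pct <- c(NSW = 31, VIC = 19, QLD = 21, SA = 7, WA = 14, TAS = 3, NT = 4, ACT = 1)
pop_pct   <- c(NSW = 32, VIC = 26, QLD = 20, SA = 7, WA = 10, TAS = 2, NT = 1, ACT = 2)
n_fatal   <- 1135

# Assumption: approximate whole-number counts recovered from rounded percentages.
counts <- round(fatal_pct / 100 * n_fatal)

# Goodness-of-fit test: do the fatality counts follow the population shares?
gof <- chisq.test(x = counts, p = pop_pct / 100)

gof$statistic   # chi-squared statistic
gof$parameter   # degrees of freedom (number of states minus one)
gof$p.value     # p-value from the chi-squared distribution
```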
(v)
Compare the results of steps (iii) and (iv) and discuss them briefly, keeping your answer to the point.

Question 2 (1 + 2 + 1 + 3 + 3 = 10)
Is there a statistically significant difference?
The Pima people of North America have one of the highest rates of type 2 diabetes in the world.
It appears that a number of social and environmental factors have contributed to the incidence of
diabetes among the Pima. The data set to be used for this question is found in the file PIMA.csv
(use the version provided in vUWS), which contains information for 500 females and consists of the
following four features:
• age in years
• diastolic blood pressure
• Body Mass Index (BMI)
• whether the individual has ever been pregnant
An extract of the dataset is shown below in Table 2.

| age | diastolic | bmi | ever.pregnant |
|---|---|---|---|
| 22 | 58 | 28.5 | yes |
| 38 | 82 | 33.3 | yes |
| 33 | 74 | 23.4 | yes |
| 31 | 66 | 26.6 | yes |
| 38 | 64 | 34.1 | yes |
| 23 | 62 | 24.0 | yes |

Table 2: Extract of Pima people dataset

The goal of this question is to understand the relationship between two sets of variables, defined
as follows:
R1 – bmi with respect to pregnancy status
R2 – diastolic blood pressure with respect to pregnancy status
(i)
Generate code to determine the variance of R1 and R2.
(ii)
Produce box plots for R1 and R2 and briefly interpret. Make both plots worthy of inclusion in a
report or research paper.
(iii)
Given what you found regarding variances in step (i), select an appropriate inferential statistical
method and very briefly state the reason for that choice.
(iv)
Keeping in mind what you found regarding variances in step (i), perform a hypothesis test for R1
using your selected inferential statistical method. Make sure you clearly include the following:
• The Null and Alternative Hypotheses used
• Any assumptions, or important details / parameters used
• Declare the results of the hypothesis test
• Briefly interpret the meaning of the hypothesis test result
(v)
Repeat step (iv) using R2.

Question 3 (3 + 4 + 3 = 10)
Predicting demand
A particular help desk always has fifteen operators on duty. On average, only fourteen operators
are simultaneously busy helping customers.
(i)
What is the probability that all operators will be simultaneously busy? Briefly explain the approach
used.
(ii)
What is the probability that one or more callers will have to wait for an operator to become
available? Briefly explain the approach used.
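One possible reading of parts (i) and (ii) is sketched below. It assumes, as a modelling choice not stated in the question, that operator demand follows a Poisson distribution with a mean of 14, that "all operators busy" means demand of exactly 15, and that callers wait only when demand exceeds 15. Other readings are defensible and would change the calculation.

```r
lambda    <- 14   # average number of simultaneously busy operators
operators <- 15   # operators on duty

# (i) Probability that demand is exactly 15, i.e. every operator is busy
#     and nobody is waiting (assumed Poisson demand).
p_all_busy <- dpois(operators, lambda)

# (ii) Probability that demand exceeds 15, i.e. at least one caller waits.
p_wait <- ppois(operators, lambda, lower.tail = FALSE)

c(all_busy = p_all_busy, someone_waits = p_wait)
```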
(iii)
Draw a nicely presented plot showing operator demand for zero to twenty operators. Make this plot
appropriate for inclusion in a report to management, for purposes of determining whether more
operators are needed.

Question 4 (3 + 2 + 3 + 2 = 10)
Interval estimation
Using a 90% confidence interval, what are the smallest and largest numbers of heads expected, in
general, when a fair coin is tossed 25 times?
(i)
Use a simulation method to answer this question. Briefly explain the key steps involved.
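A minimal sketch of a simulation for this step might look like the following. It assumes the interval is taken as the central 90% of the simulated head counts (the 5th and 95th percentiles); the seed, the number of repetitions, and the object names are arbitrary choices.

```r
set.seed(1)        # arbitrary seed, for reproducibility only

n_tosses <- 25     # tosses per experiment
n_sims   <- 100000 # number of simulated experiments (arbitrary)

# Each simulated experiment: number of heads in 25 tosses of a fair coin.
heads <- rbinom(n_sims, size = n_tosses, prob = 0.5)

# Central 90% interval: cut 5% from each tail of the simulated counts.
interval <- quantile(heads, probs = c(0.05, 0.95))
interval
```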
(ii)
Produce a plot showing the resulting distribution from step (i), and clearly show the boundaries of
the confidence interval. Make the plot worthy of inclusion in a report or research paper.
(iii)
Now use an approximation method to answer this question. Briefly explain the key steps involved.
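One candidate approximation, assumed here and not prescribed by the question, is the normal approximation to the binomial, with mean $np$ and standard deviation $\sqrt{np(1-p)}$. The sketch below computes the central 90% interval on that basis; the object names are illustrative.

```r
n <- 25    # number of tosses
p <- 0.5   # probability of heads for a fair coin

mu    <- n * p                   # mean number of heads
sigma <- sqrt(n * p * (1 - p))   # standard deviation of the number of heads

# z-value that leaves 5% in each tail of the standard normal distribution.
z <- qnorm(0.95)

# Approximate central 90% interval for the number of heads.
approx_interval <- mu + c(-1, 1) * z * sigma
approx_interval
```

Because the number of heads is a whole number, the bounds would still need to be rounded, and whether to round inwards or outwards is a choice worth stating explicitly.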
(iv)
Produce a plot showing the statistical distribution used in step (iii), and clearly show the boundaries
of the confidence interval. Make the plot worthy of inclusion in a report or research paper.