# R programming Assignment 2020

## School of Computing, Engineering, and Mathematics

**Question 1 (1 + 2 + 3 + 3 + 1 = 10)**

Are the distributions the same?

Table 1 contains the fatalities due to traffic accidents within Australia during 2018, expressed as percentages by state. Your task is to determine whether the distribution of fatalities (across states) is the same as the distribution of population. The total number of fatalities during 2018 was 1,135.

| | NSW | VIC | QLD | SA | WA | TAS | NT | ACT |
|---|---|---|---|---|---|---|---|---|
| Fatalities % | 31 | 19 | 21 | 7 | 14 | 3 | 4 | 1 |
| Population % | 32 | 26 | 20 | 7 | 10 | 2 | 1 | 2 |

Table 1: Fatalities on Australian roads during 2018

(i)

Develop R code to load the above data within a single data frame. Include labels for the rows and columns. Also briefly describe key parts of your code.
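As a starting point, the following is a minimal sketch of one way to hold Table 1 in a single labelled data frame. The object names `states` and `road` are illustrative choices, not names required by the question.

```r
# State and territory abbreviations become the column labels.
states <- c("NSW", "VIC", "QLD", "SA", "WA", "TAS", "NT", "ACT")

# One row per measure; values copied from Table 1.
road <- data.frame(
  rbind(c(31, 19, 21, 7, 14, 3, 4, 1),
        c(32, 26, 20, 7, 10, 2, 1, 2)),
  row.names = c("Fatalities %", "Population %")
)
colnames(road) <- states

print(road)
```

Storing the states as columns mirrors the layout of Table 1; a long format with one row per state would work equally well and is often easier to plot.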

(ii)

Produce two different bar plots showing how the distributions of fatalities and population vary. The two plots should differ in a way that highlights key insights. Make the two plots worthy of inclusion in a business report or research paper. Briefly describe what each plot shows and make a prediction whether a statistically significant difference exists. Briefly state which plot you consider most useful and why.
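One possible pair of plots is sketched below using base R graphics: a grouped bar plot of both distributions, and a plot of the gap between them. The colours, titles, and the variable name `gap` are illustrative assumptions; other designs may serve the question just as well.

```r
states    <- c("NSW", "VIC", "QLD", "SA", "WA", "TAS", "NT", "ACT")
fatal_pct <- c(31, 19, 21, 7, 14, 3, 4, 1)   # Table 1, Fatalities %
pop_pct   <- c(32, 26, 20, 7, 10, 2, 1, 2)   # Table 1, Population %

# Plot 1: the two distributions side by side.
barplot(rbind(fatal_pct, pop_pct), beside = TRUE, names.arg = states,
        col = c("firebrick", "grey60"),
        legend.text = c("Fatalities", "Population"),
        xlab = "State or territory", ylab = "Share of national total (%)",
        main = "Road fatalities and population by state, 2018")

# Plot 2: the gap between the two shares (positive = over-represented).
gap <- fatal_pct - pop_pct
barplot(gap, names.arg = states,
        col = ifelse(gap > 0, "firebrick", "steelblue"),
        xlab = "State or territory",
        ylab = "Fatalities % minus population %",
        main = "Over- and under-representation in road fatalities, 2018")
abline(h = 0)
```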

(iii)

Using a simulation approach, determine whether the distributions are different. In answering this step, make sure you clearly include the following:

• The Null and Alternative Hypotheses used

• Any assumptions, or important details used

• Declare the result of the hypothesis test

• Briefly interpret the meaning of the hypothesis test result
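A minimal sketch of one simulation approach follows. It assumes the null hypothesis is that fatalities are distributed across states in proportion to population, uses a chi-squared style statistic as the measure of discrepancy, and approximates the observed counts from the rounded percentages in Table 1. The seed, the 10,000 repetitions, and the helper `chisq_stat` are arbitrary illustrative choices.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

fatal_pct <- c(31, 19, 21, 7, 14, 3, 4, 1)   # Table 1, Fatalities %
pop_pct   <- c(32, 26, 20, 7, 10, 2, 1, 2)   # Table 1, Population %
n <- 1135                                    # total fatalities in 2018

observed <- n * fatal_pct / 100   # approximate counts (percentages are rounded)
expected <- n * pop_pct / 100     # counts expected under the null hypothesis

# Discrepancy between a vector of counts and the expected counts.
chisq_stat <- function(counts) sum((counts - expected)^2 / expected)
obs_stat <- chisq_stat(observed)

# Simulate 10,000 years of 1,135 fatalities allocated by population share.
sims <- rmultinom(10000, size = n, prob = pop_pct / 100)
sim_stats <- apply(sims, 2, chisq_stat)

# Proportion of simulated statistics at least as extreme as the observed one.
p_value <- mean(sim_stats >= obs_stat)
print(c(statistic = obs_stat, p_value = p_value))
```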

(iv)

Repeat step (iii) using a statistical distribution approach.
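One candidate for the statistical distribution approach is a chi-squared goodness-of-fit test, sketched below with `chisq.test` from base R. The choice of test is an assumption here, and the counts are reconstructed by rounding the Table 1 percentages, so they may not sum exactly to 1,135.

```r
fatal_pct <- c(31, 19, 21, 7, 14, 3, 4, 1)   # Table 1, Fatalities %
pop_pct   <- c(32, 26, 20, 7, 10, 2, 1, 2)   # Table 1, Population %
n <- 1135                                    # total fatalities in 2018

# Approximate observed counts per state (rounded to whole fatalities).
observed <- round(n * fatal_pct / 100)

# Goodness-of-fit test against the population proportions.
result <- chisq.test(x = observed, p = pop_pct / 100)
print(result)
```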

(v)

Compare the results of steps (iii) and (iv). Keep your discussion brief and to the point.


**Question 2 (1 + 2 + 1 + 3 + 3 = 10)**

Is there a statistically significant difference?

The Pima people of North America have one of the highest rates of type 2 diabetes in the world. It appears that a number of social and environmental factors have contributed to the incidence of diabetes among the Pima. The data set to be used for this question is found in the file called PIMA.csv (use the version provided in vUWS), which contains information for 500 females and consists of the following four features:

• age in years

• diastolic blood pressure

• Body Mass Index (BMI)

• whether the individual has ever been pregnant

An extract of the dataset is shown below in Table 2.

| age | diastolic | bmi | ever.pregnant |
|---|---|---|---|
| 22 | 58 | 28.5 | yes |
| 38 | 82 | 33.3 | yes |
| 33 | 74 | 23.4 | yes |
| 31 | 66 | 26.6 | yes |
| 38 | 64 | 34.1 | yes |
| 23 | 62 | 24.0 | yes |

Table 2: Extract of Pima people dataset

The goal of this question is to understand the relationship between two sets of variables, defined as follows:

R1 – bmi with respect to pregnancy status

R2 – diastolic blood pressure with respect to pregnancy status

(i)

Generate code to determine the variance of R1 and R2.
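A minimal sketch of the group-wise variance calculation is shown below. Because PIMA.csv is not reproduced here, the data frame is filled with invented toy values that only mimic the column names in Table 2; with the real file, the first statement would be replaced by something like `pima <- read.csv("PIMA.csv")`.

```r
# Invented toy values, NOT taken from PIMA.csv; column names follow Table 2.
pima <- data.frame(
  bmi           = c(25.0, 27.5, 30.0, 32.5, 24.0, 29.0, 31.0, 36.0),
  diastolic     = c(60, 64, 70, 78, 58, 66, 72, 80),
  ever.pregnant = rep(c("yes", "no"), each = 4)
)

# R1: variance of bmi within each pregnancy status.
var_r1 <- tapply(pima$bmi, pima$ever.pregnant, var)

# R2: variance of diastolic blood pressure within each pregnancy status.
var_r2 <- tapply(pima$diastolic, pima$ever.pregnant, var)

print(var_r1)
print(var_r2)
```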

(ii)

Produce box plots for R1 and R2 and briefly interpret them. Make both plots worthy of inclusion in a business report or research paper.

(iii)

Given what you found regarding variances in step (i), select an appropriate inferential statistical method and very briefly state the reason for that choice.


(iv)

Keeping in mind what you found regarding variances in step (i), perform a hypothesis test for R1 using your selected inferential statistical method. Make sure you clearly include the following:

• The Null and Alternative Hypotheses used

• Any assumptions, or important details / parameters used

• Declare the results of the hypothesis test

• Briefly interpret the meaning of the hypothesis test result
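As an illustration only, the sketch below runs a two-sample t-test for R1 with `t.test`. It assumes the group variances turned out to be unequal, so it uses the Welch form (`var.equal = FALSE`); if the variances were judged equal, `var.equal = TRUE` would give the pooled test instead. The data are invented toy values, not the contents of PIMA.csv.

```r
# Invented toy values, NOT taken from PIMA.csv; column names follow Table 2.
pima <- data.frame(
  bmi           = c(25.0, 27.5, 30.0, 32.5, 24.0, 29.0, 31.0, 36.0),
  ever.pregnant = rep(c("yes", "no"), each = 4)
)

# H0: mean bmi is the same for both pregnancy statuses.
# H1: mean bmi differs between the two groups (two-sided).
result <- t.test(bmi ~ ever.pregnant, data = pima,
                 var.equal = FALSE,      # Welch test: unequal variances assumed
                 conf.level = 0.95)      # assumed significance level of 5%
print(result)
```

The same call with `diastolic` in place of `bmi` covers R2, provided the column is present in the data frame.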

(v)

Repeat step (iv) using R2.


**Question 3 (3 + 4 + 3 = 10)**

Predicting demand

A particular help desk always has fifteen operators on duty. On average, only fourteen operators are simultaneously busy helping customers.

(i)

What is the probability that all operators will be simultaneously busy? Briefly explain the approach used.

(ii)

What is the probability that one or more callers will have to wait for an operator to become available? Briefly explain the approach used.

(iii)

Draw a nicely presented plot showing operator demand for zero to twenty operators. Make this plot appropriate for inclusion in a report to management, for purposes of determining whether more operators are needed.
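A possible base R plot is sketched below, again under the assumption that demand is Poisson distributed with mean 14. Bars beyond the fifteen operators on duty are coloured differently so that the unmet demand stands out; the colours and labels are illustrative.

```r
lambda    <- 14     # average number of busy operators (from the question)
operators <- 15     # operators on duty (from the question)
demand    <- 0:20   # range of demand requested in the question

prob <- dpois(demand, lambda)   # assumes Poisson-distributed demand

barplot(prob, names.arg = demand,
        col = ifelse(demand > operators, "firebrick", "steelblue"),
        xlab = "Operators required simultaneously",
        ylab = "Probability",
        main = "Help desk operator demand")
legend("topleft", fill = c("steelblue", "firebrick"),
       legend = c("Within capacity (15 or fewer)", "Exceeds capacity"))
```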

**Question 4 (3 + 2 + 3 + 2 = 10)**

Interval estimation

Using a 90% confidence interval, what are the fewest and largest numbers of heads expected in general if a fair coin is tossed 25 times?

(i)

Use a simulation method to answer this question. Briefly explain the key steps involved.
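A minimal simulation sketch follows. It repeats the 25-toss experiment many times with `rbinom` and takes the 5th and 95th percentiles of the simulated head counts as the interval boundaries. The seed and the 10,000 repetitions are arbitrary choices.

```r
set.seed(2020)       # arbitrary seed, for reproducibility only
n_tosses <- 25       # tosses per experiment (from the question)
n_sims   <- 10000    # assumed number of repeated experiments

# Number of heads in each simulated experiment with a fair coin.
heads <- rbinom(n_sims, size = n_tosses, prob = 0.5)

# Central 90% of the simulated counts; type = 1 returns observed values
# without interpolating between neighbouring counts.
ci <- quantile(heads, probs = c(0.05, 0.95), type = 1)
print(ci)
```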

(ii)

Produce a plot showing the resulting distribution from step (i), and clearly show the boundaries of the confidence interval. Make the plot worthy of inclusion in a report or research paper.

(iii)

Now use an approximation method to answer this question. Briefly explain the key steps involved.
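One common approximation is the normal approximation to the binomial, sketched below; treating this as the intended method is an assumption. The number of heads has mean $np$ and standard deviation $\sqrt{np(1-p)}$, and the central 90% interval uses the 95th percentile of the standard normal distribution.

```r
n <- 25     # number of tosses (from the question)
p <- 0.5    # probability of heads for a fair coin

mu    <- n * p                   # mean number of heads: 12.5
sigma <- sqrt(n * p * (1 - p))   # standard deviation: 2.5

z <- qnorm(0.95)                 # leaves 5% in each tail for a 90% interval
ci_normal <- c(lower = mu - z * sigma, upper = mu + z * sigma)
print(ci_normal)
```

Because head counts are whole numbers, the boundaries would still need to be rounded or otherwise interpreted before being reported.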

(iv)

Produce a plot showing the statistical distribution used in step (iii), and clearly show the boundaries of the confidence interval. Make the plot worthy of inclusion in a report or research paper.