This assignment assesses the following learning outcomes:
- Apply and justify a range of statistical techniques to the extraction of business information from data.
- Critically evaluate the validity of the techniques and models employed with respect to the relevant data and also to their intended use.
- Interpret the intelligence provided in a practical business setting.
- Effectively communicate the relevant methodology and its results to a decision maker.
Section 1.0: Using sources from the literature, provide a critical evaluation of data mining techniques in business analytics [20 Marks]
To answer Section 1 effectively you should make reference to published books, journals and blogs. You can also use industry white papers from renowned sources such as Gartner, PwC, MarketWatch, IBM, ACNielsen etc.
You are required to answer either Option 1A OR Option 1B but NOT both. However, you must answer ALL questions from Section 2.
Option 1A: Review of a good Data Mining Book or Industry Publication
Below are some example topics. Do note that this is not an exhaustive list:
- What methodologies are used in data mining projects? (“SEMMA, CRISP-DM, Method A and Virtuous Circle” is a good starting point for a search.) What does the data mining industry use?
- Industry-scale data mining probably uses many software tools to help identify patterns and support knowledge discovery. What are the main tools? Which processes are most often covered by tools? How do these tools compare with the SAS Enterprise Miner used in this module?
- Data mining has several advantages when used in particular industries. However, there are also limitations associated with data mining. Critically examine the pros and cons of data mining in different industries in greater detail.
Option 1B: Review of a good Data Mining Research Publication
Find an example in the academic literature (i.e. an article or paper from a reputable academic journal or conference, such as those published by ACM, IEEE, Elsevier or Springer, or available through ScienceDirect) where data mining has been used. You should select an article where clustering, a decision tree or another data mining technique has been used. Discuss your chosen article. Briefly describe the situation in which the technique was applied, what was discovered and whether (with reasons) you think data mining was used effectively.
The total writing for Section 1.0 should be around 1000 words. Words beyond 1100 will not be read. Referencing style must follow the APA style: http://libguides.shu.ac.uk/referencing
Section 2.0: Apply practical Data Mining tools in a real problem context [25 Marks]
For this part, you are required to analyse a data set taken from the data mining competition held prior to the third international conference on Principles and Practice of Knowledge Discovery in Databases (PKDD). This conference was held in Prague in 1999. One of the challenges given for the competition was a set of datasets concerning financial transactions and details for customers at a Czech bank. The ERD of the database is shown in Figure 1.1 below.
Details of the Query and the Resulting Data
We wish to build a model of customers for the bank in order to gain some insight into the patterns that exist in the customer groups. Several queries have been developed to produce a final query, QueryR, which is described in the appendix. There were 4500 records for these customers. For each customer, different types of credits and withdrawals take place; these are categorised as follows:
Credits (paying money into your account): Cash; Bank collect; Other
Withdrawals (taking money out of your account): Cash; Bank remittance; Card
From the transactional table, it is possible to calculate the number of each type of transaction or the total value of each type of transaction. From these, the average value of each type of transaction has been calculated by dividing the total value of transactions by the number of transactions. These fields have the prefix a (e.g. acredit). The resulting final table was produced in the Access database and is called QueryR. It also contains other background information such as: age, sex, whether there is a second account holder (second), whether the client has a loan (loan) and the frequency of the issuance of statements (frequency).
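The averaging step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows, the transaction type labels and the function name `average_values` are hypothetical and are not taken from the bank database or from QueryR.

```python
from collections import defaultdict

# Hypothetical sample rows: (customer_id, transaction_type, amount).
# The real transactional table has its own column names and codes.
transactions = [
    (1, "credit_cash", 200.0),
    (1, "credit_cash", 400.0),
    (1, "withdrawal_card", 50.0),
    (2, "credit_cash", 1000.0),
]


def average_values(rows):
    """Return the average transaction value per (customer, type) pair."""
    totals = defaultdict(float)  # total value of each type of transaction
    counts = defaultdict(int)    # number of each type of transaction
    for customer, kind, amount in rows:
        totals[(customer, kind)] += amount
        counts[(customer, kind)] += 1
    # Average = total value divided by number of transactions.
    return {key: totals[key] / counts[key] for key in totals}


averages = average_values(transactions)
print(averages[(1, "credit_cash")])
```

Note that a customer with no transactions of a given type has no entry in the result, because the average would require dividing by zero. How such cases are represented in the assignment data is not shown here.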
The bank wishes to see if different customers have similar financial profiles and has therefore asked that the QueryR data be clustered. They are looking for about eight clusters. Since cluster analysis requires the use of fields that are as symmetrical as possible, each field in the QueryR data is investigated. This resulted in the plots and the table of suitable summary measures given in Figure 2.2 and Table 2.1 shown above.
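The symmetry of a field can also be checked numerically with a skewness coefficient. The sketch below computes the simple moment-based coefficient (third central moment divided by the variance raised to the power 1.5) on two made-up lists. The data values and the function name `skewness` are illustrative assumptions. Statistical packages often report a bias-adjusted version of this measure, so values computed this way may differ slightly from those in Table 2.1.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (population form, no bias adjustment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment (variance)
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Made-up examples, not taken from the QueryR data.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 20]

print(skewness(symmetric))
print(skewness(right_skewed))
```

A value near zero indicates a roughly symmetrical field, a positive value indicates a long right tail and a negative value indicates a long left tail.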
1. Using only the plots shown in Figure 2.2, discuss the variables in the dataset. What other features do you notice? (Hint: You may want to group the plots into the different variable types, i.e. interval and binary/nominal. For interval variables, you can further discuss them based on their shapes.) [7 Marks]
2. Discuss the summary measures (min, max, mean, range and standard deviation (Std. Dev.)) in Table 2.1 to explain any other features of the data. Do any of these measures help in understanding some of the features you discussed in question 1? If so, explain how. [4 Marks]
3. Use the Skewness shown in Table 2.1 to further investigate the shape of the distribution of each field. Do your results confirm what you found in question 1? [5 Marks]
4. A final cluster solution is produced. Some of the results are shown in Figure 2.6 and Figure 2.7. Using these results, answer the following:
a) How many clusters have been fitted? Which cluster has the most observations (customers) in it? Which has the least? [2 Marks]
b) Which cluster has customers that are most similar (consistent)? Explain what evidence there is for this. [2 Marks]