## Big Data

#### Learning Outcomes:

1 Explain the concept of Big Data and its importance in a modern economy

2 Explain the core architecture and algorithms underpinning big data processing

3 Analyse and visualize large data sets using a range of statistical and big data technologies

4 Critically evaluate, select and employ appropriate tools and technologies for the development of big data applications

#### Detailed Specification

You are expected to work individually and complete a report that addresses the following tasks. You need to cite all sources you rely on using in-text citations. You may include material discussed in the lectures or labs, but additional credit will be given for independent research. Note: References should be in Harvard format. The word count does NOT include references.

Part A

• Task A.1 [mark 10] Explain the main characteristics of Big Data. (Word count: 200 words ±10%)
• Task A.2 [mark 15] Compare Hadoop and Relational Database Systems. Give an application scenario that is well suited to Hadoop and explain your reasoning. (Word count: 300 words ±10%)

Part B: MapReduce Programming

Suppose that you have a large student file which cannot be stored on a single machine. Each record of this file contains the following information: (Student_ID, Student_Name, Sex, Age, Module, Grade, Department).
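To make the record layout concrete, the following minimal sketch parses one line of the file into the seven fields listed above. It assumes a comma-separated text format, which this brief does not specify; the names `StudentRecord` and `parse_record` and the sample record are invented for illustration only.

```python
from typing import NamedTuple


class StudentRecord(NamedTuple):
    """One record, with fields in the order given in the brief."""
    student_id: str
    student_name: str
    sex: str
    age: int
    module: str
    grade: float
    department: str


def parse_record(line: str) -> StudentRecord:
    """Parse one line of the file; the comma delimiter is an assumption."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    student_id, name, sex, age, module, grade, department = fields
    return StudentRecord(
        student_id, name, sex, int(age), module, float(grade), department
    )


# Hypothetical sample record, not taken from any real data set.
record = parse_record("S001, Alice Smith, F, 20, Big Data, 68.5, Computing")
print(record.module, record.grade)
```

Only some of these fields are relevant to any given task, so it is worth deciding early which ones a solution actually needs to read.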

• Task B.1 [mark 15] Design a MapReduce algorithm (pseudocode or Java code) to output the average grade for each module. The algorithm is expected to be as efficient as possible.
• Task B.2 [mark 15] Describe the algorithm you designed. You should explain how the input is mapped into (key, value) pairs by the map stage, i.e., specify what the key is and what the associated value is in each pair, and, if needed, how the key(s) and value(s) are computed. Then you should explain how the output (key, value) pairs of the map stage are processed by the reduce stage to get the final answer(s). You should also analyse the efficiency of the MapReduce algorithm you designed. (Word count: 300 words ±10%)

Part C: Big Data Project Analysis

The CropY company is a leading provider of precision agriculture services. Precision agriculture is the science of gathering, processing, and analysing temporal, spatial and individual data. It combines this data with other information to support management decisions according to estimated variability, for improved resource use efficiency, productivity, quality, and profitability.

The CropY company now plans to develop a big data project to meet the following requirements: helping worldwide users better understand the implications of the weather and make contingency plans; buying supplies, such as fertilizer and seeds; maintaining and monitoring the quality of yield, whether livestock or crops; knowing the variety of cultivated plants, the conditions of their growth and their seed needs; choosing the type of fertilizer and pesticides, and understanding their conditions of use and their impact on the climate-soil-plant system; recognizing the daily water needs of each kind of plant; calculating the median and mean values of yield; studying the conditions of the natural environment; and estimating financial revenue and managing potential risks.

• Task C.1 [mark 10]: The volume of big data is expected to be more than 500 petabytes. The data will come from various sensors, satellites, drones, social media, market data, online news feeds, etc. Figure 1 below shows some example data of the CropY company. Some IT technicians plan to build a data warehouse to store data for further data analysis tasks, but others believe a data lake is a better choice. Which choice do you prefer? Please justify your choice. (Word count: 300 words ±10%)
• Task C.2 [mark 10]: The data of the CropY company includes a large collection of plants, crops, diseases, symptoms, pests, and the relationships between them. The CropY company needs to build an analytical data store which can facilitate queries like: “find all diseases which are directly or indirectly caused by nitrogen deficiency”. Please recommend a data store and justify your choice. (Word count: 300 words ±10%)
• Task C.3 [mark 15]: Some prediction and analytics services provided by the CropY company are required to respond within a few seconds of the arrival of new data. Namely, they are real-time or near-real-time prediction and analytics tasks. Some IT managers have suggested using a popular distributed processing framework, MapReduce, to implement these tasks. Do you agree with that? Please justify your choice. (Word count: 300 words ±10%)
• Task C.4 [mark 10]: The CropY company has decided to move most of its applications and services to the cloud. These applications and services need to be highly available, scalable, and accessible worldwide. Note that some data, such as price and customer data, are confidential. Please design a cloud hosting strategy for this big data project and explain how your design will meet the security, scalability, and high availability requirements. (Word count: 300 words ±10%)
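As a clarification of the example query in Task C.2, "directly or indirectly caused by" means following cause relationships over any number of steps, not just one. The sketch below illustrates that semantics with a breadth-first traversal over a small in-memory dictionary. The edge data, the node names, and the function `caused_by` are all invented for illustration; they do not represent CropY data, and the sketch does not imply any particular data store.

```python
from collections import deque

# Hypothetical cause -> effects relationships, invented for illustration.
CAUSES = {
    "nitrogen deficiency": ["leaf chlorosis", "stunted growth"],
    "leaf chlorosis": ["reduced photosynthesis"],
    "stunted growth": [],
    "reduced photosynthesis": ["low yield"],
    "aphid infestation": ["leaf curl"],
}


def caused_by(root: str, edges: dict[str, list[str]]) -> set[str]:
    """Return every node reachable from `root` via one or more cause edges."""
    seen: set[str] = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for effect in edges.get(node, []):
            # Skip nodes already found, and the root itself if a cycle exists.
            if effect not in seen and effect != root:
                seen.add(effect)
                queue.append(effect)
    return seen


result = sorted(caused_by("nitrogen deficiency", CAUSES))
print(result)
# ['leaf chlorosis', 'low yield', 'reduced photosynthesis', 'stunted growth']
```

Here "leaf chlorosis" is a direct effect, while "low yield" is reached only through two intermediate steps; "leaf curl" is excluded because it has no path from nitrogen deficiency.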