1. Objectives

In this assignment, we will wrangle a real-life dataset to practise different data wrangling techniques.

  • To conduct data exploration, preparation and transformation through different methods
  • To prepare the data for modeling, and to build and evaluate a simple linear regression model
  • To document the analysis, comparison and findings
  • Dataset: supermarket sales forecast (regression problem)

The data (‘supermarket.csv’) were collected at various supermarket outlets and stores in different cities. The aim is to predict the sales of each product at a particular outlet. Using these predictions, the supermarket management team will try to understand which properties of products and outlets play a key role in increasing sales.

Detailed information (i.e. column description) is provided below.

[Image: column descriptions]
2. Suggested tasks

We suggest that you complete this assignment by following the steps below.

Step 0: Exploratory Data Analysis (EDA)

Download the dataset from the system and conduct exploratory data analysis using TIBCO Spotfire. Investigate the relationships between the different features/variables. Which features are likely to be helpful for making predictions?


Step 1: Load Data into Jupyter Notebook

Load the data into a DataFrame variable and provide an overview of it using the relevant functions (e.g. head(), info(), describe(), etc.).
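As a minimal sketch of this step, the snippet below loads a small in-memory CSV and prints the usual overviews. The column names and values are invented for illustration only (the real columns are those listed in the column description); in your notebook you would pass the path to ‘supermarket.csv’ instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'supermarket.csv'. The column names and values
# are invented for illustration and will differ from the real file.
SAMPLE_CSV = io.StringIO(
    "Product_ID,Product_Weight,Outlet_Type,Sales\n"
    "P01,9.3,Supermarket,3735.1\n"
    "P02,,Grocery Store,443.4\n"
    "P03,17.5,Supermarket,2097.3\n"
)

# In the notebook: df = pd.read_csv("supermarket.csv")
df = pd.read_csv(SAMPLE_CSV)

print(df.head())      # first rows of the DataFrame
df.info()             # column dtypes and non-null counts (prints directly)
print(df.describe())  # summary statistics of the numerical columns
print(df.shape)       # (number of rows, number of columns)
```

Note that info() already hints at missing values through its non-null counts, which is useful later in Step 4.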

Step 2: Data Preprocessing

Are there any outliers? How did you identify them, and how did you deal with them? Are you happy with the distribution of the numerical variables? Do you need to transform the numerical variables using a suitable transformation method (e.g. log transformation, Box-Cox, etc.)?
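One possible way to approach these questions is sketched below: the 1.5 × IQR rule to flag outliers, capping as one way to treat them, and a log transformation to reduce right skew. The values are toy numbers, not taken from the dataset, and the IQR rule is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Toy right-skewed values standing in for a numerical column (not real data).
sales = pd.Series([120.0, 150.0, 180.0, 200.0, 220.0, 260.0, 300.0, 5000.0])

# IQR rule: flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (sales < lower) | (sales > upper)
print("Outliers found:", int(is_outlier.sum()))

# Option A: cap the extreme values at the fences instead of dropping rows.
sales_capped = sales.clip(lower=lower, upper=upper)

# Option B: log transformation; log1p(x) = log(1 + x) also handles zeros.
sales_log = np.log1p(sales)

# Compare skewness before and after the transformation.
print("Skew before:", sales.skew(), "after log:", sales_log.skew())
```

Whichever treatment you choose, report why. Box-Cox (for example scipy.stats.boxcox) is an alternative to the log transformation, but it requires strictly positive values.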

Step 3: Train and Test Split

Split the data into training data (70%) and test data (30%).
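A minimal sketch of a 70/30 split with scikit-learn is shown below. The toy DataFrame and the column names "feature" and "sales" are placeholders; substitute your own predictors and target.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame; "feature" and "sales" are placeholder column names.
df = pd.DataFrame({"feature": np.arange(100), "sales": np.arange(100) * 2.0})

X = df.drop(columns="sales")  # predictors
y = df["sales"]               # target

# test_size=0.3 gives the 70/30 split; random_state makes it reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)
```

Splitting before imputation, encoding and scaling means the statistics used in those later steps can be learned from the training data only, which avoids leaking information from the test set.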

Step 4: Missing Value Imputation

Are there any missing values? How did you handle them and why?
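The sketch below illustrates one simple approach: count the missing values, then fill numerical gaps with the training median and categorical gaps with the training mode. The column names "weight" and "size" and all values are hypothetical, and median/mode imputation is only one option among several.

```python
import numpy as np
import pandas as pd

# Toy train/test frames with gaps; column names are hypothetical.
train = pd.DataFrame({
    "weight": [9.0, np.nan, 12.0, 15.0],
    "size": ["Small", None, "Medium", "Medium"],
})
test = pd.DataFrame({
    "weight": [np.nan, 20.0],
    "size": [None, "Small"],
})

# How many values are missing per column?
print(train.isna().sum())

# Learn the fill values from the training data only.
weight_median = train["weight"].median()
size_mode = train["size"].mode()[0]

# Apply the same fill values to both the training and the test data.
for part in (train, test):
    part["weight"] = part["weight"].fillna(weight_median)
    part["size"] = part["size"].fillna(size_mode)
```

The median is less sensitive to outliers than the mean, which is one argument you could make in the report. Other methods (for example group-wise or model-based imputation) may suit particular columns better.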

Step 5: Categorical Data Encoding

Do you need to encode the categorical data? What methods did you use and why?
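As an illustration, the sketch below applies ordinal encoding to a category with a natural order and one-hot encoding to a category without one. The columns "outlet_size" and "outlet_type", their levels, and the assumed order Small < Medium < High are all hypothetical.

```python
import pandas as pd

# Toy train/test frames; column names and category levels are hypothetical.
train = pd.DataFrame({
    "outlet_type": ["Grocery", "Supermarket", "Grocery"],
    "outlet_size": ["Small", "High", "Medium"],
})
test = pd.DataFrame({
    "outlet_type": ["Supermarket", "Kiosk"],
    "outlet_size": ["Medium", "Small"],
})

# Ordinal encoding: suitable when the categories have a natural order.
size_order = {"Small": 0, "Medium": 1, "High": 2}  # assumed order
train["outlet_size_enc"] = train["outlet_size"].map(size_order)
test["outlet_size_enc"] = test["outlet_size"].map(size_order)

# One-hot encoding: suitable for nominal categories without an order.
train_ohe = pd.get_dummies(train["outlet_type"], prefix="type", dtype=int)

# Align the test columns to those seen in training; a category that was
# not seen in training ("Kiosk") becomes a row of zeros.
test_ohe = pd.get_dummies(test["outlet_type"], prefix="type", dtype=int)
test_ohe = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)

print(train_ohe)
print(test_ohe)
```

Aligning the test columns to the training columns keeps the model input the same width in both sets, which a linear regression model requires.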

Step 6: Variable Discretization / Binning

Do you need to discretize / bin the numerical data? What methods did you use and why?
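Two common binning methods are sketched below on toy values: equal-width bins with pd.cut and equal-frequency bins with pd.qcut. The series name "price" and the choice of four bins are assumptions made for illustration.

```python
import pandas as pd

# Toy values standing in for a numerical column (not real data).
price = pd.Series([10, 20, 30, 40, 50, 60, 70, 200], name="price")

# Equal-width binning: each bin spans the same range of values,
# so a single extreme value can leave some bins empty.
equal_width = pd.cut(price, bins=4, labels=False)

# Equal-frequency binning: each bin holds roughly the same number of rows.
equal_freq = pd.qcut(price, q=4, labels=False)

print(pd.DataFrame({
    "price": price,
    "equal_width": equal_width,
    "equal_freq": equal_freq,
}))
```

With skewed data the two methods can give very different bins, so it is worth comparing both before deciding which one (if any) helps the model.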

Step 7: Feature Engineering

Do you need to scale the data? What method do you use and why? Do you create any new features/variables and why? Do you drop any features/variables and why?
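The sketch below shows one way these questions could be answered in code: deriving a new feature and standardizing the numerical features with scikit-learn's StandardScaler. The columns "price" and "established_year", the derived "outlet_age" and the reference year are all hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy train/test frames; column names and values are hypothetical.
train = pd.DataFrame({"price": [10.0, 20.0, 30.0],
                      "established_year": [1990, 2000, 2010]})
test = pd.DataFrame({"price": [40.0], "established_year": [2005]})

# New feature: outlet age relative to an assumed reference year.
REFERENCE_YEAR = 2020
for part in (train, test):
    part["outlet_age"] = REFERENCE_YEAR - part["established_year"]

# Keep the derived feature and drop the raw year from the model inputs.
features = ["price", "outlet_age"]

# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler().fit(train[features])
train_scaled = pd.DataFrame(scaler.transform(train[features]),
                            columns=features, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test[features]),
                           columns=features, index=test.index)

print(train_scaled)
print(test_scaled)
```

Standardization is only one option; min-max scaling is another. Scaling does not change the predictions of a plain linear regression model, but it makes the coefficients comparable across features, which is worth noting when you justify your choice.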

Step 8: Linear Regression Modelling

Build a linear regression model and evaluate its performance. Are you happy with the model performance? If not, please review Steps 2 to 7 and see whether you can wrangle the data further to improve the model performance.
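A minimal sketch of building and evaluating the model with scikit-learn is given below. It uses synthetic data generated from a known linear relationship, so the scores it prints say nothing about what to expect on the supermarket data; RMSE and R² are suggested metrics, not prescribed ones.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with a known linear relationship plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit on the training data, evaluate on the held-out test data.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # in the target's units
r2 = r2_score(y_test, y_pred)                       # share of variance explained

print("Coefficients:", model.coef_, "Intercept:", model.intercept_)
print("RMSE:", rmse, "R^2:", r2)
```

Comparing the same metrics on the training and test data helps to show whether the model overfits. If you log-transformed the target in Step 2, remember to transform the predictions back before computing an error in the original units.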

3. Suggested report format & content guidelines

Write an INDIVIDUAL report with the following sections (see the table below). Sample content is provided for each section. You are free to include any other relevant information you deem necessary. You are strongly encouraged to try different methods in each section and to provide a detailed comparison and discussion in the report.

(Note: For a page with 1-inch margins, 11-point Calibri font and minimal spacing elements, a good rule of thumb is 500 words per single-spaced page.)

| No. | Suggested Report Sections | Content Guidelines | Word Count |
|-----|---------------------------|--------------------|------------|
| 1 | Table of Contents | | NA |
| 2 | Introduction | Problem understanding | Min: 100 words, Max: 500 words |
| 3 | Explore the Data | The relationship between different variables / features | Min: 500 words, Max: 1000 words |
| 4 | Cleanse the Data | Missing data; outliers | Min: 500 words, Max: 1000 words |
| 5 | Data Transformation | Categorical data (e.g. one-hot encoding, ordinal label encoding, etc.); numerical data (e.g. log transformation, binning) | Min: 500 words, Max: 1000 words |
| 6 | Feature Engineering | Feature scaling; create new features / drop features | Min: 500 words, Max: 1000 words |
| 7 | Linear Regression Model | Build and evaluate the model | Min: 500 words, Max: 1000 words |
| 8 | Summary and Further Improvements | Summarize your findings; explain the possible further improvements | Min: 100 words, Max: 500 words |