In this assignment we will wrangle the data from a real-life dataset to understand different data wrangling techniques.
- To conduct data exploration, preparation and transformation through different methods
- To prepare the data for modeling, then build and evaluate a simple linear regression model
- To document the analysis, comparison and findings
- Dataset: supermarket sales forecast (regression problem)
The data (‘supermarket.csv’) have been collected at various supermarket outlets and stores in different cities. The aim is to predict the sales of each product at a particular outlet. Using these predictions, the supermarket management team will try to understand which properties of products and outlets play a key role in increasing sales.
Detailed information (i.e. column description) is provided below.
- Suggested tasks
We suggest that you complete this assignment by following the steps below.
Step 0: Exploratory Data Analysis (EDA)
Download the dataset from the system and conduct exploratory data analysis using TIBCO Spotfire. Investigate the relationships between different features/variables. Which features are likely to be helpful for making predictions?
ALL OF THE STEPS BELOW ARE TO BE DONE IN PYTHON.
Step 1: Load Data into Jupyter Notebook
Load the data into a DataFrame variable and provide an overview of it using the relevant functions (e.g. head(), info(), describe(), etc.).
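As a rough illustration of Step 1, the sketch below loads a tiny in-memory stand-in for the dataset and prints the usual overviews. The column names (Product_Weight, Product_Price, Outlet_Type, Sales) and the values are hypothetical placeholders, not the real columns of 'supermarket.csv'; in your notebook you would pass the file path to read_csv instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'supermarket.csv'; column names and values are invented.
csv_text = """Product_Weight,Product_Price,Outlet_Type,Sales
9.3,249.8,Supermarket,3735.1
5.9,48.3,Grocery,443.4
,141.6,Supermarket,2097.3
19.2,182.1,Grocery,732.4
"""

# In the assignment, replace the StringIO object with the path 'supermarket.csv'.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())        # first rows of the table
df.info()               # column dtypes and non-null counts
print(df.describe())    # summary statistics for numeric columns
print(df.isna().sum())  # number of missing values per column
```

The isna().sum() call is an optional extra that gives an early view of the missing values handled in Step 4.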
Step 2: Data Preprocessing
Are there any outliers? How did you identify them, and how did you deal with them? Are you happy with the distribution of the numerical variables? Do you need to transform the numerical variables using a suitable transformation method (e.g. log transformation, Box-Cox, etc.)?
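One possible way to approach Step 2 is sketched below: the 1.5 × IQR rule to flag outliers, capping as one way to handle them, and a log transformation to reduce right skew. The Sales values are invented for illustration, and the IQR rule and capping are only one option among several (z-scores, dropping rows, Box-Cox, and so on).

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sales values with one extreme point.
sales = pd.Series([50.0, 80.0, 120.0, 200.0, 350.0, 600.0, 1100.0, 5000.0], name="Sales")

# IQR rule: flag values further than 1.5 * IQR outside the quartiles.
q1, q3 = sales.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (sales < lower) | (sales > upper)
print(sales[is_outlier])

# Option A: cap extreme values at the IQR fences instead of dropping rows.
capped = sales.clip(lower=lower, upper=upper)

# Option B: log transformation to reduce right skew; log1p also handles zeros.
logged = np.log1p(sales)
print("skew before:", sales.skew(), "skew after:", logged.skew())
```

Comparing the skewness before and after the transformation is one simple way to justify the choice in your report, alongside histograms or box plots.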
Step 3: Train and Test Split
Split the data into training data (70%) and test data (30%).
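A minimal sketch of the 70/30 split using scikit-learn's train_test_split is shown below. The DataFrame is synthetic, and the column names and the random_state value are arbitrary assumptions chosen only to make the example reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Product_Price": rng.uniform(30, 270, size=100),
    "Sales": rng.uniform(100, 5000, size=100),
})

X = df.drop(columns="Sales")  # features
y = df["Sales"]               # target to predict

# test_size=0.3 keeps 30% of the rows for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)
```

Splitting before imputation, encoding and scaling means that any statistics used in the later steps can be learned from the training data only.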
Step 4: Missing Value Imputation
Are there any missing values? How did you handle them and why?
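The sketch below shows one simple imputation approach for Step 4, assuming a numeric column filled with the training median and a categorical column filled with the training mode. The column names and values are hypothetical, and median/mode imputation is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical train/test frames with missing values.
train = pd.DataFrame({
    "Product_Weight": [9.0, np.nan, 12.0, 15.0],
    "Outlet_Size": ["Small", "Medium", None, "Medium"],
})
test = pd.DataFrame({
    "Product_Weight": [np.nan, 20.0],
    "Outlet_Size": [None, "High"],
})

# Learn the fill values from the training data only, to avoid leaking test information.
fill_values = {
    "Product_Weight": train["Product_Weight"].median(),
    "Outlet_Size": train["Outlet_Size"].mode()[0],
}

# Apply the same fill values to both sets.
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)
print(test_filled)
```

Whatever method you choose, explaining why it suits each variable (for example, median for skewed numeric data) addresses the "why" part of the question.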
Step 5: Categorical Data Encoding
Do you need to encode the categorical data? What methods did you use and why?
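Two common encodings can be sketched as follows: an ordinal mapping for a category with a natural order, and one-hot encoding for a category without one. The columns Outlet_Size and Outlet_Type, their levels, and the assumed ordering are hypothetical examples.

```python
import pandas as pd

# Hypothetical categorical columns.
train = pd.DataFrame({
    "Outlet_Size": ["Small", "Medium", "High", "Medium"],
    "Outlet_Type": ["Grocery", "Supermarket", "Supermarket", "Grocery"],
})
test = pd.DataFrame({
    "Outlet_Size": ["High", "Small"],
    "Outlet_Type": ["Supermarket", "Supermarket"],
})

# Ordinal encoding: the order Small < Medium < High is an assumption.
size_order = {"Small": 0, "Medium": 1, "High": 2}
train_enc = train.assign(Outlet_Size=train["Outlet_Size"].map(size_order))
test_enc = test.assign(Outlet_Size=test["Outlet_Size"].map(size_order))

# One-hot encoding for the unordered category.
train_enc = pd.get_dummies(train_enc, columns=["Outlet_Type"], dtype=int)
test_enc = pd.get_dummies(test_enc, columns=["Outlet_Type"], dtype=int)

# Align the test columns to the training columns so both have identical features.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc)
```

The reindex step matters because a category level that is missing from the test set would otherwise produce a different set of columns.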
Step 6: Variable Discretization / Binning
Do you need to discretize/bin the numerical data? What methods did you use and why?
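To compare two binning strategies, the sketch below applies equal-width binning (pd.cut) and equal-frequency binning (pd.qcut) to the same invented price values. The column name, the number of bins and the labels are arbitrary assumptions.

```python
import pandas as pd

# Hypothetical right-skewed prices.
price = pd.Series([35.0, 40.0, 45.0, 50.0, 60.0, 80.0, 150.0, 260.0], name="Product_Price")

# Equal-width binning: every bin spans the same price range.
width_bins = pd.cut(price, bins=4, labels=["low", "mid", "high", "premium"])

# Equal-frequency binning: every bin holds roughly the same number of rows.
freq_bins = pd.qcut(price, q=4, labels=["q1", "q2", "q3", "q4"])

print(pd.DataFrame({"price": price, "equal_width": width_bins, "equal_freq": freq_bins}))
```

On skewed data, equal-width bins tend to put most rows into the lowest bin, whereas equal-frequency bins stay balanced; this kind of contrast is worth discussing when you justify your method.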
Step 7: Feature Engineering
Do you need to scale the data? What method did you use and why? Did you create any new features/variables, and why? Did you drop any features/variables, and why?
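The following sketch illustrates the three ideas in Step 7 on invented data: dropping an identifier column, deriving a new feature, and standardizing with a scaler fitted on the training data only. The column names, the helper engineer and its reference_year default are hypothetical and not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical columns for illustration.
train = pd.DataFrame({
    "Product_ID": ["A1", "B2", "C3", "D4"],
    "Product_Price": [50.0, 100.0, 150.0, 200.0],
    "Outlet_Year": [1987, 1999, 2004, 2009],
})
test = pd.DataFrame({
    "Product_ID": ["E5", "F6"],
    "Product_Price": [75.0, 220.0],
    "Outlet_Year": [1995, 2007],
})

def engineer(df, reference_year=2013):
    """Drop the identifier and replace the opening year with the outlet's age."""
    out = df.drop(columns=["Product_ID"])  # an ID carries no general signal
    out["Outlet_Age"] = reference_year - out["Outlet_Year"]  # assumed reference year
    return out.drop(columns=["Outlet_Year"])

train_fe, test_fe = engineer(train), engineer(test)

# Fit the scaler on the training data only, then reuse it on the test data.
scaler = StandardScaler()
train_scaled = pd.DataFrame(
    scaler.fit_transform(train_fe), columns=train_fe.columns, index=train_fe.index
)
test_scaled = pd.DataFrame(
    scaler.transform(test_fe), columns=test_fe.columns, index=test_fe.index
)
print(train_scaled.round(2))
```

Standardization is shown here as one option; min-max scaling or robust scaling may suit some variables better, and the report should explain the choice.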
Step 8: Linear Regression Modelling
Build a linear regression model and evaluate its performance. Are you happy with the model performance? If not, please review Steps 2 to 7 and see whether you can further wrangle the data to improve it.
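A minimal sketch of Step 8 is given below, fitting scikit-learn's LinearRegression and reporting RMSE and R² on the held-out test set. The data are synthetic (a single invented price feature with a linear relationship plus noise), so the resulting numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: one hypothetical price feature and a noisy linear sales target.
rng = np.random.default_rng(0)
X = rng.uniform(30, 270, size=(200, 1))
y = 15.0 * X[:, 0] + rng.normal(0, 100, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit on the training data, evaluate on the unseen test data.
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, pred))  # error in the units of the target
r2 = r2_score(y_test, pred)                       # share of variance explained
print(f"RMSE: {rmse:.1f}, R^2: {r2:.3f}")
```

Reporting the same metrics after each change to Steps 2 to 7 makes it easier to show which wrangling decisions actually improved the model.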
- Suggested report format & content guidelines
Write an INDIVIDUAL report with the following sections (see the table below). Sample content is provided for each section. You are free to include other relevant information you deem necessary in the sections. You are strongly encouraged to try different methods in each section and to provide a detailed comparison and discussion in the report.
(Note: For a page with 1-inch margins, 11-point Calibri font, and minimal spacing elements, a good rule of thumb is 500 words per single-spaced page.)
|No.|Suggested Report Sections & Content Guidelines|Word Count|
|---|---|---|
|1.|Table of Contents|NA|
|2.|Introduction: problem understanding|Min: 100 words, Max: 500 words|
|3.|Explore the Data: the relationships between different variables/features|Min: 500 words, Max: 1000 words|
|4.|Cleanse the Data: missing data; outliers|Min: 500 words, Max: 1000 words|
|5.|Data Transformation: categorical data (e.g. one-hot encoding, ordinal label encoding, etc.); numerical data (e.g. log transformation, binning)|Min: 500 words, Max: 1000 words|
|6.|Feature Engineering: feature scaling; create new features / drop features|Min: 500 words, Max: 1000 words|
|7.|Linear Regression Model: build and evaluate the model|Min: 500 words, Max: 1000 words|
|8.|Summary and Further Improvements: summarize your findings; explain the possible further improvements|Min: 100 words, Max: 500 words|