BAS 250 Lecture 2

BAS 250
Lesson 2: Data Preparation

This Week’s Learning Objectives
▪ Explain the concepts and purpose of Data Preparation
▪ Understand solutions for handling missing and inconsistent data
▪ Utilize data and attribute reduction techniques
▪ Work effectively in RapidMiner to prepare your data

The Data Mining Process: CRISP-DM

3. Data Preparation
o Join the data sets that are needed for your analysis
o Reduce data sets to include only pertinent variables
o Scrub data to remove anomalies such as outliers or missing data
o Reformat for consistency and effective use

Data Preparation
▪ Ensure robustness of data
o Combine two or more data sets to create a “mini-database” with all variables needed for analysis in one place.
o Merge by a unique identifier common to both data sets
▪ Also called a “Key Identifier”, “Common ID”, “ID Number”, etc.
▪ Example: Social Security Number (links Medical and Insurance data)

Data Preparation
Example: Sources of Data
Customer Purchases – “Point of Sale data” – CSV file format
Cost of Products Sold – “Accounting department” – Excel file format
Inventory of Products – “IT Data Warehouse” – XML file format
Merge by Product ID or SKU
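
The merge described above can be sketched outside RapidMiner in a few lines of Python. This is only an illustration under stated assumptions: the column names (`product_id`, `units_sold`), the sample values, and the helper `merge_by_product_id` are all hypothetical, and the cost and inventory sources are represented as plain dictionaries rather than real Excel or XML files.

```python
import csv
import io

# Hypothetical sample data; names and values are invented for illustration.
purchases_csv = """product_id,units_sold
A100,3
B200,5
A100,2
"""
costs = {"A100": 4.50, "B200": 7.25}      # stands in for the accounting spreadsheet
inventory = {"A100": 40, "B200": 12}      # stands in for the data warehouse export


def merge_by_product_id(purchase_rows, cost_by_id, stock_by_id):
    """Attach cost and inventory to each purchase row via the shared product_id key."""
    merged = []
    for row in purchase_rows:
        pid = row["product_id"]
        merged.append({
            "product_id": pid,
            "units_sold": int(row["units_sold"]),
            # .get() returns None when the key has no match, so unmatched rows stay visible
            "unit_cost": cost_by_id.get(pid),
            "in_stock": stock_by_id.get(pid),
        })
    return merged


purchase_rows = list(csv.DictReader(io.StringIO(purchases_csv)))
merged = merge_by_product_id(purchase_rows, costs, inventory)
for row in merged:
    print(row)
```

Each purchase keeps its own row, and the shared key pulls in the matching cost and inventory values so that all variables sit in one place.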

Data Preparation
▪ Data reduction has two parts:
o Observations (rows, records, instances, etc.)
o Attributes (variables, columns, etc.)

Data Preparation
▪ Attribute reduction filters out irrelevant or uninteresting attributes without removing them from the original data set.
▪ Even if a variable isn’t interesting for answering some questions, it may still be useful for others.
It is recommended to import all attributes first, then filter as necessary.
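
As a rough sketch of attribute reduction that leaves the original data untouched, the Python below builds a filtered copy. The attribute names and the helper `select_attributes` are hypothetical, chosen only for illustration; in the course this step is done with RapidMiner rather than code.

```python
# Hypothetical records; attribute names are invented for illustration.
original = [
    {"customer_id": 1, "age": 34, "favorite_color": "blue", "total_spent": 120.0},
    {"customer_id": 2, "age": 51, "favorite_color": "red", "total_spent": 75.5},
]


def select_attributes(rows, keep):
    """Return new rows holding only the attributes in `keep`; `rows` is not modified."""
    return [{name: row[name] for name in keep} for row in rows]


reduced = select_attributes(original, ["customer_id", "total_spent"])
print(reduced)
print(list(original[0].keys()))  # the original rows still hold every attribute
```

Because the filter produces a copy, an attribute dropped for one question remains available for another.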

Data Preparation
▪ Observation Reduction
▪ Observation reduction cuts the number of observations to create a smaller data set.
▪ Some reasons to do so:
o Create a sample set for:
▪ Training data, proof-of-concept analysis, testing theories, sharing data
o Improve analysis speed or processing time
o Scrub data for outliers, missing values, etc.
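
Two common ways to reduce observations, sampling and filtering, might look like this in Python. The data, the cutoff of 900.0, and the helper names `sample_observations` and `filter_observations` are assumptions made for the sketch, not part of the course material.

```python
import random

# 100 made-up observations with amounts 10.0, 20.0, ..., 1000.0
rows = [{"id": i, "amount": float(i * 10)} for i in range(1, 101)]


def sample_observations(rows, n, seed=42):
    """Draw n observations without replacement; a fixed seed makes the sample repeatable."""
    rng = random.Random(seed)
    return rng.sample(rows, n)


def filter_observations(rows, max_amount):
    """Keep only observations at or below a cutoff, e.g. to set aside suspected outliers."""
    return [r for r in rows if r["amount"] <= max_amount]


sample = sample_observations(rows, 10)
trimmed = filter_observations(rows, 900.0)
print(len(rows), len(sample), len(trimmed))
```

Neither function changes `rows`, so the full data set stays available if the sample or cutoff needs to be revisited.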

Data Preparation
▪ Ensure consistency of data
o Missing information
o Spelling errors, typos
o Multiple responses for an attribute
o Characters in numeric fields and vice versa

Data Preparation
KEY: Missing data is data that does not exist in a data set
• Not the same as zero or some other value
• In a dataset, it is blank and the value is unknown
• Sometimes referred to as null values
• Depending on your objective and the circumstances, you may
choose to leave missing data as they are or replace them with
some other value

Data Preparation
KEY: Inconsistent data is different from missing data
• Occurs when a value does exist but is not valid
or meaningful.
• Common examples: “.” or “zero”
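
To make the distinction between missing and inconsistent data concrete, here is a minimal Python sketch for a numeric attribute. The function `classify` and the sample values are hypothetical. Note one assumption: a zero is treated as valid here, because whether zero is a real value or a placeholder depends on the attribute and has to be decided case by case.

```python
def classify(value):
    """Label a raw field from a numeric column as 'missing', 'inconsistent', or 'valid'."""
    if value is None or value.strip() == "":
        return "missing"          # blank: the value is unknown
    if value.strip() == ".":
        return "inconsistent"     # a value exists, but it is only a placeholder
    try:
        float(value)
    except ValueError:
        return "inconsistent"     # characters in a numeric field
    return "valid"                # assumption: zero counts as valid in this sketch


raw_ages = ["34", "", ".", "abc", "0", None]
labels = [classify(v) for v in raw_ages]
print(list(zip(raw_ages, labels)))
```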

Data Preparation
Replace or remove missing or inconsistent data
• For numeric data…
• Can be replaced using Measures of Central Tendency
• Mean, Median, and Mode
• Mean - Average value
• Median - Middle value
• Mode - Most frequent or common value
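
As an illustration of replacing missing numeric values with a measure of central tendency, the sketch below uses Python's standard `statistics` module. The sample values and the helper `impute` are assumptions for the example; `None` stands for a missing value.

```python
import statistics

# Made-up numeric attribute; None marks a missing value.
values = [10.0, None, 12.0, 12.0, None, 50.0]


def impute(values, method="median"):
    """Replace None with the mean, median, or mode of the known values."""
    known = [v for v in values if v is not None]
    measures = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    fill = measures[method](known)
    return [fill if v is None else v for v in values]


print(impute(values, "mean"))
print(impute(values, "median"))
print(impute(values, "mode"))
```

In this sample the single large value (50.0) pulls the mean well above the median and mode, which is one reason the choice of measure should be made deliberately.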

Data Preparation
Replace or remove missing or inconsistent data
• For character data…
• Can be replaced using Best Estimated Value
• “Like Others”
• Example: all males in the data like bass fishing. If attribute “Fish Type” is blank and
attribute “Gender” equals male, then fill in “Bass”
• “Clustering Techniques”
• “Best Guess”
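
The “Like Others” idea can be sketched as filling a blank with the most common value among observations that share another attribute. This is one possible reading of the technique, not the only one; the records and the helper `fill_like_others` are hypothetical.

```python
from collections import Counter, defaultdict

# Made-up records; None marks a missing "fish_type".
people = [
    {"gender": "male", "fish_type": "Bass"},
    {"gender": "male", "fish_type": "Bass"},
    {"gender": "female", "fish_type": "Trout"},
    {"gender": "male", "fish_type": None},
]


def fill_like_others(rows, group_key, target_key):
    """Fill a missing target with the most common target value in the same group."""
    counts = defaultdict(Counter)
    for row in rows:
        if row[target_key] is not None:
            counts[row[group_key]][row[target_key]] += 1
    filled = []
    for row in rows:
        new_row = dict(row)  # copy, so the original rows are left untouched
        group_counts = counts[new_row[group_key]]
        if new_row[target_key] is None and group_counts:
            new_row[target_key] = group_counts.most_common(1)[0][0]
        filled.append(new_row)
    return filled


filled = fill_like_others(people, "gender", "fish_type")
print(filled[3])
```

If a group has no known values at all, the blank is left as it is rather than guessed.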

Data Preparation
• Replacing missing or inconsistent values found in
data should be done:
• With intention, not haphazardly
• Use common sense
• Be transparent
It is recommended to always document your processes for
handling missing or inconsistent data.

Basics of RapidMiner
▪ This course is a practical application course in Data Mining. Learning to use RapidMiner is required.
▪ If you have not done so yet, please plan to walk through the tutorial examples in RapidMiner.
▪ To assist you in understanding RapidMiner, I will take screenshots of what I am doing to get the results we are looking for.
▪ RapidMiner is pretty intuitive. You will get it quickly.

▪ Types of files that can be imported into RapidMiner:
o CSV File
o Excel File
o XML File
o Access Database Table
o … and many more
▪ We mainly use CSV files, which contain Comma Separated Values. Be mindful if the values in your dataset contain commas.
o Alternative delimiters can be selected in this case:
▪ Tab
▪ Semicolon
▪ Pipe ( | ), etc.
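
To see why commas inside values matter, the Python sketch below compares a naive split on commas with two safer options: an alternative delimiter, and quoted values in a standard CSV. It uses Python's built-in `csv` module rather than RapidMiner's import wizard, and the sample text is made up.

```python
import csv
import io

# A naive split breaks a value that itself contains a comma.
naive = "Smith, John,Raleigh".split(",")
print(naive)        # three pieces instead of two

# Option 1: use a pipe as the delimiter so commas in values are harmless.
pipe_text = "name|city\nSmith, John|Raleigh, NC\n"
pipe_rows = list(csv.reader(io.StringIO(pipe_text), delimiter="|"))
print(pipe_rows)

# Option 2: keep commas as the delimiter but wrap such values in double quotes.
quoted_row = next(csv.reader(io.StringIO('"Smith, John",Raleigh\n')))
print(quoted_row)
```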

▪ Three main areas that contain useful tools in RapidMiner:
o Operators – Every possible task you can think of
o Repositories – Where you store your data
o Parameters – Task setup details

Summary
▪ Explain the concepts and purpose of Data Preparation
▪ Understand solutions for handling missing and inconsistent data
▪ Utilize data and attribute reduction techniques
▪ Work effectively in RapidMiner to prepare your data

“This workforce solution was funded by a grant awarded by the U.S. Department of Labor’s
Employment and Training Administration. The solution was created by the grantee and does not
necessarily reflect the official position of the U.S. Department of Labor. The Department of Labor
makes no guarantees, warranties, or assurances of any kind, express or implied, with respect to such
information, including any information on linked sites and including, but not limited to, accuracy of the
information or its completeness, timeliness, usefulness, adequacy, continued availability, or ownership.”
Except where otherwise stated, this work by Wake Technical Community College Building Capacity in
Business Analytics, a Department of Labor, TAACCCT funded project, is licensed under the Creative
Commons Attribution 4.0 International License. To view a copy of this license, visit
http://creativecommons.org/licenses/by/4.0/
Copyright Information
