This document covers the key objectives and concepts of data preparation, including joining and reducing data from multiple sources, handling missing and inconsistent data, and using RapidMiner for data preparation tasks. The goals are to ensure robust and consistent data by combining relevant data sources, filtering out unnecessary variables, addressing anomalies, and reformatting data for effective analysis. RapidMiner operators allow importing, merging, and transforming data from various file formats to prepare it for modeling.
2. Explain concepts and purpose of Data Preparation
Understand solutions for handling missing and inconsistent data
Utilize data and attribute reduction techniques
Effectively work in RapidMiner to prepare your data.
This Week’s Learning Objectives
4. o Join data sets that are needed for your analysis
o Reduce data sets to only include pertinent variables
o Scrub data to remove anomalies such as outliers or missing data
o Reformat for consistency and effective use
3. Data Preparation
5. Ensure robustness of data
o Combine 2 or more data sets to create a “mini-database” with all variables needed for analysis in one place.
o Merge by a unique identifier common to both data sets, often called a “Key Identifier”, “Common ID”, “ID Number”, etc.
Example: Social Security Number (links medical and insurance records)
Data Preparation
6. Data Preparation
Example: Sources of Data
Customer Purchases - “Point of Sale data” - CSV file format
Cost of Products Sold - “Accounting department” - Excel file format
Inventory of Products - “IT Data Warehouse” - XML file format
Merge by Product ID or SKU
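The merge described above can be sketched outside RapidMiner in a few lines of Python. This is only an illustration of the idea, not the RapidMiner process itself: the records, the field names (`product_id`, `units_sold`, `unit_cost`, `in_stock`), and the helper `merge_by_product_id` are all made up for this example, and the three sources are shown as in-memory data rather than real CSV, Excel, or XML files.

```python
# Hypothetical records standing in for the three sources named above.
purchases = [  # point-of-sale data (would come from a CSV file)
    {"product_id": "A100", "units_sold": 3},
    {"product_id": "B200", "units_sold": 1},
    {"product_id": "A100", "units_sold": 2},
]
costs = {"A100": 4.50, "B200": 12.00}  # accounting data (would come from Excel)
inventory = {"A100": 40, "B200": 0}    # data warehouse (would come from XML)


def merge_by_product_id(purchases, costs, inventory):
    """Attach cost and stock level to each purchase row via product_id."""
    merged = []
    for row in purchases:
        key = row["product_id"]  # the common identifier shared by all sources
        merged.append({
            **row,
            "unit_cost": costs.get(key),     # None if the ID has no match
            "in_stock": inventory.get(key),  # None if the ID has no match
        })
    return merged


merged = merge_by_product_id(purchases, costs, inventory)
```

Each row of `merged` now holds every variable in one place. Using `.get` leaves unmatched IDs as `None`, so a failed match shows up as missing data instead of stopping the merge.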
7. Data Reduction has two parts:
o Observations (rows, records, instances, etc.)
o Attributes (variables, columns, etc.)
Data Preparation
8. Attribute reduction filters out irrelevant or uninteresting attributes without completely removing them from the original set.
Even if a variable isn’t interesting for answering some questions, it may still be useful for others.
It is recommended to import all attributes first, then filter as necessary.
Data Preparation
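As a rough Python illustration of attribute reduction (not the RapidMiner workflow), the sketch below builds a filtered copy and leaves the full data set untouched. The helper `select_attributes` and the sample attributes are assumptions made for this example.

```python
def select_attributes(rows, keep):
    """Return new rows holding only the attributes listed in `keep`."""
    return [{name: row[name] for name in keep} for row in rows]


# Hypothetical data set with all attributes imported first.
full_data = [
    {"id": 1, "age": 34, "zip": "27603", "favorite_color": "blue"},
    {"id": 2, "age": 51, "zip": "27513", "favorite_color": "green"},
]

# Filter down to the attributes needed for the current question.
reduced = select_attributes(full_data, keep=["id", "age"])
```

Because `full_data` is not modified, an attribute dropped here is still available when a later question needs it.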
9. Observation Reduction
Observation reduction reduces the number of observations to create a smaller data set.
Some reasons to do so:
o Create a sample set for:
Training data, proof of concept analysis, testing theories, sharing data
o Improve analysis speed or process time
o Data scrubbing for outliers, missing values, etc.
Data Preparation
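One common way to create a sample set is simple random sampling. The Python sketch below is a minimal illustration under stated assumptions: the helper name `sample_observations`, the 10% fraction, and the fixed seed are choices made for this example, not settings from the course.

```python
import random


def sample_observations(rows, fraction, seed=42):
    """Draw a random sample of rows without replacement."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    k = max(1, round(len(rows) * fraction))
    return rng.sample(rows, k)


# Hypothetical data set of 100 observations.
observations = [{"id": i} for i in range(1, 101)]

# Keep roughly 10% of the observations for a proof-of-concept analysis.
sample = sample_observations(observations, fraction=0.10)
```

Fixing the seed means the same sample can be drawn again later, which helps when sharing data or repeating a test.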
10. Ensure consistency of data
o Missing information
o Spelling errors, typos
o Multiple responses for an attribute
o Characters in numeric fields and vice versa
Data Preparation
11. Ensure consistency of data
Data Preparation
KEY: Missing data is data that does not exist in a data set
• Not the same as zero or some other value
• In a dataset, it is blank and the value is unknown
• Sometimes referred to as null values
• Depending on your objective and the circumstance, you may choose to leave missing data as they are or replace them with some other value
12. Ensure consistency of data
Data Preparation
KEY: Inconsistent data is different from missing data
• Occurs when a value does exist but is not valid or meaningful.
• Common examples: “.” or “zero”
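The difference between missing and inconsistent data can be made concrete with a short Python sketch. It is an illustration only: the function names and the validity rule (an age must be a whole number from 1 to 120, so a zero counts as a placeholder) are assumptions for this example, and a real data set needs its own rule for each attribute.

```python
def classify_value(value, is_valid):
    """Label a raw value as 'missing', 'inconsistent', or 'ok'."""
    if value is None or (isinstance(value, str) and value.strip() == ""):
        return "missing"       # blank: the value is unknown
    if not is_valid(value):
        return "inconsistent"  # present, but not valid or meaningful
    return "ok"


def valid_age(value):
    """Hypothetical rule: an age must be a whole number from 1 to 120."""
    try:
        return 1 <= int(value) <= 120
    except (TypeError, ValueError):
        return False


raw_ages = ["34", "", ".", "0", None, "51"]
labels = [classify_value(v, valid_age) for v in raw_ages]
```

Here the blank string and `None` are labeled missing, while “.” and “0” are labeled inconsistent because a value exists but fails the rule.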
13. Ensure consistency of data
Data Preparation
Replace or remove missing or inconsistent data
• For numeric data…
• Can be replaced using Measures of Central Tendency
• Mean, Median, and Mode
• Mean - Average value
• Median - Middle value
• Mode - Most frequent or common value
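The three measures can give quite different replacement values. The Python sketch below uses the standard `statistics` module to show this; the helper `fill_missing` and the sample figures are made up for illustration, and missing values are assumed to be stored as `None`.

```python
from statistics import mean, median, mode


def fill_missing(values, strategy):
    """Replace None entries with a statistic computed from the known values."""
    known = [v for v in values if v is not None]
    replacement = strategy(known)
    return [replacement if v is None else v for v in values]


# Hypothetical numeric attribute with one missing value.
incomes = [10, 10, None, 30, 50, 100]

filled_mean = fill_missing(incomes, mean)      # missing value becomes 40
filled_median = fill_missing(incomes, median)  # missing value becomes 30
filled_mode = fill_missing(incomes, mode)      # missing value becomes 10
```

With skewed data like this, the mean is pulled toward the large value, so the median or mode may be the more representative choice.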
14. Ensure consistency of data
Data Preparation
Replace or remove missing or inconsistent data
• For character data…
• Can be replaced using Best Estimated Value
• “Like Others”
• Ex. All males in the data like bass fishing. If attribute “Fish Type” is blank and attribute “Gender” equals male, then fill in “Bass”
• “Clustering Techniques”
• “Best Guess”
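The “Like Others” approach can be sketched in Python as filling a blank with the most common value among similar records. This is one possible reading of the technique, not a definitive method: the helper `fill_like_others`, the field names, and the sample rows are assumptions for this example.

```python
from collections import Counter


def fill_like_others(rows, target, group_by):
    """Fill blank `target` values with the most common value in the same group."""
    # Count the known target values within each group.
    counts = {}
    for row in rows:
        if row[target]:
            counts.setdefault(row[group_by], Counter())[row[target]] += 1

    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the original rows stay unchanged
        if not new_row[target] and new_row[group_by] in counts:
            new_row[target] = counts[new_row[group_by]].most_common(1)[0][0]
        filled.append(new_row)
    return filled


# Hypothetical data echoing the bass fishing example above.
anglers = [
    {"gender": "male", "fish_type": "Bass"},
    {"gender": "male", "fish_type": "Bass"},
    {"gender": "female", "fish_type": "Trout"},
    {"gender": "male", "fish_type": ""},
]
filled = fill_like_others(anglers, target="fish_type", group_by="gender")
```

The last row is filled with “Bass” because that is the most common value among the other male records. The original `anglers` list is left as it was, which keeps the replacement easy to document and undo.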
15. Ensure consistency of data
Data Preparation
• Replacing missing or inconsistent values found in data should be done:
• With intention, not haphazardly
• With common sense
• Transparently
It is recommended to always document your processes for handling missing or inconsistent data.
16. This course is a practical application course in Data Mining. Learning to
use RapidMiner is required.
If you have not done so yet, please plan to walk through the tutorial
examples in RapidMiner.
To assist you in understanding RapidMiner, I will take screenshots of what
I am doing to get the results we are looking for.
RapidMiner is pretty intuitive. You will get it quickly.
Basics of RapidMiner
17. Types of files that can be imported into RapidMiner:
o CSV File
o Excel File
o XML File
o Access Database Table
o … and much more
We mainly use CSV files, which contain comma-separated values. Be mindful if the values in your dataset themselves contain commas.
o Alternative delimiters can be selected in this case:
Tab
Semicolon
Pipe ( | ), etc.
Basics of RapidMiner
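To see why commas inside the data matter, the Python sketch below reads the same record two ways with the standard `csv` module: once with a pipe delimiter, and once comma-delimited with the fields quoted. It illustrates the delimiter issue only and is not RapidMiner's import step. The sample text is made up and held in memory instead of in a file.

```python
import csv
import io

# Hypothetical pipe-delimited text: commas inside values cause no trouble.
pipe_text = "name|city|amount\nSmith, John|Raleigh, NC|19.99\n"
rows = list(csv.DictReader(io.StringIO(pipe_text), delimiter="|"))

# The same record in a comma-delimited file needs quotes around such fields.
comma_text = 'name,city,amount\n"Smith, John","Raleigh, NC",19.99\n'
quoted_rows = list(csv.DictReader(io.StringIO(comma_text)))
```

Both reads produce the same three fields. Without the quotes, the comma-delimited line would split into five values instead of three.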
18. Three main areas that contain useful tools in
RapidMiner:
o Operators – Every possible task you can think of
o Repositories – Where you store your data
o Parameters – Task set up details
Basics of RapidMiner
26. Explain concepts and purpose of Data Preparation
Understand solutions for handling missing and inconsistent data
Utilize data and attribute reduction techniques
Effectively work in RapidMiner to prepare your data.
Summary
27. “This workforce solution was funded by a grant awarded by the U.S. Department of Labor’s
Employment and Training Administration. The solution was created by the grantee and does not
necessarily reflect the official position of the U.S. Department of Labor. The Department of Labor
makes no guarantees, warranties, or assurances of any kind, express or implied, with respect to such
information, including any information on linked sites and including, but not limited to, accuracy of the
information or its completeness, timeliness, usefulness, adequacy, continued availability, or ownership.”
Except where otherwise stated, this work by Wake Technical Community College Building Capacity in
Business Analytics, a Department of Labor, TAACCCT funded project, is licensed under the Creative
Commons Attribution 4.0 International License. To view a copy of this license, visit
http://creativecommons.org/licenses/by/4.0/
Copyright Information