SlideShare uma empresa Scribd logo
1 de 27
BAS 250
Lesson 2: Data Preparation
 Explain concepts and purpose of Data Preparation
 Understand solutions for handling missing and
inconsistent data
 Utilize data and attribute reduction techniques
 Effectively work in RapidMiner to prepare your data.
This Week’s Learning Objectives
The Data Mining Process: CRISP-DM
o Join data sets that are needed for your analysis
o Reduce data sets to only include pertinent
variables
o Scrub data to remove anomalies- outliers or
missing data
o Reformat for consistency and effective use
3. Data Preparation
 Ensure robustness of data
o Combine more 2 or more data sets to create a “mini – database”
with all variables needed for analysis in one place.
o Merge by a unique identifier common to both data sets
 “Key Identifier”, “Common ID”, “ID Number”, etc.
 Example: Social Security Number (links Medical and Insurance)
Data Preparation
Data Preparation
Example: Sources of Data
Customer Purchases - “Point of Sale data” – CSV file format
Cost of Products Sold – “Accounting department” – Excel file format
Inventory of Products - “ IT Data Warehouse” - XML file format
Merge By Product ID or SKU
 Data Reduction…two part
o Observations (rows, instances, etc)
o Attributes (variables, records, columns, etc)
Data Preparation
 Attribute reduction to filter out irrelevant or
uninteresting data without completely removing them
from the original set.
 Even if a variable isn’t interesting for answering some
questions, it may still be useful in others.
It is recommended to import all attributes first, then filter as necessary
Data Preparation
 Observation Reduction…
 Observation reduction is to reduce the # of observations to create a
smaller data set.
 Some reasons to do so:
o Create a sample set for:
 Training data, proof of concept analysis, testing theories, sharing data
o Improve analysis speed or process time
o Data scrubbing for outliers, missing values, etc.
Data Preparation
 Ensure consistency of data
o Missing information
o Spelling errors, typos
o Multiple responses for an attribute
o Characters in numeric fields and vice-versa
Data Preparation
 Ensure consistency of data
Data Preparation
KEY: Missing data is data that does not exist in a data set
• Not the same as zero or some other value
• In a dataset, it is blank and the value is unknown
• Sometimes referred to as null values
• Depending on your objective and the circumstance, you may
choose to leave missing data as they are or replace with some
other value
 Ensure consistency of data
Data Preparation
KEY: Inconsistent data is different from missing data
• Occurs when a value does exist but its value is not valid
or meaningful.
• Common = “.” or “zero”
 Ensure consistency of data
Data Preparation
Replace or remove missing or inconsistent data
• For numeric data…
• Can be replaced using Measures of Central Tendency
• Mean, Median, and Mode
• Mean - Average value
• Median - Middle value
• Mode - Most frequent or common value
 Ensure consistency of data
Data Preparation
Replace or remove missing or inconsistent data
• For character data…
• Can be replaced using Best Estimated Value
• “Like Others”
• Ex. All males in data like bass fishing. If attribute “Fish Type” is blank and
attribute “Gender” equals male, then “Bass”
• “Clustering Techniques”
• “Best Guess”
 Ensure consistency of data
Data Preparation
• Replacing missing or inconsistent values found in
data should be done:
• With intention, not haphazardly
• Use common sense
• Be transparent
It is recommended to always document your
missing or consistent data processes.
 This course is a practical application course in Data Mining. Learning to
use RapidMiner is required.
 If you have not done so yet, please plan to walk through the tutorial
examples in RapidMiner.
 To assist you in understanding RapidMiner, I will take screenshots of what
I am doing to get the results we are looking for.
 RapidMiner is pretty intuitive. You will get it quickly.
Basics of RapidMiner
 Types of files that can be imported into RapidMiner:
o CSV File
o Excel File
o XML File
o Access Database Table
o … and much more
 We use mainly CSV files which contain Comma Separated Values- be mindful if
your dataset contains commas
o Alternative delimiters can be selected in this case:
 Tab
 Semicolon
 Pipe ( l ), etc.
Basics of RapidMiner
 Three main areas that contain useful tools in
RapidMiner:
o Operators – Every possible task you can think of
o Repositories – Where you store your data
o Parameters – Task set up details
Basics of RapidMiner
Basics of RapidMiner
Basics of RapidMiner
Basics of RapidMiner
Basics of RapidMiner
Basics of RapidMiner
Basics of RapidMiner
Basics of RapidMiner
 Explain concepts and purpose of Data Preparation
 Understand solutions for handling missing and inconsistent
data
 Utilize data and attribute reduction techniques
 Effectively work in RapidMiner to prepare your data.
Summary
“This workforce solution was funded by a grant awarded by the U.S. Department of Labor’s
Employment and Training Administration. The solution was created by the grantee and does not
necessarily reflect the official position of the U.S. Department of Labor. The Department of Labor
makes no guarantees, warranties, or assurances of any kind, express or implied, with respect to such
information, including any information on linked sites and including, but not limited to, accuracy of the
information or its completeness, timeliness, usefulness, adequacy, continued availability, or ownership.”
Except where otherwise stated, this work by Wake Technical Community College Building Capacity in
Business Analytics, a Department of Labor, TAACCCT funded project, is licensed under the Creative
Commons Attribution 4.0 International License. To view a copy of this license, visit
http://creativecommons.org/licenses/by/4.0/
Copyright Information

Mais conteúdo relacionado

Mais procurados

Introduction to data analytics
Introduction to data analyticsIntroduction to data analytics
Introduction to data analyticsUmasree Raghunath
 
What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...
What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...
What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...Edureka!
 
SAS/MIT/Sloan Data Analytics
SAS/MIT/Sloan Data AnalyticsSAS/MIT/Sloan Data Analytics
SAS/MIT/Sloan Data AnalyticsSteven Kimber
 
Introducing SPSS customer overview
Introducing SPSS customer overviewIntroducing SPSS customer overview
Introducing SPSS customer overviewebuc
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data scienceVignesh Prajapati
 
Data Science Project Lifecycle
Data Science Project LifecycleData Science Project Lifecycle
Data Science Project LifecycleJason Geng
 
Challenges in business analytics
Challenges in business analyticsChallenges in business analytics
Challenges in business analyticsMiklos Koren
 
Big data and data science overview
Big data and data science overviewBig data and data science overview
Big data and data science overviewColleen Farrelly
 
Analysis of ‘Unstructured’ Data
Analysis of ‘Unstructured’ DataAnalysis of ‘Unstructured’ Data
Analysis of ‘Unstructured’ DataSeth Grimes
 
Leveraging Data Science in the Automotive Industry
Leveraging Data Science in the Automotive IndustryLeveraging Data Science in the Automotive Industry
Leveraging Data Science in the Automotive IndustryDomino Data Lab
 
How to Become a Data Analyst? | Data Analyst Skills | Data Analyst Training |...
How to Become a Data Analyst? | Data Analyst Skills | Data Analyst Training |...How to Become a Data Analyst? | Data Analyst Skills | Data Analyst Training |...
How to Become a Data Analyst? | Data Analyst Skills | Data Analyst Training |...Edureka!
 

Mais procurados (20)

Introduction to Data Analytics
Introduction to Data AnalyticsIntroduction to Data Analytics
Introduction to Data Analytics
 
Data analytics
Data analyticsData analytics
Data analytics
 
Data analytics
Data analyticsData analytics
Data analytics
 
Data Analytics
Data AnalyticsData Analytics
Data Analytics
 
Introduction to data analytics
Introduction to data analyticsIntroduction to data analytics
Introduction to data analytics
 
What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...
What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...
What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...
 
SAS/MIT/Sloan Data Analytics
SAS/MIT/Sloan Data AnalyticsSAS/MIT/Sloan Data Analytics
SAS/MIT/Sloan Data Analytics
 
Introducing SPSS customer overview
Introducing SPSS customer overviewIntroducing SPSS customer overview
Introducing SPSS customer overview
 
Data analytics
Data analyticsData analytics
Data analytics
 
Data Mining Technique - SEMMA
Data Mining Technique - SEMMAData Mining Technique - SEMMA
Data Mining Technique - SEMMA
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Data Science Project Lifecycle
Data Science Project LifecycleData Science Project Lifecycle
Data Science Project Lifecycle
 
Challenges in business analytics
Challenges in business analyticsChallenges in business analytics
Challenges in business analytics
 
Big data and data science overview
Big data and data science overviewBig data and data science overview
Big data and data science overview
 
Analysis of ‘Unstructured’ Data
Analysis of ‘Unstructured’ DataAnalysis of ‘Unstructured’ Data
Analysis of ‘Unstructured’ Data
 
Reports vs analysis
Reports vs analysisReports vs analysis
Reports vs analysis
 
Analytics
AnalyticsAnalytics
Analytics
 
Data Science Project Lifecycle and Skill Set
Data Science Project Lifecycle and Skill SetData Science Project Lifecycle and Skill Set
Data Science Project Lifecycle and Skill Set
 
Leveraging Data Science in the Automotive Industry
Leveraging Data Science in the Automotive IndustryLeveraging Data Science in the Automotive Industry
Leveraging Data Science in the Automotive Industry
 
How to Become a Data Analyst? | Data Analyst Skills | Data Analyst Training |...
How to Become a Data Analyst? | Data Analyst Skills | Data Analyst Training |...How to Become a Data Analyst? | Data Analyst Skills | Data Analyst Training |...
How to Become a Data Analyst? | Data Analyst Skills | Data Analyst Training |...
 

Destaque

Learning SAS by Example -A Programmer’s Guide by Ron CodySolution
Learning SAS by Example -A Programmer’s Guide by Ron CodySolutionLearning SAS by Example -A Programmer’s Guide by Ron CodySolution
Learning SAS by Example -A Programmer’s Guide by Ron CodySolutionVibeesh CS
 
Basics Of SAS Programming Language
Basics Of SAS Programming LanguageBasics Of SAS Programming Language
Basics Of SAS Programming Languageguest2160992
 
SAS Training session - By Pratima
SAS Training session  -  By Pratima SAS Training session  -  By Pratima
SAS Training session - By Pratima Pratima Pandey
 
SAS Ron Cody Solutions for even Number problems from Chapter 16 to 20
SAS Ron Cody Solutions for even Number problems from Chapter 16 to 20SAS Ron Cody Solutions for even Number problems from Chapter 16 to 20
SAS Ron Cody Solutions for even Number problems from Chapter 16 to 20Ayapparaj SKS
 
Where Vs If Statement
Where Vs If StatementWhere Vs If Statement
Where Vs If StatementSunil Gupta
 
Base 9.1 preparation guide
Base 9.1 preparation guideBase 9.1 preparation guide
Base 9.1 preparation guideimaduddin91
 
Analytics with SAS
Analytics with SASAnalytics with SAS
Analytics with SASEdureka!
 
Base SAS Full Sample Paper
Base SAS Full Sample Paper Base SAS Full Sample Paper
Base SAS Full Sample Paper Jimmy Rana
 
Statistical analytical programming for social media analysis .
Statistical analytical programming for social media analysis .Statistical analytical programming for social media analysis .
Statistical analytical programming for social media analysis .Felicita Florence
 
Base SAS Exam Questions
Base SAS Exam QuestionsBase SAS Exam Questions
Base SAS Exam Questionsguestc45097
 
Big Data Career Path | Big Data Learning Path | Hadoop Tutorial | Edureka
Big Data Career Path | Big Data Learning Path | Hadoop Tutorial | EdurekaBig Data Career Path | Big Data Learning Path | Hadoop Tutorial | Edureka
Big Data Career Path | Big Data Learning Path | Hadoop Tutorial | EdurekaEdureka!
 
Deep learning - Conceptual understanding and applications
Deep learning - Conceptual understanding and applicationsDeep learning - Conceptual understanding and applications
Deep learning - Conceptual understanding and applicationsBuhwan Jeong
 
The Second Little Book of Leadership
The Second Little Book of LeadershipThe Second Little Book of Leadership
The Second Little Book of LeadershipPhil Dourado
 
Best Presentation About Infosys
Best Presentation About InfosysBest Presentation About Infosys
Best Presentation About InfosysDurgadatta Dash
 

Destaque (20)

Learning SAS by Example -A Programmer’s Guide by Ron CodySolution
Learning SAS by Example -A Programmer’s Guide by Ron CodySolutionLearning SAS by Example -A Programmer’s Guide by Ron CodySolution
Learning SAS by Example -A Programmer’s Guide by Ron CodySolution
 
SAS BASICS
SAS BASICSSAS BASICS
SAS BASICS
 
SAS basics Step by step learning
SAS basics Step by step learningSAS basics Step by step learning
SAS basics Step by step learning
 
Basics Of SAS Programming Language
Basics Of SAS Programming LanguageBasics Of SAS Programming Language
Basics Of SAS Programming Language
 
SAS Training session - By Pratima
SAS Training session  -  By Pratima SAS Training session  -  By Pratima
SAS Training session - By Pratima
 
BAS 250 Lecture 1
BAS 250 Lecture 1BAS 250 Lecture 1
BAS 250 Lecture 1
 
SAS Ron Cody Solutions for even Number problems from Chapter 16 to 20
SAS Ron Cody Solutions for even Number problems from Chapter 16 to 20SAS Ron Cody Solutions for even Number problems from Chapter 16 to 20
SAS Ron Cody Solutions for even Number problems from Chapter 16 to 20
 
SAS TRAINING
SAS TRAININGSAS TRAINING
SAS TRAINING
 
BAS 250 Lecture 8
BAS 250 Lecture 8BAS 250 Lecture 8
BAS 250 Lecture 8
 
Where Vs If Statement
Where Vs If StatementWhere Vs If Statement
Where Vs If Statement
 
Base 9.1 preparation guide
Base 9.1 preparation guideBase 9.1 preparation guide
Base 9.1 preparation guide
 
Analytics with SAS
Analytics with SASAnalytics with SAS
Analytics with SAS
 
Sas demo
Sas demoSas demo
Sas demo
 
Base SAS Full Sample Paper
Base SAS Full Sample Paper Base SAS Full Sample Paper
Base SAS Full Sample Paper
 
Statistical analytical programming for social media analysis .
Statistical analytical programming for social media analysis .Statistical analytical programming for social media analysis .
Statistical analytical programming for social media analysis .
 
Base SAS Exam Questions
Base SAS Exam QuestionsBase SAS Exam Questions
Base SAS Exam Questions
 
Big Data Career Path | Big Data Learning Path | Hadoop Tutorial | Edureka
Big Data Career Path | Big Data Learning Path | Hadoop Tutorial | EdurekaBig Data Career Path | Big Data Learning Path | Hadoop Tutorial | Edureka
Big Data Career Path | Big Data Learning Path | Hadoop Tutorial | Edureka
 
Deep learning - Conceptual understanding and applications
Deep learning - Conceptual understanding and applicationsDeep learning - Conceptual understanding and applications
Deep learning - Conceptual understanding and applications
 
The Second Little Book of Leadership
The Second Little Book of LeadershipThe Second Little Book of Leadership
The Second Little Book of Leadership
 
Best Presentation About Infosys
Best Presentation About InfosysBest Presentation About Infosys
Best Presentation About Infosys
 

Semelhante a BAS 250 Lecture 2

data wrangling (1).pptx kjhiukjhknjbnkjh
data wrangling (1).pptx kjhiukjhknjbnkjhdata wrangling (1).pptx kjhiukjhknjbnkjh
data wrangling (1).pptx kjhiukjhknjbnkjhVISHALMARWADE1
 
Data Processing & Explain each term in details.pptx
Data Processing & Explain each term in details.pptxData Processing & Explain each term in details.pptx
Data Processing & Explain each term in details.pptxPratikshaSurve4
 
ML-ChapterTwo-Data Preprocessing.ppt
ML-ChapterTwo-Data Preprocessing.pptML-ChapterTwo-Data Preprocessing.ppt
ML-ChapterTwo-Data Preprocessing.pptbelay41
 
Data preparation and processing chapter 2
Data preparation and processing chapter  2Data preparation and processing chapter  2
Data preparation and processing chapter 2Mahmoud Alfarra
 
Top 30 Data Analyst Interview Questions.pdf
Top 30 Data Analyst Interview Questions.pdfTop 30 Data Analyst Interview Questions.pdf
Top 30 Data Analyst Interview Questions.pdfShaikSikindar1
 
Pandas Data Cleaning and Preprocessing PPT.pptx
Pandas Data Cleaning and Preprocessing PPT.pptxPandas Data Cleaning and Preprocessing PPT.pptx
Pandas Data Cleaning and Preprocessing PPT.pptxbajajrishabh96tech
 
Data Quality: principles, approaches, and best practices
Data Quality: principles, approaches, and best practicesData Quality: principles, approaches, and best practices
Data Quality: principles, approaches, and best practicesCarl Anderson
 
Barga Data Science lecture 2
Barga Data Science lecture 2Barga Data Science lecture 2
Barga Data Science lecture 2Roger Barga
 
4 Data preparation and processing
4  Data preparation and processing4  Data preparation and processing
4 Data preparation and processingMahmoud Alfarra
 
Knowledge discovery claudiad amato
Knowledge discovery claudiad amatoKnowledge discovery claudiad amato
Knowledge discovery claudiad amatoSSSW
 
KNOLX_Data_preprocessing
KNOLX_Data_preprocessingKNOLX_Data_preprocessing
KNOLX_Data_preprocessingKnoldus Inc.
 
Mba ii rm unit-4.1 data analysis & presentation a
Mba ii rm unit-4.1 data analysis & presentation aMba ii rm unit-4.1 data analysis & presentation a
Mba ii rm unit-4.1 data analysis & presentation aRai University
 
BDA TAE 2 (BMEB 83).pptx
BDA TAE 2 (BMEB 83).pptxBDA TAE 2 (BMEB 83).pptx
BDA TAE 2 (BMEB 83).pptxAkash527744
 
Data Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data QualityData Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data Qualitypriyanka rajput
 

Semelhante a BAS 250 Lecture 2 (20)

data wrangling (1).pptx kjhiukjhknjbnkjh
data wrangling (1).pptx kjhiukjhknjbnkjhdata wrangling (1).pptx kjhiukjhknjbnkjh
data wrangling (1).pptx kjhiukjhknjbnkjh
 
Metopen 6
Metopen 6Metopen 6
Metopen 6
 
Data Processing & Explain each term in details.pptx
Data Processing & Explain each term in details.pptxData Processing & Explain each term in details.pptx
Data Processing & Explain each term in details.pptx
 
ML-ChapterTwo-Data Preprocessing.ppt
ML-ChapterTwo-Data Preprocessing.pptML-ChapterTwo-Data Preprocessing.ppt
ML-ChapterTwo-Data Preprocessing.ppt
 
Chapter 3.pdf
Chapter 3.pdfChapter 3.pdf
Chapter 3.pdf
 
Data preparation and processing chapter 2
Data preparation and processing chapter  2Data preparation and processing chapter  2
Data preparation and processing chapter 2
 
Business analyst
Business analystBusiness analyst
Business analyst
 
Top 30 Data Analyst Interview Questions.pdf
Top 30 Data Analyst Interview Questions.pdfTop 30 Data Analyst Interview Questions.pdf
Top 30 Data Analyst Interview Questions.pdf
 
Data Science in Python.pptx
Data Science in Python.pptxData Science in Python.pptx
Data Science in Python.pptx
 
Pandas Data Cleaning and Preprocessing PPT.pptx
Pandas Data Cleaning and Preprocessing PPT.pptxPandas Data Cleaning and Preprocessing PPT.pptx
Pandas Data Cleaning and Preprocessing PPT.pptx
 
Data Quality: principles, approaches, and best practices
Data Quality: principles, approaches, and best practicesData Quality: principles, approaches, and best practices
Data Quality: principles, approaches, and best practices
 
Data processing
Data processingData processing
Data processing
 
Barga Data Science lecture 2
Barga Data Science lecture 2Barga Data Science lecture 2
Barga Data Science lecture 2
 
4 Data preparation and processing
4  Data preparation and processing4  Data preparation and processing
4 Data preparation and processing
 
Knowledge discovery claudiad amato
Knowledge discovery claudiad amatoKnowledge discovery claudiad amato
Knowledge discovery claudiad amato
 
Data Preparation.pptx
Data Preparation.pptxData Preparation.pptx
Data Preparation.pptx
 
KNOLX_Data_preprocessing
KNOLX_Data_preprocessingKNOLX_Data_preprocessing
KNOLX_Data_preprocessing
 
Mba ii rm unit-4.1 data analysis & presentation a
Mba ii rm unit-4.1 data analysis & presentation aMba ii rm unit-4.1 data analysis & presentation a
Mba ii rm unit-4.1 data analysis & presentation a
 
BDA TAE 2 (BMEB 83).pptx
BDA TAE 2 (BMEB 83).pptxBDA TAE 2 (BMEB 83).pptx
BDA TAE 2 (BMEB 83).pptx
 
Data Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data QualityData Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data Quality
 

Mais de Wake Tech BAS

BAS 150 Lesson 8 Lecture
BAS 150 Lesson 8 LectureBAS 150 Lesson 8 Lecture
BAS 150 Lesson 8 LectureWake Tech BAS
 
BAS 150 Lesson 7 Lecture
BAS 150 Lesson 7 LectureBAS 150 Lesson 7 Lecture
BAS 150 Lesson 7 LectureWake Tech BAS
 
BAS 150 Lesson 6 Lecture
BAS 150 Lesson 6 LectureBAS 150 Lesson 6 Lecture
BAS 150 Lesson 6 LectureWake Tech BAS
 
BAS 150 Lesson 5 Lecture
BAS 150 Lesson 5 LectureBAS 150 Lesson 5 Lecture
BAS 150 Lesson 5 LectureWake Tech BAS
 
BAS 150 Lesson 4 Lecture
BAS 150 Lesson 4 LectureBAS 150 Lesson 4 Lecture
BAS 150 Lesson 4 LectureWake Tech BAS
 
BAS 150 Lesson 3 Lecture
BAS 150 Lesson 3 LectureBAS 150 Lesson 3 Lecture
BAS 150 Lesson 3 LectureWake Tech BAS
 

Mais de Wake Tech BAS (9)

BAS 250 Lecture 5
BAS 250 Lecture 5BAS 250 Lecture 5
BAS 250 Lecture 5
 
BAS 250 Lecture 4
BAS 250 Lecture 4BAS 250 Lecture 4
BAS 250 Lecture 4
 
BAS 250 Lecture 3
BAS 250 Lecture 3BAS 250 Lecture 3
BAS 250 Lecture 3
 
BAS 150 Lesson 8 Lecture
BAS 150 Lesson 8 LectureBAS 150 Lesson 8 Lecture
BAS 150 Lesson 8 Lecture
 
BAS 150 Lesson 7 Lecture
BAS 150 Lesson 7 LectureBAS 150 Lesson 7 Lecture
BAS 150 Lesson 7 Lecture
 
BAS 150 Lesson 6 Lecture
BAS 150 Lesson 6 LectureBAS 150 Lesson 6 Lecture
BAS 150 Lesson 6 Lecture
 
BAS 150 Lesson 5 Lecture
BAS 150 Lesson 5 LectureBAS 150 Lesson 5 Lecture
BAS 150 Lesson 5 Lecture
 
BAS 150 Lesson 4 Lecture
BAS 150 Lesson 4 LectureBAS 150 Lesson 4 Lecture
BAS 150 Lesson 4 Lecture
 
BAS 150 Lesson 3 Lecture
BAS 150 Lesson 3 LectureBAS 150 Lesson 3 Lecture
BAS 150 Lesson 3 Lecture
 

Último

Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.MateoGardella
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfSanaAli374401
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxVishalSingh1417
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...KokoStevan
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingTeacherCyreneCayanan
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterMateoGardella
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin ClassesCeline George
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docxPoojaSen20
 

Último (20)

Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 

BAS 250 Lecture 2

  • 1. BAS 250 Lesson 2: Data Preparation
  • 2.  Explain concepts and purpose of Data Preparation  Understand solutions for handling missing and inconsistent data  Utilize data and attribute reduction techniques  Effectively work in RapidMiner to prepare your data. This Week’s Learning Objectives
  • 3. The Data Mining Process: CRISP-DM
  • 4. o Join data sets that are needed for your analysis o Reduce data sets to only include pertinent variables o Scrub data to remove anomalies- outliers or missing data o Reformat for consistency and effective use 3. Data Preparation
  • 5.  Ensure robustness of data o Combine more 2 or more data sets to create a “mini – database” with all variables needed for analysis in one place. o Merge by a unique identifier common to both data sets  “Key Identifier”, “Common ID”, “ID Number”, etc.  Example: Social Security Number (links Medical and Insurance) Data Preparation
  • 6. Data Preparation Example: Sources of Data Customer Purchases - “Point of Sale data” – CSV file format Cost of Products Sold – “Accounting department” – Excel file format Inventory of Products - “ IT Data Warehouse” - XML file format Merge By Product ID or SKU
  • 7.  Data Reduction…two part o Observations (rows, instances, etc) o Attributes (variables, records, columns, etc) Data Preparation
  • 8.  Attribute reduction to filter out irrelevant or uninteresting data without completely removing them from the original set.  Even if a variable isn’t interesting for answering some questions, it may still be useful in others. It is recommended to import all attributes first, then filter as necessary Data Preparation
  • 9.  Observation Reduction…  Observation reduction is to reduce the # of observations to create a smaller data set.  Some reasons to do so: o Create a sample set for:  Training data, proof of concept analysis, testing theories, sharing data o Improve analysis speed or process time o Data scrubbing for outliers, missing values, etc. Data Preparation
  • 10.  Ensure consistency of data o Missing information o Spelling errors, typos o Multiple responses for an attribute o Characters in numeric fields and vice-versa Data Preparation
  • 11.  Ensure consistency of data Data Preparation KEY: Missing data is data that does not exist in a data set • Not the same as zero or some other value • In a dataset, it is blank and the value is unknown • Sometimes referred to as null values • Depending on your objective and the circumstance, you may choose to leave missing data as they are or replace with some other value
  • 12.  Ensure consistency of data Data Preparation KEY: Inconsistent data is different from missing data • Occurs when a value does exist but its value is not valid or meaningful. • Common = “.” or “zero”
  • 13.  Ensure consistency of data Data Preparation Replace or remove missing or inconsistent data • For numeric data… • Can be replaced using Measures of Central Tendency • Mean, Median, and Mode • Mean - Average value • Median - Middle value • Mode - Most frequent or common value
  • 14.  Ensure consistency of data Data Preparation Replace or remove missing or inconsistent data • For character data… • Can be replaced using Best Estimated Value • “Like Others” • Ex. All males in data like bass fishing. If attribute “Fish Type” is blank and attribute “Gender” equals male, then “Bass” • “Clustering Techniques” • “Best Guess”
  • 15.  Ensure consistency of data Data Preparation • Replacing missing or inconsistent values found in data should be done: • With intention, not haphazardly • Use common sense • Be transparent It is recommended to always document your missing or consistent data processes.
  • 16.  This course is a practical application course in Data Mining. Learning to use RapidMiner is required.  If you have not done so yet, please plan to walk through the tutorial examples in RapidMiner.  To assist you in understanding RapidMiner, I will take screenshots of what I am doing to get the results we are looking for.  RapidMiner is pretty intuitive. You will get it quickly. Basics of RapidMiner
  • 17.  Types of files that can be imported into RapidMiner: o CSV File o Excel File o XML File o Access Database Table o … and much more  We use mainly CSV files which contain Comma Separated Values- be mindful if your dataset contains commas o Alternative delimiters can be selected in this case:  Tab  Semicolon  Pipe ( l ), etc. Basics of RapidMiner
  • 18.  Three main areas that contain useful tools in RapidMiner: o Operators – Every possible task you can think of o Repositories – Where you store your data o Parameters – Task set up details Basics of RapidMiner
  • 26.  Explain concepts and purpose of Data Preparation  Understand solutions for handling missing and inconsistent data  Utilize data and attribute reduction techniques  Effectively work in RapidMiner to prepare your data. Summary
  • 27. “This workforce solution was funded by a grant awarded by the U.S. Department of Labor’s Employment and Training Administration. The solution was created by the grantee and does not necessarily reflect the official position of the U.S. Department of Labor. The Department of Labor makes no guarantees, warranties, or assurances of any kind, express or implied, with respect to such information, including any information on linked sites and including, but not limited to, accuracy of the information or its completeness, timeliness, usefulness, adequacy, continued availability, or ownership.” Except where otherwise stated, this work by Wake Technical Community College Building Capacity in Business Analytics, a Department of Labor, TAACCCT funded project, is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ Copyright Information