BAS 250
Lesson 1: Introduction to Data Mining
• RapidMiner
– https://rapidminer.com/products/studio/
• RapidMiner Studio: 6.5 or greater
Software
Please follow link above to download the
free software.
• Define the discipline of Data Mining
• List and define various types of data
• List and define various sources of data
• Explain the fundamental differences
between databases, data warehouses,
and data sets
Learning Objectives (1 of 2)
• Explain some of the ethical dilemmas
associated with data mining and outline
possible solutions
• Explain the CRISP-DM Method
Learning Objectives (2 of 2)
Data Mining
• 15 out of 17 sectors in the United States have more data
stored per company than the US Library of Congress
• $5 million vs. $400: Price of the fastest supercomputer in
1975 and an iPhone with equal performance
• $600 to buy a disk drive that can store all of the world’s music
• 5 billion mobile phones in use in 2010
• 30 billion pieces of content shared on Facebook every month
• 40% projected growth in global data generated per year vs.
5% growth in IT spending
• 235 terabytes of data collected by the US Library of Congress
by April 2011
Why Data Mining?
Why Mine Data?
• Lots of data is being collected and stored
– Web data, e-commerce, point of sale
– Credit card transactions, social media
• Computers have become cheaper and more powerful
• Competitive pressure is strong
– Provide better, customized services for an edge
– e.g. in Customer Relationship Management
• Information is valuable and can be monetized
• Demand for deep analytical talent in the
United States could be 50-60% greater
than its projected supply by 2018
Demand for Data Mining
• Data contains value and knowledge, but to
extract the knowledge, data needs to be
– Stored
– Managed
– Analyzed (the focus of this class)
• Data Mining ≈ Big Data ≈
Predictive Analytics ≈ Data Science
Why Data Mining?
• “An interdisciplinary subfield of computer
science. It is the computational process of
discovering patterns in large data sets
involving methods at the intersection of
artificial intelligence, machine learning,
statistics, and database systems. The overall
goal of data mining is to extract information
from a data set and transform it into an
understandable structure for further use.”
– Wikipedia, https://en.wikipedia.org/wiki/Data_mining
What is Data Mining? (1 of 4)
• Data Enormity Issue
• Discover patterns and models that are:
• Valid: patterns hold on new data with some certainty
• Useful: should be able to act on the insight
• Unexpected: non-obvious to the system
• Understandable: humans can interpret the patterns
What is Data Mining? (2 of 4)
• Descriptive methods
– Find patterns that describe the data
• Example: Clustering with k-means
• Predictive methods
– Use some variables to predict unknown or
future values of other (target) variables
• Example: Scoring with neural networks
What is Data Mining? (3 of 4)
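The descriptive example named above, clustering with k-means, can be sketched in a few lines. This is a minimal illustration, not RapidMiner's implementation: the customer data is hypothetical, and seeding the centroids from evenly spaced points is an assumption made only to keep the run reproducible.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    # Assumption for reproducibility: evenly spaced points seed the centroids.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers: (visits per month, average basket in dollars).
customers = [(1, 20), (2, 25), (1, 22), (9, 80), (10, 85), (8, 78)]
centroids, clusters = k_means(customers, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

No target variable is involved: the algorithm only groups similar rows, which is what makes it descriptive rather than predictive.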
What is Data Mining? (4 of 4)
Data Mining
• Prevalence of names in
US locations
• O’Brien, O’Rourke,
O’Reilly in Boston Area
• Group together similar
documents
• Returned by search
engine according to
their context
Not Data Mining
• Look up phone number
in a phone directory
• Query a web search
engine for information
about “Amazon”
Origins of Data Mining
Data is Highly Dimensional
• Scalability
• Dimensionality
• Complex and Heterogeneous Data
• Data Quality
• Data Ownership and Distribution
• Privacy Preservation
• Streaming Data
Challenges of Data Mining
• Garbage In, Garbage Out (GIGO)
– Collected incorrectly
– Out-of-date
• Day-to-day:
– Use available resources
– Acceptable risk
– Professional experience
– Common sense
Limits to Data Mining
• A risk with “Data mining” is that an analyst can
“discover” patterns that are meaningless.
• Statisticians call it Bonferroni’s principle:
– “If you look in more places for interesting patterns
than your amount of data will support, you are bound
to find crap”
Meaningfulness of Analytic Answers (1 of 2)
Meaningfulness of Analytic Answers (2 of 2)
National Security Agency example:
“We consider suspicious when a pair of (unrelated) people stayed at least twice
in the same hotel on the same day”
◦ Suppose 1 billion people tracked during 1,000 days
◦ Each person stays in a hotel 1% of the time (1 day out of 100)
◦ Each hotel holds 100 people (so need 100,000 hotels)
“If everyone behaves randomly (i.e. no terrorist), can we still detect something
suspicious?”
• Probability that a specific pair of people visit the same hotel on the same day is 10^-9
• Probability this happens twice is 10^-18 (really, really, really small)
→ Expected number of “suspicious” pairs is actually about 250,000!
Example taken from Rajaraman et al., Mining of Massive Datasets
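The arithmetic behind the 250,000 figure can be checked directly. The sketch below uses only the numbers stated in the example; the variable names are illustrative.

```python
from math import comb

people = 10**9       # people tracked
days = 1_000         # days of observation
hotels = 10**5       # hotels, each holding 100 people
p_in_hotel = 0.01    # chance a given person is in some hotel on a given day

# Two specific people are both in a hotel on a given day (p squared),
# and it happens to be the same one of the hotels (1 / hotels).
p_same_hotel_same_day = p_in_hotel**2 / hotels      # about 1e-9

# The same coincidence on two specific, distinct days.
p_twice = p_same_hotel_same_day**2                  # about 1e-18

pairs_of_people = comb(people, 2)   # about 5e17
pairs_of_days = comb(days, 2)       # about 5e5

expected_suspicious_pairs = pairs_of_people * pairs_of_days * p_twice
print(f"{expected_suspicious_pairs:,.0f}")
```

The probability for any one pair is tiny, but there are so many pairs of people and pairs of days to check that purely random behavior still produces roughly a quarter of a million "suspicious" pairs.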
• To mine different types of data:
– Data is highly dimensional
– Data is a graph
– Data is infinite / never-ending
– Data is labeled
What will we learn? (1 of 4)
• To solve real-world problems:
– Market basket analysis
– Customer segmentation
– Forecasting new product demand
– Evaluating athletic talent
– Probabilities of a health risk
– Text sentiment analysis
What will we learn? (2 of 4)
• Use of various “tools”:
– Association Rules
– Clustering with K-means
– Logistic and Linear regression
– Decision Trees
– Neural Networks
– Text Mining
What will we learn? (3 of 4)
• Regression
• Decision Trees
• Cluster Analysis
• Text Mining
• Ensemble Models
• Neural Nets
• Association Rules
What will we learn? (4 of 4)
Privacy & Security
• Consider the real people behind the data
• Ethical and moral obligations
• Protect against crimes including identity
theft
• Objectives should never justify unethical
means
Privacy & Security
• Things to consider in data mining efforts:
– Protection of privacy
– Respect for individual rights
– Willingness to embrace transparency of
actions and methods
– Ask for permission to gather and use data
– Ensure you are doing fair and just work that
will help and benefit others
Privacy & Security (1 of 2)
• We can protect privacy by:
– Aggregating data
– Anonymizing observations through removal of
names and personally identifiable information
(PII)
– Storing data in secure and protected
environments
Privacy & Security (2 of 2)
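Two of the protections listed above, removing PII and aggregating, can be sketched as follows. The records, field names, and the choice of which fields count as PII are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records; field names are illustrative only.
records = [
    {"name": "A. Smith", "ssn": "000-00-0001", "zip": "27603", "purchase": 40.0},
    {"name": "B. Jones", "ssn": "000-00-0002", "zip": "27603", "purchase": 60.0},
    {"name": "C. Lee", "ssn": "000-00-0003", "zip": "27529", "purchase": 25.0},
]
PII_FIELDS = {"name", "ssn"}  # assumed list of identifying fields

def anonymize(record):
    """Return a copy of the record without the PII fields."""
    return {key: value for key, value in record.items() if key not in PII_FIELDS}

def aggregate(rows, group_key, value_key):
    """Report one average per group instead of individual observations."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_key]].append(row[value_key])
    return {group: sum(values) / len(values) for group, values in grouped.items()}

anonymous = [anonymize(r) for r in records]
avg_purchase_by_zip = aggregate(anonymous, "zip", "purchase")
print(avg_purchase_by_zip)
```

A group with very few members, such as the single record in the second ZIP code here, can still point to one person, so aggregation works best when each group is reasonably large.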
Database, Data Warehouse,
Data Mart, Data Set
• Organized grouping of information within a specific structure
• Table: a database container made up of rows and columns
• Relational databases more common today
– Relate tables to one another in a logical fashion
– Tables are broken apart to reduce redundancy through normalization
Database
• Handles high volume of reads and writes
• Not efficient for analysis due to lengthy
retrieval of data
– Must use a query containing joins
– Intensive and time consuming
Online Transactional Processing (OLTP)
• Denormalized to intentionally combine
multiple tables into a single table
– Results in duplicate data in some columns
– Reduces number of joins necessary to query
related data
– Online Analytical Processing (OLAP)
Data Warehouse (1 of 2)
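The trade-off between normalized and denormalized storage can be illustrated with Python's built-in sqlite3 module. This is a toy sketch: the table and column names are hypothetical, and a real warehouse would be loaded by a separate process rather than built in the same database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized (OLTP-style): each fact is stored once.
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ana', 'Raleigh'), (2, 'Ben', 'Cary');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);

    -- Denormalized (warehouse-style): the join is done once, up front.
    CREATE TABLE order_facts AS
        SELECT o.order_id, c.name, c.city, o.amount
        FROM orders AS o JOIN customers AS c ON c.customer_id = o.customer_id;
""")

# Analysis against the normalized tables needs a join.
joined = conn.execute("""
    SELECT c.city, SUM(o.amount)
    FROM orders AS o JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.city ORDER BY c.city
""").fetchall()

# The denormalized table answers the same question without one.
flat = conn.execute(
    "SELECT city, SUM(amount) FROM order_facts GROUP BY city ORDER BY city"
).fetchall()

print(joined)
print(flat)
```

Both queries return the same totals. The cost of the flat table is the duplicate data noted above: the customer's name and city are repeated on every one of that customer's orders.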
Data Warehouse (2 of 2)
• Contain archived data copied from transactional database
– Can become out-of-sync if source data is updated
• Can contain data moved from transactional system
– Data may be unavailable for updates or viewing
• Organizational data store created in
conjunction to meet needs of specific
business unit
• One-stop shop
• Must be known, current, accurate, and
well-managed (privacy and security)
Data Mart
• Subset of a database or data warehouse
• Usually denormalized
• Typically related to a specific:
– Business question
– Business problem
– Business unit
Data Set
• Database
– Rows = Records
– Columns = Fields
• RapidMiner
– Rows = Examples
• Data Warehouses and Data Sets
– Rows = Observations, Examples, or Cases
– Columns = Variables or Attributes
Rows and Columns
The Data Mining Process
• For this course, we will channel every
homework assignment through the
CRISP-DM process.
CRISP-DM
– Define the questions you want to answer.
– Who will you work with to understand the
issue?
– Design what you are going to build.
– Get buy-in of the problem to be solved
1. Business Understanding
– What is the source of the data?
– How was it collected?
– How accurate or reliable is it?
– What are the correct variables to collect?
2. Data Understanding
– Join necessary data sets
– Reduce data sets to only include pertinent
variables
– Scrub data to remove anomalies such as outliers or
missing data
– Reformat for consistency
3. Data Preparation
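The four preparation tasks above can be sketched on a tiny example. The data, the field names, and the outlier threshold are hypothetical; in the course itself these steps would be carried out with RapidMiner operators rather than hand-written code.

```python
# Hypothetical raw inputs; names and values are illustrative only.
customers = {1: {"state": "nc"}, 2: {"state": "NC "}, 3: {"state": "Va"}}
orders = [
    {"customer_id": 1, "amount": 30.0, "clerk": "x"},
    {"customer_id": 2, "amount": None, "clerk": "y"},     # missing value
    {"customer_id": 3, "amount": 45.0, "clerk": "z"},
    {"customer_id": 1, "amount": 9_000.0, "clerk": "x"},  # implausible outlier
]
MAX_PLAUSIBLE_AMOUNT = 1_000.0  # assumed business rule for what counts as an outlier

prepared = []
for order in orders:
    amount = order["amount"]
    # Scrub: skip rows with missing or implausible amounts.
    if amount is None or amount > MAX_PLAUSIBLE_AMOUNT:
        continue
    # Join: look up the matching customer record.
    customer = customers[order["customer_id"]]
    # Reduce and reformat: keep only pertinent variables, with consistent state codes.
    prepared.append({
        "customer_id": order["customer_id"],
        "state": customer["state"].strip().upper(),
        "amount": amount,
    })

print(prepared)
```

Dropping rows is only one way to handle missing values and outliers; whether to drop, replace, or keep them depends on the business question.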
– Two types:
• Classification (Descriptive)
• Prediction
– Can be overlapping (Decision Trees)
– Note: We will spend most of our time in this
step
4. Modeling
– Is the insight useful?
• Should another technique be used?
– What can be done with the results?
– Testing for false positives
– Human experience and operational
knowledge
5. Evaluation
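Testing for false positives usually starts by comparing a model's predictions with known outcomes. The sketch below uses made-up labels and computes two common summary measures; neither the data nor the choice of measures comes from the course material.

```python
# Hypothetical labels: 1 = event occurred, 0 = it did not.
actual = [1, 0, 0, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]

pairs = list(zip(actual, predicted))
true_pos = sum(1 for a, p in pairs if a == 1 and p == 1)
false_pos = sum(1 for a, p in pairs if a == 0 and p == 1)
true_neg = sum(1 for a, p in pairs if a == 0 and p == 0)
false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)

# Share of real negatives that the model wrongly flagged.
false_positive_rate = false_pos / (false_pos + true_neg)
# Share of the model's positive calls that were correct.
precision = true_pos / (true_pos + false_pos)

print(true_pos, false_pos, true_neg, false_neg)
print(false_positive_rate, precision)
```

Whether a given false positive rate is acceptable is a judgment that depends on the human experience and operational knowledge mentioned above.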
– Automation of model
– Communication with end-users
– Integration with existing systems
– Continuous monitoring and gaining feedback
for improvement (fine-tuning)
6. Deployment
• Clearly communicate model’s:
– Function
– Utility to stakeholders
• Thoroughly test and prove the model
• Plan for and monitor implementation
Keys to Successful Deployment
Summary
• Data mining applies statistical and logical methods of analysis
to describe large data sets and create predictive models that
uncover insights
• Databases, data warehouses, and data sets are unique kinds
of digital record keeping systems with some similarities
• Data mining is most effective on data sets extracted from
OLAP rather than OLTP
• Data is highly dimensional and has inherent risks, such as
quality
• Remember the human factor behind the manipulation of numbers
and figures: ethical responsibilities
• CRISP-DM is the most widely used standard method for analysis
Summary
“This workforce solution was funded by a grant awarded by the U.S. Department of
Labor’s Employment and Training Administration. The solution was created by the
grantee and does not necessarily reflect the official position of the U.S. Department of
Labor. The Department of Labor makes no guarantees, warranties, or assurances of any
kind, express or implied, with respect to such information, including any information on
linked sites and including, but not limited to, accuracy of the information or its
completeness, timeliness, usefulness, adequacy, continued availability, or ownership.”
Except where otherwise stated, this work by Wake Technical Community College Building
Capacity in Business Analytics, a Department of Labor, TAACCCT funded project, is
licensed under the Creative Commons Attribution 4.0 International License. To view a
copy of this license, visit http://creativecommons.org/licenses/by/4.0/
Copyright Information

BAS 250 Lecture 1

  • 1. BAS 250 Lesson 1: Introduction to Data Mining
  • 2. • Rapid Miner – [Plain text URL: https://rapidminer.com/products/studio/] • RapidMiner Studio: 6.5 or greater Software Please follow link above to download the free software.
  • 3. • Define the discipline of Data Mining • List and define various types of data • List and define various sources of data • Explain the fundamental differences between databases, data warehouses, and data sets Learning Objectives (1 of 2)
  • 4. • Explain some of the ethical dilemmas associated with data mining and outline possible solutions • Explain the CRISP-DM Method Learning Objectives (2 of 2)
  • 6. • 15 out of 17 sectors in the United States have more data stored per company than the US Library of Congress • $5 million vs. $400: Price of the fastest supercomputer in 1975 and an iPhone with equal performance • $600 to buy a disk drive that can store all of the world’s music • 5 billion mobile phones in use in 2010 • 30 billion pieces of content shared on Facebook every month • 40% projected growth in global data generated per year vs. 5% growth in IT spending • 235 terabytes of data collected by the US Library of Congress by April 2011 Why Data Mining?
  • 7. Why Mine Data?  Lots of data is being collected and stored  Web data, e-commerce, point of sale  Credit card transactions, social media  Computers have become cheaper and more powerful  Competitive Pressure is Strong  Provide better, customized services for an edge  e.g. in Customer Relationship Management  Information is valuable and can be monetized
  • 8. • Demand for deep analytical talent in the United States could be 50-60% greater than its projected supply by 2018 Demand for Data Mining
  • 9. • Data contains value and knowledge, but to extract the knowledge, data needs to be – Stored – Managed – Analyzed  this class • Data Mining ≈ Big Data ≈ Predictive Analytics ≈ Data Science Why Data Mining?
  • 10. • “An interdisciplinary subfield of computer science. It is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of data mining is to extract information from a data set and transform it into an understandable structure for further use.” – (Wikipedia) [Plain text URL: https://en.wikipedia.org/wiki/Data_mining] What is Data Mining? (1 of 4)
  • 11. • Data Enormity Issue • Discover patterns and models that are: • Valid: data has some certainty • Useful: should be able to act on the insight • Unexpected: non-obvious to the system • Understandable: humans can interpret the patterns What is Data Mining? (2 of 4)
  • 12. • Descriptive methods – Find patterns that describe the data • Example: Clustering with k-means • Predictive methods – Use target variables to predict unknown or future values of other variables • Example: Scoring with neural networks What is Data Mining? (3 of 4)
  • 13. What is Data Mining? (4 of 4) Data Mining • Prevalence of names in US locations • O’Brien, O’Rurke, O’Reilly in Boston Area • Group together similar documents • Returned by search engine according to their context Not Data Mining • Look up phone number in a phone directory • Query a web search engine for information about “Amazon”
  • 14. Origins of Data Mining
  • 15. Data is Highly Dimensional
  • 16. • Scalability • Dimensionality • Complex and Heterogeneous Data • Data Quality • Data Ownership and Distribution • Privacy Preservation • Streaming Data Challenges of Data Mining
  • 17. • Garbage In, Garbage Out (GIGO) – Collected incorrectly – Out-of-date • Day-to-day: – Use available resources – Acceptable risk – Professional experience – Common sense Limits to Data Mining
  • 18. • A risk with “Data mining” is that an analyst can “discover” patterns that are meaningless. • Statisticians call it Bonferroni’s principle: – “If you look in more places for interesting patterns than your amount of data will support, you are bound to find crap” Meaningfulness of Analytic Answers (1 of 2) 18
  • 19. Meaningfulness of Analytic Answers (2 of 2) 19 National Security Agency example: “We consider suspicious when a pair of (unrelated) people stayed at least twice in the same hotel on the same day” ◦ Suppose 1 billion people tracked during 1,000 days ◦ Each person stays in a hotel 1% of the time (1 day out of 100) ◦ Each hotel holds 100 people (so need 100,000 hotels) “If everyone behaves randomly (i.e. no terrorist), can we still detect something suspicious?” • Probability that a specific pair of people visit same hotel on same day is 10-9 • Probability this happens twice is 10-18 (really, really, really small)  Expected number of “suspicious” pairs is actually about 250,000! Example taken from Rajamaran et al., Mining of Massive Datasets
  • 20. • To mine different types of data: – Data is highly dimensional – Data is a graph – Data is infinite / never-ending – Data is labeled What will we learn? (1 of 4) 20
  • 21. • To solve real-world problems: – Market basket analysis – Customer segmentation – Forecasting new product demand – Evaluating athletic talent – Probabilities of a health risk – Text sentiment analysis What will we learn? (2 of 4) 21
  • 22. • Use of various “tools”: – Association Rules – Clustering with K-means – Logistic and Linear regression – Decision Trees – Neural Networks – Text Mining What will we learn? (3 of 4) 22
  • 23. • Regression • Decision Trees • Cluster Analysis • Text Mining • Ensemble Models • Neural Nets • Association Rules What will we learn? (4 of 4)
  • 25. • Consider the real people behind the data • Ethical and moral obligations • Protect against crimes including identity theft • Objectives should never justify unethical means Privacy & Security
  • 26. • Things to consider in data mining efforts: – Protection of privacy – Respect for individual rights – Willingness to embrace transparency of actions and methods – Ask for permission to gather and use data – Ensure you are doing fair and just work that will help and benefit others Privacy & Security (1 of 2)
  • 27. • We can protect privacy by: – Aggregating data – Anonymizing observations through removal of names and personally identifiable information (PII) – Storing data in secure and protected environments Privacy & Security (2 of 2)
  • 29. • Organized grouping of information within a specific structure • Table - a database container made • Relational databases more common today – Relate tables to one another in a logical fashion – Tables are broken apart to reduce redundancy through normalization Database
  • 30. • Handles high volume of reads and writes • Not efficient for analysis due to lengthy retrieval of data – Must use a query containing joins – Intensive and time consuming Online Transactional Processing (OLTP)
  • 31. • Denormalized to intentionally combine multiple tables into a single table – Results in duplicate data in some columns – Reduces number of joins necessary to query related data – Online Analytical Processing (OLAP) Data Warehouse (1 of 2)
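A rough sketch of what denormalizing means in practice: two normalized tables are joined once, up front, into a single flat table. The tables and column names below are hypothetical and stand in for whatever the transactional database actually holds.

```python
# Normalized form: customer details are stored once, keyed by customer_id.
customers = {
    1: {"name": "Ann", "city": "Raleigh"},
    2: {"name": "Bo", "city": "Cary"},
}

# Orders refer to customers only by key.
orders = [
    {"order_id": 101, "customer_id": 1, "amount": 25.0},
    {"order_id": 102, "customer_id": 1, "amount": 40.0},
    {"order_id": 103, "customer_id": 2, "amount": 15.0},
]


def denormalize(order_rows, customer_table):
    """Join each order with its customer so later queries need no join."""
    flat = []
    for order in order_rows:
        customer = customer_table[order["customer_id"]]
        flat.append({**order, **customer})
    return flat


flat_orders = denormalize(orders, customers)
for row in flat_orders:
    print(row)
```

Ann's name and city now appear on two rows. That is the duplicate data the slide mentions, accepted in exchange for simpler analytical queries.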
  • 32. • Contain archived data copied from transactional database – Can become out-of-sync if source data is updated • Can contain data moved from transactional system – Data may be unavailable for updates or viewing Data Warehouse (2 of 2)
  • 33. • Organizational data store created in conjunction to meet needs of specific business unit • One-stop shop • Must be known, current, accurate, and well-managed (privacy and security) Data Mart
  • 34. • Subset of a database or data warehouse • Usually denormalized • Typically related to a specific: – Business question – Business problem – Business unit Data Set
  • 35. • Database – Rows = Records – Columns = Fields • RapidMiner – Rows = Examples • Data Warehouses and Data Sets – Rows = Observations, Examples, or Cases – Columns = Variables or Attributes Rows and Columns
  • 36. The Data Mining Process • For this course, we will channel every homework assignment through the CRISP-DM process.
  • 38. – Define the questions you want to answer. – Who will you work with to understand the issue? – Design what you are going to build. – Get buy-in on the problem to be solved. 1. Business Understanding
  • 39. – What is the source of the data? – How was it collected? – How accurate or reliable is it? – What are the correct variables to collect? 2. Data Understanding
  • 40. – Join necessary data sets – Reduce data sets to only include pertinent variables – Scrub data to remove anomalies such as outliers or missing data – Reformat for consistency 3. Data Preparation
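The reduce, scrub, and reformat steps can be sketched as a single pass over the rows. Everything specific here is an assumption made for illustration: the sample data, the list of columns to keep, and the income cutoff used to flag an outlier. Dropping rows with missing values is only one option; imputing them is another.

```python
# Hypothetical raw data with a missing value, an outlier, and messy text.
raw_rows = [
    {"id": 1, "state": " nc", "age": 34, "income": 52_000, "notes": "x"},
    {"id": 2, "state": "NC ", "age": None, "income": 48_000, "notes": "y"},
    {"id": 3, "state": "va", "age": 41, "income": 9_900_000, "notes": ""},
    {"id": 4, "state": "VA", "age": 29, "income": 61_000, "notes": "z"},
]

KEEP_COLUMNS = ("id", "state", "age", "income")  # assumed pertinent variables
MAX_INCOME = 1_000_000                           # assumed outlier cutoff


def prepare(rows):
    """Reduce, scrub, and reformat rows for modeling."""
    cleaned = []
    for row in rows:
        # Reduce: keep only the pertinent variables.
        reduced = {col: row[col] for col in KEEP_COLUMNS}
        # Scrub: skip rows with missing data.
        if any(value is None for value in reduced.values()):
            continue
        # Scrub: skip rows above the assumed outlier cutoff.
        if reduced["income"] > MAX_INCOME:
            continue
        # Reformat: make state codes consistent.
        reduced["state"] = reduced["state"].strip().upper()
        cleaned.append(reduced)
    return cleaned


prepared = prepare(raw_rows)
print(prepared)
```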
  • 41. – Two types: • Classification (Descriptive) • Prediction – Can be overlapping (Decision Trees) – Note: We will spend most of our time in this step 4. Modeling
  • 42. – Is the insight useful? • Should another technique be used? – What can be done with the results? – Testing for false positives – Human experience and operational knowledge 5. Evaluation
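Testing for false positives usually starts with counting how predictions compare with known outcomes. A small sketch with made-up labels (1 = positive, 0 = negative); the data and function name are hypothetical:

```python
# Hypothetical known outcomes and model predictions.
actual = [1, 0, 0, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]


def confusion_counts(actual_labels, predicted_labels):
    """Count true/false positives and negatives for binary labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, guess in zip(actual_labels, predicted_labels):
        if guess == 1 and truth == 1:
            counts["tp"] += 1
        elif guess == 1 and truth == 0:
            counts["fp"] += 1   # flagged, but actually negative
        elif guess == 0 and truth == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1   # missed a real positive
    return counts


counts = confusion_counts(actual, predicted)

# Share of real negatives that were wrongly flagged.
false_positive_rate = counts["fp"] / (counts["fp"] + counts["tn"])
# Share of flagged cases that were really positive.
precision = counts["tp"] / (counts["tp"] + counts["fp"])

print(counts, false_positive_rate, precision)
```

Whether a given false positive rate is acceptable is a judgment call that depends on the business problem, which is where the human experience and operational knowledge on the slide come in.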
  • 43. – Automation of model – Communication with end-users – Integration with existing systems – Continuous monitoring and gaining feedback for improvement (fine-tuning) 6. Deployment
  • 44. • Clearly communicate model’s: – Function – Utility to stakeholders • Thoroughly test and prove the model • Plan for and monitor implementation Keys to Successful Deployment
  • 46. • Data mining applies statistical and logical methods of analysis to describe large data sets and to create predictive models that uncover insights • Databases, data warehouses, and data sets are distinct kinds of digital record-keeping systems with some similarities • Data mining is most effective on data sets extracted from OLAP rather than OLTP systems • Data is high-dimensional and carries inherent risks, such as poor quality • Remember the human factor behind the manipulation of numbers and figures, and the ethical responsibilities that come with it • CRISP-DM is the most widely used standard method for analysis Summary
  • 47. “This workforce solution was funded by a grant awarded by the U.S. Department of Labor’s Employment and Training Administration. The solution was created by the grantee and does not necessarily reflect the official position of the U.S. Department of Labor. The Department of Labor makes no guarantees, warranties, or assurances of any kind, express or implied, with respect to such information, including any information on linked sites and including, but not limited to, accuracy of the information or its completeness, timeliness, usefulness, adequacy, continued availability, or ownership.” Except where otherwise stated, this work by Wake Technical Community College Building Capacity in Business Analytics, a Department of Labor, TAACCCT funded project, is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ Copyright Information