2. • RapidMiner
– [Plain text URL:
https://rapidminer.com/products/studio/]
• RapidMiner Studio: 6.5 or greater
Software
Please follow link above to download the
free software.
3. • Define the discipline of Data Mining
• List and define various types of data
• List and define various sources of data
• Explain the fundamental differences
between databases, data warehouses,
and data sets
Learning Objectives (1 of 2)
4. • Explain some of the ethical dilemmas
associated with data mining and outline
possible solutions
• Explain the CRISP-DM Method
Learning Objectives (2 of 2)
6. • 15 out of 17 sectors in the United States have more data
stored per company than the US Library of Congress
• $5 million vs. $400: Price of the fastest supercomputer in
1975 and an iPhone with equal performance
• $600 to buy a disk drive that can store all of the world’s music
• 5 billion mobile phones in use in 2010
• 30 billion pieces of content shared on Facebook every month
• 40% projected growth in global data generated per year vs.
5% growth in IT spending
• 235 terabytes of data collected by the US Library of Congress
by April 2011
Why Data Mining?
7. Why Mine Data?
Lots of data is being collected and stored
Web data, e-commerce, point of sale
Credit card transactions, social media
Computers have become cheaper and more powerful
Competitive Pressure is Strong
Provide better, customized services for an edge
e.g. in Customer Relationship Management
Information is valuable and can be monetized
8. • Demand for deep analytical talent in the
United States could be 50-60% greater
than its projected supply by 2018
Demand for Data Mining
9. • Data contains value and knowledge, but to
extract the knowledge, data needs to be
– Stored
– Managed
– Analyzed (the focus of this class)
• Data Mining ≈ Big Data ≈
Predictive Analytics ≈ Data Science
Why Data Mining?
10. • “An interdisciplinary subfield of computer
science. It is the computational process of
discovering patterns in large data sets
involving methods at the intersection of
artificial intelligence, machine learning,
statistics, and database systems. The overall
goal of data mining is to extract information
from a data set and transform it into an
understandable structure for further use.”
– (Wikipedia) [Plain text URL:
https://en.wikipedia.org/wiki/Data_mining]
What is Data Mining? (1 of 4)
11. • Data Enormity Issue
• Discover patterns and models that are:
• Valid: patterns hold on new data with some certainty
• Useful: should be able to act on the insight
• Unexpected: non-obvious to the system
• Understandable: humans can interpret the patterns
What is Data Mining? (2 of 4)
12. • Descriptive methods
– Find patterns that describe the data
• Example: Clustering with k-means
• Predictive methods
– Use known variables to predict unknown or
future values of a target variable
• Example: Scoring with neural networks
What is Data Mining? (3 of 4)
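The descriptive example above, clustering with k-means, can be sketched in a few lines of plain Python. This is a minimal illustration rather than what RapidMiner does internally: the sample points, the starting centroids, and the function name k_means are all made up for the example. Note that no target variable is involved; the algorithm only describes structure already present in the data.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then move each centroid
    to the mean of its assigned points, until nothing changes."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                # Mean of each coordinate across the cluster's points
                new_centroids.append(
                    tuple(sum(axis) / len(cluster) for axis in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two visibly separate groups of 2-D points
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

With these inputs the first three points end up in one cluster and the last three in the other. The result of k-means generally depends on the starting centroids, which is why they are passed in explicitly here.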
13. What is Data Mining? (4 of 4)
Data Mining
• Prevalence of names in
US locations
• O’Brien, O’Rourke,
O’Reilly in Boston Area
• Group together similar
documents
• Returned by search
engine according to
their context
Not Data Mining
• Look up phone number
in a phone directory
• Query a web search
engine for information
about “Amazon”
16. • Scalability
• Dimensionality
• Complex and Heterogeneous Data
• Data Quality
• Data Ownership and Distribution
• Privacy Preservation
• Streaming Data
Challenges of Data Mining
17. • Garbage In, Garbage Out (GIGO)
– Collected incorrectly
– Out-of-date
• Day-to-day:
– Use available resources
– Acceptable risk
– Professional experience
– Common sense
Limits to Data Mining
18. • A risk with “Data mining” is that an analyst can
“discover” patterns that are meaningless.
• Statisticians call it Bonferroni’s principle:
– “If you look in more places for interesting patterns
than your amount of data will support, you are bound
to find crap”
Meaningfulness of Analytic Answers (1 of 2)
19. Meaningfulness of Analytic Answers (2 of 2)
National Security Agency example:
“We consider it suspicious when a pair of (unrelated) people stays at least twice
in the same hotel on the same day”
◦ Suppose 1 billion people tracked during 1,000 days
◦ Each person stays in a hotel 1% of the time (1 day out of 100)
◦ Each hotel holds 100 people (so need 100,000 hotels)
“If everyone behaves randomly (i.e. no terrorist), can we still detect something
suspicious?”
• Probability that a specific pair of people visit same hotel on same day is 10^-9
• Probability this happens twice is 10^-18 (really, really, really small)
Expected number of “suspicious” pairs is actually about 250,000!
Example taken from Rajaraman et al., Mining of Massive Datasets
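The arithmetic behind the hotel example can be checked directly. The sketch below uses only the numbers stated on the slide; the variable names are chosen for illustration. The key step is that the tiny per-pair probability gets multiplied by an enormous number of (pair of people, pair of days) combinations.

```python
from math import comb

people = 10**9        # people tracked
days = 1_000          # days observed
hotels = 100_000      # hotels, each holding 100 people
p_in_hotel = 0.01     # chance a given person is in some hotel on a given day

# Two specific people are both in a hotel that day (0.01 * 0.01)
# and happen to pick the same one out of 100,000 hotels.
p_same_hotel_same_day = p_in_hotel * p_in_hotel / hotels   # about 10^-9
p_twice = p_same_hotel_same_day ** 2                       # about 10^-18

pairs_of_people = comb(people, 2)   # about 5 * 10^17
pairs_of_days = comb(days, 2)       # about 5 * 10^5

expected_suspicious_pairs = pairs_of_people * pairs_of_days * p_twice
print(round(expected_suspicious_pairs))
```

The printed value is close to 250,000, matching the slide: even with purely random behavior, hundreds of thousands of pairs would look "suspicious".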
20. • To mine different types of data:
– Data is highly dimensional
– Data is a graph
– Data is infinite / never-ending
– Data is labeled
What will we learn? (1 of 4)
21. • To solve real-world problems:
– Market basket analysis
– Customer segmentation
– Forecasting new product demand
– Evaluating athletic talent
– Probabilities of a health risk
– Text sentiment analysis
What will we learn? (2 of 4)
22. • Use of various “tools”:
– Association Rules
– Clustering with K-means
– Logistic and Linear regression
– Decision Trees
– Neural Networks
– Text Mining
What will we learn? (3 of 4)
23. • Regression
• Decision Trees
• Cluster Analysis
• Text Mining
• Ensemble Models
• Neural Nets
• Association Rules
What will we learn? (4 of 4)
25. • Consider the real people behind the data
• Ethical and moral obligations
• Protect against crimes including identity
theft
• Objectives should never justify unethical
means
Privacy & Security
26. • Things to consider in data mining efforts:
– Protection of privacy
– Respect for individual rights
– Willingness to embrace transparency of
actions and methods
– Ask for permission to gather and use data
– Ensure you are doing fair and just work that
will help and benefit others
Privacy & Security (1 of 2)
27. • We can protect privacy by:
– Aggregating data
– Anonymizing observations through removal of
names and personally identifiable information
(PII)
– Storing data in secure and protected
environments
Privacy & Security (2 of 2)
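Two of the protections listed above, removing PII and aggregating, can be sketched as follows. The field names (name, email, ssn, zip_code, purchase) and the records are hypothetical, invented only for this illustration. Dropping direct identifiers is a first step, not a guarantee of anonymity, since combinations of the remaining fields can sometimes still identify a person.

```python
from collections import defaultdict

# Hypothetical set of fields treated as personally identifiable information
PII_FIELDS = {"name", "email", "ssn"}

def anonymize(record):
    """Return a copy of the record with the PII fields removed."""
    return {k: v for k, v in record.items() if k not in PII_FIELDS}

def aggregate_mean(records, group_key, value_key):
    """Report only the per-group average instead of individual rows."""
    totals = defaultdict(lambda: [0.0, 0])   # group -> [sum, count]
    for r in records:
        entry = totals[r[group_key]]
        entry[0] += r[value_key]
        entry[1] += 1
    return {group: s / n for group, (s, n) in totals.items()}

# Made-up observations for illustration only
records = [
    {"name": "A. Smith", "email": "a@example.com", "ssn": "000-00-0001",
     "zip_code": "27603", "purchase": 20.0},
    {"name": "B. Jones", "email": "b@example.com", "ssn": "000-00-0002",
     "zip_code": "27603", "purchase": 40.0},
    {"name": "C. Lee", "email": "c@example.com", "ssn": "000-00-0003",
     "zip_code": "27511", "purchase": 10.0},
]

anonymized = [anonymize(r) for r in records]
averages = aggregate_mean(anonymized, "zip_code", "purchase")
print(anonymized[0])
print(averages)
```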
29. • Organized grouping of information within a specific structure
• Table - a database container made up of rows and columns
• Relational databases more common today
– Relate tables to one another in a logical fashion
– Tables are broken apart to reduce redundancy through normalization
Database
30. • Handles high volume of reads and writes
• Not efficient for analysis due to lengthy
retrieval of data
– Must use a query containing joins
– Intensive and time consuming
Online Transactional Processing (OLTP)
31. • Denormalized to intentionally combine
multiple tables into a single table
– Results in duplicate data in some columns
– Reduces number of joins necessary to query
related data
– Online Analytical Processing (OLAP)
Data Warehouse (1 of 2)
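The contrast between normalized transactional tables and a denormalized warehouse table can be illustrated with Python's built-in sqlite3 module. This is a toy sketch: the table names (customers, orders, orders_wide) and the rows are assumptions made for the example, and SQLite stands in for whatever database system is actually in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized (OLTP style): customer details are stored once
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    amount REAL);
INSERT INTO customers VALUES (1, 'Ada', 'Raleigh'), (2, 'Ben', 'Durham');
INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);

-- Denormalized (warehouse style): one wide table, customer columns
-- repeated on every order row
CREATE TABLE orders_wide AS
SELECT o.order_id, o.amount, c.name, c.city
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id;
""")

# Analysis against the normalized tables needs a join
joined = conn.execute("""
    SELECT c.city, SUM(o.amount)
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.city ORDER BY c.city
""").fetchall()

# The same question against the denormalized table needs no join
wide = conn.execute("""
    SELECT city, SUM(amount)
    FROM orders_wide
    GROUP BY city ORDER BY city
""").fetchall()

print(joined)
print(wide)
```

Both queries return the same totals per city. The wide table stores 'Ada' and 'Raleigh' twice, which is the duplicate data the slide mentions, in exchange for simpler analytical queries.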
32. Data Warehouse (2 of 2)
• Contain archived data copied from transactional database
– Can become out-of-sync if source data is updated
• Can contain data moved from transactional system
– Data may be unavailable for updates or viewing
33. • Organizational data store created to
meet the needs of a specific
business unit
• One-stop shop
• Must be known, current, accurate, and
well-managed (privacy and security)
Data Mart
34. • Subset of a database or data warehouse
• Usually denormalized
• Typically related to a specific:
– Business question
– Business problem
– Business unit
Data Set
35. • Database
– Rows = Records
– Columns = Fields
• RapidMiner
– Rows = Examples
• Data Warehouses and Data Sets
– Rows = Observations, Examples, or Cases
– Columns = Variables or Attributes
Rows and Columns
36. The Data Mining Process
• For this course, we will channel every
homework assignment through the
CRISP-DM process.
38. – Define the questions you want to answer.
– Who will you work with to understand the
issue?
– Design what you are going to build.
– Get buy-in of the problem to be solved
1. Business Understanding
39. – What is the source of the data?
– How was it collected?
– How accurate or reliable is it?
– What are the correct variables to collect?
2. Data Understanding
40. – Join necessary data sets
– Reduce data sets to only include pertinent
variables
– Scrub data to remove anomalies such as outliers or
missing data
– Reformat for consistency
3. Data Preparation
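The four preparation steps above (join, reduce, scrub, reformat) can be sketched in plain Python. Everything here is illustrative: the field names, the sample rows, and the max_amount cutoff used as a crude outlier rule are assumptions, and in the course these steps would be carried out with RapidMiner operators instead.

```python
def prepare(customers, purchases, max_amount=10_000):
    """Join purchases to customers, keep only the pertinent variables,
    drop rows with missing or out-of-range values, and reformat."""
    by_id = {c["customer_id"]: c for c in customers}
    prepared = []
    for p in purchases:
        customer = by_id.get(p["customer_id"])        # join on customer_id
        if customer is None or p["amount"] is None:   # scrub: missing data
            continue
        if not 0 <= p["amount"] <= max_amount:        # scrub: crude outlier rule
            continue
        prepared.append({                             # reduce to two variables
            "state": customer["state"].strip().upper(),  # reformat consistently
            "amount": float(p["amount"]),
        })
    return prepared

# Made-up input data for illustration
customers = [
    {"customer_id": 1, "name": "Ada", "state": " nc"},
    {"customer_id": 2, "name": "Ben", "state": "Va "},
]
purchases = [
    {"customer_id": 1, "amount": 25},
    {"customer_id": 1, "amount": None},        # missing value
    {"customer_id": 2, "amount": 9_999_999},   # implausible outlier
    {"customer_id": 2, "amount": 40.5},
    {"customer_id": 3, "amount": 12},          # no matching customer
]

data_set = prepare(customers, purchases)
print(data_set)
```

Three of the five purchase rows are discarded, and the two that remain carry only the state and amount variables in a consistent format.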
41. – Two types:
• Classification (Descriptive)
• Prediction
– Can be overlapping (Decision Trees)
– Note: We will spend most of our time in this
step
4. Modeling
42. – Is the insight useful?
• Should another technique be used?
– What can be done with the results?
– Testing for false positives
– Human experience and operational
knowledge
5. Evaluation
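Testing for false positives usually starts by comparing a model's predictions with known outcomes. The sketch below counts the four possible results for a yes/no model and derives the false positive rate; the label lists are invented for illustration, and the function names are not taken from any particular tool.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 or 0)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p and a:
            counts["tp"] += 1      # predicted yes, actually yes
        elif p and not a:
            counts["fp"] += 1      # predicted yes, actually no
        elif a:
            counts["fn"] += 1      # predicted no, actually yes
        else:
            counts["tn"] += 1      # predicted no, actually no
    return counts

def false_positive_rate(counts):
    """Share of actual negatives that the model wrongly flagged."""
    negatives = counts["fp"] + counts["tn"]
    return counts["fp"] / negatives if negatives else 0.0

# Hypothetical known outcomes and model predictions
actual    = [1, 0, 0, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]

counts = confusion_counts(actual, predicted)
print(counts)                       # {'tp': 2, 'fp': 1, 'tn': 4, 'fn': 1}
print(false_positive_rate(counts))  # 0.2
```

Whether a false positive rate of 0.2 is acceptable is not something the numbers settle on their own; that is where the human experience and operational knowledge listed on the slide come in.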
43. – Automation of model
– Communication with end-users
– Integration with existing systems
– Continuous monitoring and gaining feedback
for improvement (fine-tuning)
6. Deployment
44. • Clearly communicate model’s:
– Function
– Utility to stakeholders
• Thoroughly test and prove the model
• Plan for and monitor implementation
Keys to Successful Deployment
46. • Data mining uses statistical and logical methods of analysis
to describe large data sets and create predictive models that
uncover insights
• Databases, data warehouses, and data sets are unique kinds
of digital record keeping systems with some similarities
• Data mining is most effective on data sets extracted from
OLAP rather than OLTP
• Data is highly dimensional and has inherent risks, such as
quality
• Remember the human factor behind the numbers and figures:
ethical responsibilities apply
• CRISP-DM is the most used standard method for analysis
Summary
47. “This workforce solution was funded by a grant awarded by the U.S. Department of
Labor’s Employment and Training Administration. The solution was created by the
grantee and does not necessarily reflect the official position of the U.S. Department of
Labor. The Department of Labor makes no guarantees, warranties, or assurances of any
kind, express or implied, with respect to such information, including any information on
linked sites and including, but not limited to, accuracy of the information or its
completeness, timeliness, usefulness, adequacy, continued availability, or ownership.”
Except where otherwise stated, this work by Wake Technical Community College Building
Capacity in Business Analytics, a Department of Labor, TAACCCT funded project, is
licensed under the Creative Commons Attribution 4.0 International License. To view a
copy of this license, visit http://creativecommons.org/licenses/by/4.0/
Copyright Information