From Data to Wisdom
• Data
  - The raw material of information
• Information
  - Data organized and presented by someone
• Knowledge
  - Information read, heard, or seen, then understood and integrated
• Wisdom
  - Distilled knowledge and understanding that can lead to decisions
The Information Hierarchy (top to bottom): Wisdom, Knowledge, Information, Data
Why Data Mining?
• The explosive growth of data: from terabytes to petabytes
  - Data collection and data availability
    - Automated data collection tools, database systems, the Web, a computerized society
  - Major sources of abundant data
    - Business: Web, e-commerce, transactions, stocks, …
    - Science: remote sensing, bioinformatics, scientific simulation, …
    - Society and everyone: news, images, video, documents, Internet, …
[Figure omitted. Source: Intel]
How much data?
• Google: ~20-30 PB a day
• Wayback Machine: ~4 PB + 100-200 TB/month
• Facebook: ~3 PB of user data + 25 TB/day
• eBay: ~7 PB of user data + 50 TB/day
• CERN's Large Hadron Collider generates 15 PB a year
• In 2010, enterprises stored 7 exabytes = 7,000,000,000 GB
"640K ought to be enough for anybody."
Big Data Growing
• IDC predicts: from 2005 to 2020, the digital universe will double every 2 years and grow from 130 exabytes to 40,000 exabytes, or 5,200 GB per person in 2020.
• The untapped data gap: most of the useful data will not be tagged or analyzed, partly due to a skill shortage.
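As a quick sanity check on the IDC figures above, the sketch below works out the doubling time and the population implied by the quoted numbers. It assumes decimal units (1 EB = 1,000,000,000 GB), which matches the "7 exabytes = 7,000,000,000 GB" conversion used earlier in the deck; the variable names are illustrative only.

```python
import math

# Figures quoted on the slide (IDC forecast).
start_year, end_year = 2005, 2020
start_eb, end_eb = 130, 40_000      # size of the digital universe, in exabytes
gb_per_person = 5_200               # GB per person in 2020

# Assumption: decimal units, 1 EB = 10**9 GB.
GB_PER_EB = 1_000_000_000

growth_factor = end_eb / start_eb
doublings = math.log2(growth_factor)
doubling_time_years = (end_year - start_year) / doublings
implied_population = end_eb * GB_PER_EB / gb_per_person

print(f"growth factor: {growth_factor:.0f}x")
print(f"number of doublings: {doublings:.2f}")
print(f"implied doubling time: {doubling_time_years:.2f} years")
print(f"implied population: {implied_population / 1e9:.1f} billion")
```

The quoted endpoints imply a doubling time of roughly 1.8 years, so "double every 2 years" is a rounded figure, and the per-person number corresponds to a world population of about 7.7 billion.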
What Is Data Mining?
• We are drowning in data, but starving for knowledge!
• "Necessity is the mother of invention": data mining is the automated analysis of massive data sets
• Data Mining: A Definition
  The non-trivial extraction of implicit, previously unknown, and potentially useful knowledge from data in large data repositories
  - Non-trivial: obvious knowledge is not useful
  - Implicit: hidden, difficult-to-observe knowledge
  - Previously unknown
  - Potentially useful: actionable and easy to understand
Data Mining: Confluence of Multiple Disciplines
[Diagram: data mining at the center, drawing on machine learning, statistics, applications, algorithms, pattern recognition, high-performance computing, visualization, and database technology]
Data Mining's Virtuous Cycle
1. Identifying the problem
2. Mining data to transform it into actionable information
3. Acting on the information
4. Measuring the results
The Knowledge Discovery Process
• Data Mining vs. Knowledge Discovery in Databases (KDD)
  - DM and KDD are often used interchangeably
  - Strictly speaking, DM is only one part of the KDD process
[Figure: The KDD Process]
Types of Knowledge Discovery
• Two kinds of knowledge discovery: directed and undirected
• Directed Knowledge Discovery
  - Purpose: explain the value of some field in terms of all the others (goal-oriented)
  - Method: select the target field based on some hypothesis about the data; ask the algorithm to tell us how to predict or classify new instances
  - Examples:
    - which products show increased sales when cream cheese is discounted
    - which banner ad to use on a web page for a given user coming to the site
• Undirected Knowledge Discovery
  - Purpose: find patterns in the data that may be interesting (no target field)
  - Method: clustering, affinity grouping
  - Examples:
    - which products in the catalog often sell together
    - market segmentation (find groups of customers/users with similar characteristics or behavioral patterns)
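To make the directed/undirected distinction concrete, here is a minimal sketch on made-up customer records. The directed half picks a target field (`responded_to_offer`) and learns a trivial majority-vote rule to classify new instances; the undirected half has no target and simply splits customers into two spending segments with a tiny one-dimensional 2-means loop. The data, field names, and both helper functions are hypothetical, and real projects would typically use a library classifier and clustering routine instead.

```python
from collections import Counter, defaultdict

# Hypothetical customer records: (age_band, monthly_spend, responded_to_offer)
customers = [
    ("young", 20, "no"),
    ("young", 25, "no"),
    ("young", 30, "yes"),
    ("senior", 80, "yes"),
    ("senior", 90, "yes"),
    ("senior", 85, "no"),
    ("senior", 95, "yes"),
]

# Directed: the target field (responded_to_offer) is chosen up front.
def fit_majority_rule(rows):
    """Map each age_band to the most common target value seen for it."""
    counts = defaultdict(Counter)
    for age_band, _spend, target in rows:
        counts[age_band][target] += 1
    return {band: c.most_common(1)[0][0] for band, c in counts.items()}

rule = fit_majority_rule(customers)
prediction_for_new_senior = rule["senior"]

# Undirected: no target field; look for groups in monthly_spend.
def two_means(values, iterations=10):
    """Split numbers into two clusters around iteratively updated means."""
    low, high = min(values), max(values)  # deterministic starting centers
    for _ in range(iterations):
        low_group, high_group = [], []
        for v in values:
            if abs(v - high) < abs(v - low):
                high_group.append(v)
            else:
                low_group.append(v)
        low = sum(low_group) / len(low_group)
        high = sum(high_group) / len(high_group)
    return sorted(low_group), sorted(high_group)

segments = two_means([spend for _, spend, _ in customers])

print(rule)
print(segments)
```

The directed step answers a question posed in advance ("will this customer respond?"), while the undirected step only reports structure it finds; deciding whether the two spending segments are interesting is left to the analyst.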
From Data Mining to Data Science
Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
  - Relational databases, data warehouses, transactional databases
  - Object-relational databases, heterogeneous databases, and legacy databases
• Advanced data sets and advanced applications
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data (incl. bio-sequences)
  - Structured data, graphs, social networks, and information networks
  - Spatial data and spatiotemporal data
  - Multimedia databases
  - Text databases
  - The World Wide Web
Data Mining: What Kind of Data?
• Structured databases
  - Relational, object-relational, etc.
  - Can use SQL to perform parts of the process
    e.g., SELECT category, count(*) FROM Items WHERE type = 'video' GROUP BY category
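The query on this slide can be tried end to end with Python's built-in `sqlite3` module. The sketch below runs it with the string literal quoted and the grouping column selected so that each count is labeled. The `Items` table comes from the slide, but its columns beyond `type` and `category`, and all of the rows, are assumptions made up for illustration.

```python
import sqlite3

# In-memory database with a hypothetical Items table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Items (name TEXT, type TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO Items VALUES (?, ?, ?)",
    [
        ("Item A", "video", "comedy"),
        ("Item B", "video", "comedy"),
        ("Item C", "video", "drama"),
        ("Item D", "book", "drama"),
    ],
)

# Count video items per category; ORDER BY makes the output order stable.
rows = conn.execute(
    "SELECT category, count(*) FROM Items "
    "WHERE type = 'video' GROUP BY category ORDER BY category"
).fetchall()
conn.close()

print(rows)
```

Aggregations like this cover the counting and summarizing parts of the process; pattern discovery itself usually happens outside SQL.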
Data Mining: What Kind of Data?
• Flat files
  - Most common data source
  - Can be text (or HTML) or binary
  - May contain transactions, statistical data, measurements, etc.
• Transactional databases
  - Set of records, each with a transaction ID, a timestamp, and a set of items
  - May have an associated "description" file for the items
  - Typical source of data used in market basket analysis
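The transactional record layout described above (transaction ID, timestamp, set of items) is enough to sketch the two basic measures used in market basket analysis: support and confidence. The transactions below are invented for illustration, and `support` and `confidence` are hypothetical helper names, not functions from any particular library.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: (transaction id, timestamp, set of items)
transactions = [
    (1, "2024-01-05T09:12", {"bagels", "cream cheese", "coffee"}),
    (2, "2024-01-05T09:40", {"bagels", "cream cheese"}),
    (3, "2024-01-05T10:03", {"coffee", "milk"}),
    (4, "2024-01-05T10:30", {"bagels", "coffee"}),
    (5, "2024-01-05T11:15", {"bagels", "cream cheese", "milk"}),
]
n = len(transactions)

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for _tid, _timestamp, items in transactions:
    pair_counts.update(combinations(sorted(items), 2))

def support(*items):
    """Fraction of transactions containing every listed item."""
    wanted = set(items)
    return sum(wanted <= basket for _, _, basket in transactions) / n

def confidence(antecedent, consequent):
    """Estimated P(consequent in basket | antecedent in basket)."""
    return support(antecedent, consequent) / support(antecedent)

top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)
print(support("bagels", "cream cheese"))
print(confidence("cream cheese", "bagels"))
print(confidence("bagels", "cream cheese"))
```

Confidence is not symmetric: in this toy data every basket with cream cheese also has bagels, but only three of the four bagel baskets include cream cheese.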
Data Mining: What Kind of Data?
• Other types of databases
  - Legacy databases
  - Multimedia databases (usually very high-dimensional)
  - Spatial databases (containing geographical information, such as maps or satellite imaging data)
  - Time-series and temporal data (time-dependent information such as stock market data; usually very dynamic)
• World Wide Web
  - Basically a large, heterogeneous, distributed database
  - Needs new or additional tools and techniques
    - Information retrieval, filtering, and extraction
    - Agents to assist in browsing and filtering
    - Web content, usage, and structure (linkage) mining tools
• The "social Web"
  - User-generated metadata, social networks, shared resources, etc.
What Can Data Mining Do?
• Many data mining tasks
  - Often inter-related
  - Often need to try different techniques/algorithms for each task
  - Each task may require different types of knowledge discovery
• What are some data mining tasks?
  - Classification
  - Prediction
  - Clustering
  - Affinity grouping / association discovery
  - Sequence analysis
  - Characterization
  - Discrimination
Some Applications of Data Mining
• Business data analysis and decision support
  - Marketing focalization
    - Recognizing specific market segments that respond to particular characteristics
    - Return on mailing campaign (target marketing)
  - Customer profiling
    - Segmentation of customers for marketing strategies and/or product offerings
    - Customer behavior understanding
    - Customer retention and loyalty
    - Mass customization / personalization
Some Applications of Data Mining
• Business data analysis and decision support (cont.)
  - Market analysis and management
    - Provide summary information for decision-making
    - Market basket analysis, cross-selling, market segmentation
    - Resource planning
  - Risk analysis and management
    - "What if" analysis
    - Forecasting
    - Pricing analysis, competitive analysis
    - Time-series analysis (e.g., stock market)
Some Applications of Data Mining
• Fraud detection
  - Detecting telephone fraud
    - Telephone call model: destination of the call, duration, time of day or week
    - Analyze patterns that deviate from an expected norm
    - British Telecom identified discrete groups of callers with frequent intra-group calls, especially on mobile phones, and broke a multimillion-dollar fraud scheme
  - Detection of credit-card fraud
  - Detecting suspicious money transactions (money laundering)
• Text mining
  - Message filtering (e-mail, newsgroups, etc.)
  - Newspaper article analysis
  - Text and document categorization
• Web mining
  - Mining patterns from the content, usage, and structure of Web resources
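The fraud-detection idea of flagging "patterns that deviate from an expected norm" can be illustrated with a very simple rule over the telephone call model mentioned above (destination, duration, time of day). The sketch below is an assumption-laden toy, not the method British Telecom used: it flags a call when its duration is more than three sample standard deviations from the account's history, or when the destination has never been seen before. All records and the `is_suspicious` helper are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical call history for one account:
# (destination, duration_minutes, hour_of_day)
history = [
    ("local", 3.0, 10), ("local", 4.5, 14), ("local", 2.5, 18),
    ("local", 5.0, 11), ("local", 3.5, 16), ("local", 4.0, 9),
]
new_calls = [("local", 4.2, 13), ("international", 95.0, 3)]

# The "expected norm" here is just the mean and spread of past durations
# plus the set of destinations seen so far.
durations = [duration for _, duration, _ in history]
mu, sigma = mean(durations), stdev(durations)
known_destinations = {dest for dest, _, _ in history}

def is_suspicious(call, z_threshold=3.0):
    """Flag calls with an unusual duration or a never-seen destination."""
    dest, duration, _hour = call
    z_score = abs(duration - mu) / sigma
    return z_score > z_threshold or dest not in known_destinations

flags = [is_suspicious(call) for call in new_calls]
print(flags)
```

A production system would model many more features (including time of day, which this toy ignores) and would need to cope with norms that drift over time.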
Types of Web Mining
• Web content mining
• Web structure mining
• Web usage mining
Web content mining applications:
• document clustering or categorization
• topic identification / tracking
• concept discovery
• focused crawling
• content-based personalization
• intelligent search tools
Web usage mining applications:
• user and customer behavior modeling
• Web site optimization
• e-customer relationship management
• Web marketing
• targeted advertising
• recommender systems
Web structure mining applications:
• document retrieval and ranking (e.g., Google)
• discovery of "hubs" and "authorities"
• discovery of Web communities
• social network analysis
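The "hubs" and "authorities" mentioned above are usually associated with Kleinberg's HITS algorithm; the slides do not name a specific method, so treating it as HITS is an assumption. The sketch below runs a plain power-iteration version on a tiny made-up link graph: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The page names and the `hits` function are illustrative only.

```python
import math

# Hypothetical link graph: page -> pages it links to.
links = {
    "portal": ["docs", "blog", "shop"],
    "directory": ["docs", "blog"],
    "blog": ["docs"],
    "docs": [],
    "shop": [],
}

def hits(graph, iterations=50):
    """Return (hub, authority) score dicts after repeated HITS updates."""
    pages = list(graph)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum hub scores of pages that link in.
        auth = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub update: sum authority scores of pages linked to.
        hub = {p: sum(auth[dst] for dst in graph[p]) for p in pages}
        # Normalize both score vectors to unit length to keep them bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

hub, auth = hits(links)
best_hub = max(hub, key=hub.get)
best_authority = max(auth, key=auth.get)
print(best_hub, best_authority)
```

In this graph the page with the most outgoing links to well-cited pages ends up as the strongest hub, and the page that the good hubs all point to ends up as the strongest authority.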
The Knowledge Discovery Process
[Figure: The KDD Process]
• Next: we first focus on understanding the data and on data preparation/transformation

Data mining and knowledge discovery
