Dw-dm-part-01
nash512
Ware house
1.
Data Warehousing and Data Mining
©ArunPhadke, 2015-16
2.
Introduction Outline
• Define data mining
• Data mining vs. databases
• Basic data mining tasks
• Data mining development
• Data mining issues
3.
Introduction
• Data is growing at a phenomenal rate
• Users expect more sophisticated information
• How?
  – Uncover hidden information
  – Data mining
4.
Data Mining Definition
• Finding hidden information in a database
• Fit data to a model
• Similar terms
  – Exploratory data analysis
  – Data driven discovery
  – Deductive learning
5.
Data Mining Algorithm
• Objective: Fit Data to a Model
  – Descriptive
  – Predictive
• Preference
  – Technique to choose the best model
• Search
  – Technique to search the data
  – “Query”
6.
Database Processing vs. Data Mining Processing
Database processing:
• Query
  – Well defined
  – SQL
• Data
  – Operational data
• Output
  – Precise
  – Subset of database
Data mining processing:
• Query
  – Poorly defined
  – No precise query language
• Data
  – Not operational data
• Output
  – Fuzzy
  – Not a subset of database
7.
Query Examples
• Database
  – Find all credit applicants with last name of Smith
  – Identify customers who have purchased more than $10,000 in the last month
  – Find all customers who have purchased milk
• Data Mining
  – Find all credit applicants who are poor credit risks. (classification)
  – Identify customers with similar buying habits. (clustering)
  – Find all items which are frequently purchased with milk. (association rules)
8.
Data Mining Models and Tasks
9.
Basic Data Mining Tasks
• Classification maps data into predefined groups or classes
  – Supervised learning
  – Pattern recognition
  – Prediction
• Regression is used to map a data item to a real valued prediction variable.
• Clustering groups similar data together into clusters.
  – Unsupervised learning
  – Segmentation
  – Partitioning
10.
Basic Data Mining Tasks (cont’d)
• Summarization maps data into subsets with associated simple descriptions.
  – Characterization
  – Generalization
• Link Analysis uncovers relationships among data.
  – Affinity Analysis
  – Association Rules
  – Sequential Analysis determines sequential patterns.
11.
Ex: Time Series Analysis
• Example: Stock Market
• Predict future values
• Determine similar patterns over time
• Classify behavior
12.
Data Mining vs. KDD
• Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.
• Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.
13.
KDD Process
• Selection: Obtain data from various sources.
• Preprocessing: Cleanse data.
• Transformation: Convert to common format. Transform to new format.
• Data Mining: Obtain desired results.
• Interpretation/Evaluation: Present results to user in meaningful manner.
Modified from [FPSS96C]
14.
KDD Process Ex: Web Log
• Selection:
  – Select log data (dates and locations) to use
• Preprocessing:
  – Remove identifying URLs
  – Remove error logs
• Transformation:
  – Sessionize (sort and group)
• Data Mining:
  – Identify and count patterns
  – Construct data structure
• Interpretation/Evaluation:
  – Identify and display frequently accessed sequences.
• Potential User Applications:
  – Cache prediction
  – Personalization
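The web log steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log records, the 30-minute session timeout (`SESSION_GAP`) and the choice of consecutive page pairs as the "pattern" are all hypothetical and do not come from the slides.

```python
from collections import Counter
from itertools import groupby

# Hypothetical log records: (user, timestamp_seconds, url, status).
LOG = [
    ("u1", 0, "/home", 200),
    ("u1", 30, "/products", 200),
    ("u1", 60, "/cart", 200),
    ("u2", 10, "/home", 200),
    ("u2", 40, "/products", 200),
    ("u2", 50, "/missing", 404),
    ("u1", 5000, "/home", 200),
    ("u1", 5030, "/products", 200),
]

SESSION_GAP = 1800  # assumed inactivity timeout, in seconds


def preprocess(records):
    """Preprocessing: drop error entries (status >= 400)."""
    return [r for r in records if r[3] < 400]


def sessionize(records, gap=SESSION_GAP):
    """Transformation: sort by user and time, then split on long gaps."""
    sessions = []
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    for _user, hits in groupby(ordered, key=lambda r: r[0]):
        current, last_time = [], None
        for _u, ts, url, _status in hits:
            if last_time is not None and ts - last_time > gap:
                sessions.append(current)
                current = []
            current.append(url)
            last_time = ts
        sessions.append(current)
    return sessions


def count_pairs(sessions):
    """Data mining: count consecutive page pairs across all sessions."""
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts


sessions = sessionize(preprocess(LOG))
pair_counts = count_pairs(sessions)
```

The most frequent pairs in `pair_counts` are the kind of "frequently accessed sequences" that could feed cache prediction or personalization.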
15.
Data Mining Development
• Information Retrieval
  – Similarity Measures
  – Hierarchical Clustering
  – IR Systems
  – Imprecise Queries
  – Textual Data
  – Web Search Engines
• Statistics
  – Bayes Theorem
  – Regression Analysis
  – EM Algorithm
  – K-Means Clustering
  – Time Series Analysis
• Machine Learning
  – Neural Networks
  – Decision Tree Algorithms
• Algorithms
  – Algorithm Design Techniques
  – Algorithm Analysis
  – Data Structures
• Databases
  – Relational Data Model
  – SQL
  – Association Rule Algorithms
  – Data Warehousing
  – Scalability Techniques
16.
KDD Issues
• Human Interaction
• Overfitting
• Outliers
• Interpretation
• Visualization
• Large Datasets
• High Dimensionality
17.
KDD Issues (cont’d)
• Multimedia Data
• Missing Data
• Irrelevant Data
• Noisy Data
• Changing Data
• Integration
• Application
18.
Social Implications of DM
• Privacy
• Profiling
• Unauthorized use
19.
Data Mining Metrics
• Usefulness
• Return on Investment (ROI)
• Accuracy
• Space/Time
20.
Database Perspective on Data Mining
• Scalability
• Real World Data
• Updates
• Ease of Use
21.
Visualization Techniques
• Graphical
• Geometric
• Icon-based
• Pixel-based
• Hierarchical
• Hybrid
22.
Related Concepts Outline
Goal: Examine some areas which are related to data mining.
• Database/OLTP Systems
• Fuzzy Sets and Logic
• Information Retrieval (Web Search Engines)
• Dimensional Modeling
• Data Warehousing
• OLAP/DSS
• Statistics
• Machine Learning
• Pattern Matching
23.
DB & OLTP Systems
• Schema
  – (ID, Name, Address, Salary, JobNo)
• Data Model
  – ER
  – Relational
• Transaction
• Query: SELECT Name FROM T WHERE Salary > 100000
DM: Only imprecise queries
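As a small illustration of a precise database query, the slide's SELECT statement can be run against an in-memory SQLite table with the same schema. The rows and column types are made up for this sketch, and an ORDER BY clause is added only to make the result order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE T (ID INTEGER PRIMARY KEY, Name TEXT, Address TEXT, "
    "Salary INTEGER, JobNo INTEGER)"
)
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Ann", "1 Main St", 120000, 10),
        (2, "Bob", "2 Oak Ave", 95000, 20),
        (3, "Cho", "3 Elm Rd", 150000, 10),
    ],
)
# The query from the slide; the answer is an exact subset of the table.
rows = conn.execute(
    "SELECT Name FROM T WHERE Salary > 100000 ORDER BY Name"
).fetchall()
names = [name for (name,) in rows]
conn.close()
```

The result is a precise subset of the stored data, in contrast to the imprecise, model-based answers that data mining produces.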
24.
Fuzzy Sets and Logic
• Fuzzy Set: Set membership function is a real valued function with output in the range [0,1].
• f(x): Probability x is in F.
• 1-f(x): Probability x is not in F.
• EX:
  – T = {x | x is a person and x is tall}
  – Let f(x) be the probability that x is tall
  – Here f is the membership function
DM: Prediction and classification are fuzzy.
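A minimal sketch of the "tall" example, contrasting a crisp set with a fuzzy one. The linear ramp and the height thresholds (160 cm, 180 cm, 190 cm) are assumptions chosen for illustration; the slides do not specify a particular membership function.

```python
def tall_crisp(height_cm, threshold=180.0):
    """Crisp set: membership is either 0 or 1."""
    return 1.0 if height_cm >= threshold else 0.0


def tall_fuzzy(height_cm, low=160.0, high=190.0):
    """Fuzzy set: membership rises linearly from 0 at `low` to 1 at `high`."""
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    return (height_cm - low) / (high - low)


# A 175 cm person is "not tall" in the crisp set but half a member of the fuzzy one.
crisp_175 = tall_crisp(175.0)
fuzzy_175 = tall_fuzzy(175.0)
not_tall_175 = 1.0 - fuzzy_175
```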
25.
Fuzzy Sets
26.
Classification/Prediction is Fuzzy
[Figure: Accept/Reject decision versus loan amount, shown for a simple (crisp) boundary and for a fuzzy boundary]
27.
Data Warehouse
Data-warehouse-03.pptx
28.
Data Cube Technology
• Data Cube Computation: Preliminary Concepts
• Data Cube Computation Methods
• Processing Advanced Queries by Exploring Data Cube Technology
• Multidimensional Data Analysis in Cube Space
29.
Data Cube: Lattice of Cuboids
• 0-D (apex) cuboid: all
• 1-D cuboids: time; item; location; supplier
• 2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
• 3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
• 4-D (base) cuboid: time,item,location,supplier
30.
Data Cube: Lattice of Cuboids
• D0 cube (all): zero dimensions
• D1 cubes: one dimension, 4 cubes
  – Time
  – Item
  – Location
  – Supplier
31.
Data Cube: Lattice of Cuboids
• D2 cubes: two dimensions, 6 cubes
  – Time-Item
  – Time-Location
  – Time-Supplier
  – Item-Location
  – Item-Supplier
  – Location-Supplier
32.
Data Cube: Lattice of Cuboids
• D3 cubes: three dimensions, 4 cubes
  – Time-Item-Location
  – Time-Item-Supplier
  – Time-Location-Supplier
  – Item-Location-Supplier
• D4 cube: four dimensions, 1 cube
• Number of cubes = 2^(number of dimensions); here 1 + 4 + 6 + 4 + 1 = 2^4 = 16
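The lattice for the four dimensions can be enumerated directly, which confirms the per-level counts and the 2^n total. This is a small sketch; the helper name `cuboids_by_level` is introduced here only for illustration.

```python
from itertools import combinations

DIMENSIONS = ("time", "item", "location", "supplier")


def cuboids_by_level(dimensions):
    """Return {k: [k-dimensional cuboids]} for k = 0 .. len(dimensions)."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


lattice = cuboids_by_level(DIMENSIONS)
level_counts = [len(lattice[k]) for k in sorted(lattice)]
total = sum(level_counts)
```

Level 0 holds the single apex cuboid (the empty tuple, "all") and level 4 holds the base cuboid with all four dimensions.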
33.
Size of Cubes
• Four dimensions
  – Time: yyyymmdd with 10 years (10 x 365)
  – Item: 1000 items
  – Location: 20 locations
  – Supplier: 500 suppliers
• Maximum size = 10*365*1000*20*500 = 36,500,000,000
34.
How to improve performance
• Select the right cube
  – Time-Item: 10*365*1000 = 3,650,000
  – Time-Location: 10*365*20 = 73,000
  – Time-Supplier: 10*365*500 = 1,825,000
  – Item-Location: 1000*20 = 20,000
  – Item-Supplier: 1000*500 = 500,000
  – Location-Supplier: 20*500 = 10,000
  – Time-Item-Location: 10*365*1000*20 = 73,000,000
  – Time-Item-Supplier: 10*365*1000*500 = 1,825,000,000
  – Time-Location-Supplier: 10*365*20*500 = 36,500,000
  – Item-Location-Supplier: 1000*20*500 = 10,000,000
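The cuboid sizes listed above follow mechanically from the dimension cardinalities. A short sketch that recomputes them, using the cardinalities from the slides (the names `CARDINALITY` and `max_cells` are illustrative):

```python
from itertools import combinations
from math import prod

# Dimension cardinalities as given on the slides.
CARDINALITY = {
    "time": 10 * 365,
    "item": 1000,
    "location": 20,
    "supplier": 500,
}


def max_cells(dims):
    """Upper bound on the number of cells in the cuboid over `dims`."""
    return prod(CARDINALITY[d] for d in dims)


# Maximum size of every non-apex cuboid, keyed by its dimension tuple.
sizes = {
    dims: max_cells(dims)
    for k in range(1, len(CARDINALITY) + 1)
    for dims in combinations(CARDINALITY, k)
}
smallest_2d = min((d for d in sizes if len(d) == 2), key=sizes.get)
```

A query that only needs location and supplier can be answered from a cuboid of at most 10,000 cells instead of the 36.5 billion cells of the base cuboid.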
35.
Materialization – Pre-computing
• On-line analytical processing may need to access different cuboids for different queries
• Compute some cuboids in advance
  – Pre-computation leads to fast response times
  – Most products support pre-computation to some degree
36.
Materialization – Pre-computing
• Storage space may explode...
  – If there are no hierarchies, the total number of cuboids for an n-dimensional cube is 2^n
• But....
  – Many dimensions may have hierarchies, for example time
    • day < week < month < quarter < year
  – Explosion of cuboids
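The effect of hierarchies on the cuboid count can be made concrete with the standard counting argument: a dimension with L levels can appear in a cuboid at any of its L levels or be aggregated away entirely, so it contributes a factor of L + 1. The formula is not stated on the slides and is added here as an illustration, treating the time levels as a single chain as the slide does.

```python
from math import prod


def total_cuboids(levels_per_dimension):
    """Number of cuboids when dimension i has levels_per_dimension[i] levels.

    Each dimension contributes (levels + 1): one choice per level plus
    the fully aggregated "all" level.
    """
    return prod(levels + 1 for levels in levels_per_dimension)


# Four dimensions without hierarchies: one level each.
no_hierarchies = total_cuboids([1, 1, 1, 1])
# Same cube, but time has 5 levels (day, week, month, quarter, year).
with_time_hierarchy = total_cuboids([5, 1, 1, 1])
```

With one level per dimension the formula reduces to 2^n; adding the five-level time hierarchy alone triples the number of cuboids.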
37.
Efficient computation of Data Cubes
• Smallest-child: computing a cuboid from the smallest, previously computed cuboid
• Cache-results: caching results of a cuboid from which other cuboids are computed to reduce disk I/Os
• Amortize-scans: computing as many cuboids as possible at the same time to amortize disk reads
• Share-sorts: sharing sorting costs across multiple cuboids when a sort-based method is used
• Share-partitions: sharing the partitioning cost across multiple cuboids when hash-based algorithms are used
38.
Multi-way Array Aggregation
• Array-based “bottom-up” algorithm
• Using multi-dimensional chunks
• No direct tuple comparisons
• Simultaneous aggregation on multiple dimensions
• Intermediate aggregate values are re-used for computing ancestor cuboids
[Figure: lattice of cuboids all, A, B, C, AB, AC, BC, ABC]
39.
Multi-way Array Aggregation for Cube Computation (MOLAP)
• Partition arrays into chunks (a chunk is a small subcube which fits in memory).
• Compressed sparse array addressing: (chunk_id, offset)
• Compute aggregates in “multiway” by visiting cube cells in the order which minimizes the number of times each cell is visited, and reduces memory access and storage cost.
[Figure: 3-D array over dimensions A (a0 to a3), B (b0 to b3) and C (c0 to c3), partitioned into 64 numbered chunks]
40.
Multi-way Array Aggregation for Cube Computation (MOLAP)
[Figure: 3-D array over dimensions A (a0 to a3), B (b0 to b3) and C (c0 to c3), partitioned into 64 numbered chunks]
41.
Multi-way Array Aggregation for Cube Computation (MOLAP)
• Method: the planes should be sorted and computed according to their size in ascending order
  – Idea: keep the smallest plane in the main memory, fetch and compute only one chunk at a time for the largest plane
• Limitation of the method: it computes well only for a small number of dimensions
  – If there are a large number of dimensions, “top-down” computation and iceberg cube computation methods can be explored
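The core idea of multiway aggregation, reading each chunk once and updating several 2-D planes at the same time, can be sketched as follows. This is a simplified illustration: the array contents, the 4x4x4 shape and the chunk edge length of 2 are assumptions, and the sketch keeps all three planes fully in memory, so it does not model the chunk ordering or the compressed (chunk_id, offset) addressing described on the slides.

```python
from itertools import product

SIZE_A, SIZE_B, SIZE_C = 4, 4, 4
CHUNK = 2  # assumed chunk edge length; each chunk is a 2x2x2 subcube

# Toy base cuboid: cube[a][b][c] holds a made-up measure value.
cube = [
    [[a + 10 * b + 100 * c for c in range(SIZE_C)] for b in range(SIZE_B)]
    for a in range(SIZE_A)
]


def multiway_aggregate(cube):
    """Compute the AB, AC and BC planes while reading each chunk once."""
    ab = [[0] * SIZE_B for _ in range(SIZE_A)]
    ac = [[0] * SIZE_C for _ in range(SIZE_A)]
    bc = [[0] * SIZE_C for _ in range(SIZE_B)]
    chunk_starts = product(
        range(0, SIZE_A, CHUNK),
        range(0, SIZE_B, CHUNK),
        range(0, SIZE_C, CHUNK),
    )
    for a0, b0, c0 in chunk_starts:
        # Every cell of the chunk contributes to all three 2-D cuboids.
        for a, b, c in product(
            range(a0, a0 + CHUNK),
            range(b0, b0 + CHUNK),
            range(c0, c0 + CHUNK),
        ):
            value = cube[a][b][c]
            ab[a][b] += value
            ac[a][c] += value
            bc[b][c] += value
    return ab, ac, bc


ab, ac, bc = multiway_aggregate(cube)
```

In a real MOLAP engine the chunk visiting order would be chosen so that only the smallest plane stays resident in memory while the larger planes are held one chunk-sized slice at a time.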
42.
Iceberg Cube
• An Iceberg-Cube contains only those cells of the data cube that meet an aggregate condition
• It is called an Iceberg-Cube because it contains only some of the cells of the full cube, like the tip of an iceberg
43.
Iceberg Cube
• The purpose of the Iceberg-Cube is to identify and compute only those values that will most likely be required for decision support queries
44.
Iceberg Cube example
• Minimum support is 25% of the tuples, i.e. 2 tuples, and we want to create an Iceberg-Cube

Part | StoreLocation | Customer
P1 | Vancouver | Vance
P1 | Calgary | Bob
P1 | Toronto | Richard
P2 | Toronto | Allison
P2 | Toronto | Allison
P2 | Toronto | Tom
P2 | Ottawa | Allison
P3 | Montreal | Anne

Combination | Count
{P1, ANY, ANY} | 3
{P2, ANY, ANY} | 4
{ANY, Toronto, ANY} | 4
{ANY, ANY, Allison} | 3
{P2, Toronto, ANY} | 3
{P2, ANY, Allison} | 3
{ANY, Toronto, Allison} | 2
{P2, Toronto, Allison} | 2
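The combination counts in this example can be reproduced with a brute-force sketch that generalizes every tuple in all possible ways and keeps the cells meeting minimum support. This is only an illustration of what the Iceberg-Cube contains, not an efficient computation method; the apex cell {ANY, ANY, ANY} is left out because the slide's table omits it.

```python
from collections import Counter
from itertools import product

ROWS = [
    ("P1", "Vancouver", "Vance"),
    ("P1", "Calgary", "Bob"),
    ("P1", "Toronto", "Richard"),
    ("P2", "Toronto", "Allison"),
    ("P2", "Toronto", "Allison"),
    ("P2", "Toronto", "Tom"),
    ("P2", "Ottawa", "Allison"),
    ("P3", "Montreal", "Anne"),
]
ANY = "ANY"
MIN_SUPPORT = 2  # 25% of 8 tuples


def iceberg_cube(rows, min_support):
    """Count every generalization of every row, keep the frequent cells."""
    counts = Counter()
    for row in rows:
        # Each attribute is either kept or replaced by ANY.
        for cell in product(*[(value, ANY) for value in row]):
            counts[cell] += 1
    return {
        cell: n
        for cell, n in counts.items()
        if n >= min_support and cell != (ANY,) * len(cell)
    }


cube = iceberg_cube(ROWS, MIN_SUPPORT)
```

Because each row expands into 2^3 cells, this approach grows exponentially with the number of attributes, which is the motivation for the APRIORI and top-down methods that follow.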
45.
APRIORI algorithm
• The APRIORI algorithm uses candidate combinations to avoid counting every possible combination of attribute values.
• For a combination of attribute values to satisfy the minimum support requirement, all subsets of that combination must also satisfy minimum support.
46.
APRIORI algorithm
• The candidate combinations are found by combining only the frequent attribute value combinations that are already known
• All other possible combinations are automatically eliminated because not all of their subsets would satisfy the minimum support requirement
47.
APRIORI
• On the first pass over the data, the APRIORI algorithm determines that the single values shown in the second table are frequent

A | B | C | D
a1 | b1 | c3 | d1
a1 | b5 | c1 | d2
a1 | b2 | c5 | d2
a2 | b2 | c2 | d2
a2 | b2 | c2 | d4
a2 | b2 | c4 | d2
a2 | b3 | c2 | d3
a3 | b4 | c6 | d2

Combination | Count
{a1} | 3
{a2} | 4
{b2} | 4
{c2} | 3
{d2} | 5
48.
APRIORI
• On the second pass, the APRIORI algorithm counts the candidate pairs built from the frequent single values and keeps those that meet minimum support

Frequent single values:
Combination | Count
{a1} | 3
{a2} | 4
{b2} | 4
{c2} | 3
{d2} | 5

Candidate pairs:
{a1,b2}, {a1,c2}, {a1,d2}, {a2,b2}, {a2,c2}, {a2,d2}, {b2,c2}, {b2,d2}, {c2,d2}

Frequent pairs:
Combination | Count
{a1,d2} | 2
{a2,b2} | 3
{a2,c2} | 3
{b2,c2} | 2
{a2,d2} | 2
{b2,d2} | 3
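The level-wise passes described above can be sketched in Python on the slide's table. Assumptions in this sketch: the minimum support of 2 is inferred from the counts shown, and, unlike the slide, the join step also generates pairs of values from the same attribute (such as {a1, a2}); those never co-occur in a row, so they are discarded when counted.

```python
from itertools import combinations

ROWS = [
    ("a1", "b1", "c3", "d1"),
    ("a1", "b5", "c1", "d2"),
    ("a1", "b2", "c5", "d2"),
    ("a2", "b2", "c2", "d2"),
    ("a2", "b2", "c2", "d4"),
    ("a2", "b2", "c4", "d2"),
    ("a2", "b3", "c2", "d3"),
    ("a3", "b4", "c6", "d2"),
]
MIN_SUPPORT = 2  # assumed threshold; it reproduces the counts on the slides


def support(itemset, rows):
    """Number of rows that contain every value in `itemset`."""
    return sum(1 for row in rows if itemset <= set(row))


def apriori(rows, min_support):
    """Return {frozenset: count} for all frequent value combinations."""
    items = sorted({value for row in rows for value in row})
    level = {}
    for item in items:
        n = support({item}, rows)
        if n >= min_support:
            level[frozenset([item])] = n
    frequent = dict(level)
    k = 2
    while level:
        # Join step: merge frequent (k-1)-sets whose union has k values.
        candidates = {
            a | b for a, b in combinations(level, 2) if len(a | b) == k
        }
        next_level = {}
        for cand in candidates:
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(s) in level for s in combinations(cand, k - 1)):
                n = support(cand, rows)
                if n >= min_support:
                    next_level[cand] = n
        frequent.update(next_level)
        level = next_level
        k += 1
    return frequent


frequent = apriori(ROWS, MIN_SUPPORT)
pairs = {tuple(sorted(s)): n for s, n in frequent.items() if len(s) == 2}
```

The prune step is where APRIORI saves work: a candidate such as {a2, c2, d2} is never counted because its subset {c2, d2} is already known to be infrequent.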
49.
Top-Down Computation
• The algorithm begins by computing the frequent attribute value combinations for the attribute set at the top of the tree, in this case ABCD.
50.
Top-Down Computation
• On the same pass over the data, tdC counts value combinations for ABCD, ABC, AB and A, adding the frequent ones to the Iceberg-Cube
51.
Top-Down Computation
Ordered by A,B,C,D:

A | B | C | D
a1 | b1 | c3 | d1
a1 | b2 | c5 | d2
a1 | b5 | c1 | d2
a2 | b2 | c2 | d2
a2 | b2 | c2 | d4
a2 | b2 | c4 | d2
a2 | b3 | c2 | d3
a3 | b4 | c6 | d2

Iceberg-Cube of ABCD:

Combination | Count
{a1} | 3
{a2,b2,c2} | 2
{a2,b2} | 3
{a2} | 4
52.
Top-Down Computation
Ordered by A,B,D:

A | B | D
a1 | b1 | d1
a1 | b2 | d2
a1 | b5 | d2
a2 | b2 | d2
a2 | b2 | d2
a2 | b2 | d4
a2 | b3 | d3
a3 | b4 | d2

Iceberg-Cube of ABD:

Combination | Count
{a1} | 3
{a2,b2,d2} | 2
{a2,b2} | 3
{a2} | 4
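The two passes shown for ABCD and ABD can be imitated with a short sketch that, for a given attribute ordering, counts every prefix of each projected row in a single pass. This is a simplification: it uses a hash-based counter rather than relying on sorted order as the slides' tables suggest, and the function name `prefix_counts` and the minimum support of 2 are assumptions for illustration.

```python
from collections import Counter

ROWS = [
    ("a1", "b1", "c3", "d1"),
    ("a1", "b5", "c1", "d2"),
    ("a1", "b2", "c5", "d2"),
    ("a2", "b2", "c2", "d2"),
    ("a2", "b2", "c2", "d4"),
    ("a2", "b2", "c4", "d2"),
    ("a2", "b3", "c2", "d3"),
    ("a3", "b4", "c6", "d2"),
]
MIN_SUPPORT = 2  # assumed threshold, matching the counts on the slides


def prefix_counts(rows, order, min_support):
    """Count every prefix of the attribute ordering in one pass over rows."""
    counts = Counter()
    for row in rows:
        projected = tuple(row[i] for i in order)
        for length in range(1, len(projected) + 1):
            counts[projected[:length]] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}


# One pass for ABCD covers ABCD, ABC, AB and A.
abcd = prefix_counts(ROWS, (0, 1, 2, 3), MIN_SUPPORT)
# A second pass for ABD covers ABD, AB and A.
abd = prefix_counts(ROWS, (0, 1, 3), MIN_SUPPORT)
```

Repeating this for the remaining attribute orderings in the tree and merging the results would give the complete Iceberg-Cube.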
53.
Top-Down Computation
Final Iceberg-Cube:

Combination | Count
{a1} | 3
{a2} | 4
{a2,b2} | 3
{a2,b2,c2} | 2
{a2,b2,d2} | 2
{a2,c2} | 3
{a1,d2} | 2
{a2,d2} | 2
{b2,c2} | 2
{b2} | 4
{b2,d2} | 3
{c2} | 3
{d2} | 5
54.
Bit-Map Indexes
• New indexing techniques: bitmap indexes, join indexes, array representations, compression, pre-computation of aggregations, etc.
• Bit vector for each possible value

Cust ID | Name | Sex | Rating | Sex bits (M F) | Rating bits (1 to 5)
112 | Joe | M | 3 | 10 | 00100
115 | Ram | M | 5 | 10 | 00001
119 | Sue | F | 5 | 01 | 00001
112 | Woo | M | 4 | 10 | 00010
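A minimal sketch of a bitmap index over the customer table above, storing one bit vector per distinct value as a Python integer (bit i is set when row i has that value). The function names are illustrative, and the example query, males with rating 5, is an assumption added to show how bit vectors combine with a bitwise AND.

```python
# Rows as on the slide: (cust_id, name, sex, rating).
CUSTOMERS = [
    (112, "Joe", "M", 3),
    (115, "Ram", "M", 5),
    (119, "Sue", "F", 5),
    (112, "Woo", "M", 4),
]


def build_bitmap_index(rows, column):
    """Map each distinct value to an int whose bit i is set if row i has it."""
    index = {}
    for position, row in enumerate(rows):
        value = row[column]
        index[value] = index.get(value, 0) | (1 << position)
    return index


def matching_rows(rows, bits):
    """Return the rows whose position bit is set in `bits`."""
    return [row for position, row in enumerate(rows) if (bits >> position) & 1]


sex_index = build_bitmap_index(CUSTOMERS, 2)
rating_index = build_bitmap_index(CUSTOMERS, 3)
# Males with rating 5: bitwise AND of two bit vectors.
hits = matching_rows(CUSTOMERS, sex_index["M"] & rating_index[5])
```

The slide lays the bits out per row (for example "10" for a male customer), while this sketch stores them per value; both carry the same information.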
55.
OLAP Vendor list
• IBM
• Infor
• Oracle OLAP
• SAS
• SAP BW
• Microsoft (SQL Server OLAP)
• MicroStrategy