Data Warehousing
and
Data Mining
©ArunPhadke 2015-16
Introduction Outline
• Define data mining
• Data mining vs. databases
• Basic data mining tasks
• Data mining development
• Data mining issues
Introduction
• Data is growing at a phenomenal rate
• Users expect more sophisticated
information
• How?
UNCOVER HIDDEN INFORMATION
DATA MINING
Data Mining Definition
• Finding hidden information in a database
• Fit data to a model
• Similar terms
– Exploratory data analysis
– Data driven discovery
– Deductive learning
Data Mining Algorithm
• Objective: Fit Data to a Model
– Descriptive
– Predictive
• Preference – Technique to choose the
best model
• Search – Technique to search the data
– “Query”
Database Processing vs. Data Mining Processing
Database processing:
• Query
– Well defined
– SQL
• Data
– Operational data
• Output
– Precise
– Subset of database
Data mining:
• Query
– Poorly defined
– No precise query language
• Data
– Not operational data
• Output
– Fuzzy
– Not a subset of database
Query Examples
• Database
– Find all credit applicants with last name of Smith
– Identify customers who have purchased more than
$10,000 in the last month
– Find all customers who have purchased milk
• Data Mining
– Find all credit applicants who are poor credit risks.
(classification)
– Identify customers with similar buying habits.
(clustering)
– Find all items which are frequently purchased with
milk. (association rules)
Data Mining Models and Tasks
Basic Data Mining Tasks
• Classification maps data into predefined
groups or classes
– Supervised learning
– Pattern recognition
– Prediction
• Regression is used to map a data item to a
real valued prediction variable.
• Clustering groups similar data together into
clusters.
– Unsupervised learning
– Segmentation
– Partitioning
Basic Data Mining Tasks (cont’d)
• Summarization maps data into subsets with
associated simple descriptions.
– Characterization
– Generalization
• Link Analysis uncovers relationships among
data.
– Affinity Analysis
– Association Rules
– Sequential Analysis determines sequential
patterns.
Ex: Time Series Analysis
• Example: Stock Market
• Predict future values
• Determine similar patterns over time
• Classify behavior
Data Mining vs. KDD
• Knowledge Discovery in Databases
(KDD): process of finding useful
information and patterns in data.
• Data Mining: Use of algorithms to extract
the information and patterns derived by
the KDD process.
KDD Process
• Selection: Obtain data from various
sources.
• Preprocessing: Cleanse data.
• Transformation: Convert to common
format. Transform to new format.
• Data Mining: Obtain desired results.
• Interpretation/Evaluation: Present results
to user in meaningful manner.
Modified from [FPSS96C]
KDD Process Ex: Web Log
• Selection:
– Select log data (dates and locations) to use
• Preprocessing:
– Remove identifying URLs
– Remove error logs
• Transformation:
– Sessionize (sort and group)
• Data Mining:
– Identify and count patterns
– Construct data structure
• Interpretation/Evaluation:
– Identify and display frequently accessed sequences.
• Potential User Applications:
– Cache prediction
– Personalization
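The transformation, data mining and interpretation steps of the web log example can be sketched as follows. This is only an illustration: the record layout (user, timestamp, page), the 30-minute session timeout and the sample data are assumptions, not part of the slides.

```python
from collections import Counter, defaultdict

# Hypothetical, already-cleaned log records: (user, timestamp in seconds, page).
LOG = [
    ("u1", 0, "/home"), ("u1", 60, "/products"), ("u1", 120, "/cart"),
    ("u2", 30, "/home"), ("u2", 90, "/products"),
    ("u1", 5000, "/home"), ("u1", 5060, "/products"),
]

SESSION_TIMEOUT = 1800  # assumed inactivity gap (seconds) that ends a session


def sessionize(records, timeout=SESSION_TIMEOUT):
    """Transformation: sort by user and time, then group clicks into sessions."""
    by_user = defaultdict(list)
    for user, ts, page in sorted(records):
        by_user[user].append((ts, page))
    sessions = []
    for clicks in by_user.values():
        current = [clicks[0][1]]
        for (prev_ts, _), (ts, page) in zip(clicks, clicks[1:]):
            if ts - prev_ts > timeout:
                sessions.append(current)
                current = []
            current.append(page)
        sessions.append(current)
    return sessions


def count_pairs(sessions):
    """Data mining: count consecutive page pairs across all sessions."""
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts


sessions = sessionize(LOG)
pair_counts = count_pairs(sessions)
# Interpretation/evaluation: show the most frequently accessed sequences.
for pair, n in pair_counts.most_common(2):
    print(pair, n)
```

Counting only pairs of consecutive pages keeps the sketch short; longer sequences would need a richer data structure, as the "Construct data structure" step suggests.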
Data Mining Development
Information Retrieval
• Similarity Measures
• Hierarchical Clustering
• IR Systems
• Imprecise Queries
• Textual Data
• Web Search Engines
Statistics
• Bayes Theorem
• Regression Analysis
• EM Algorithm
• K-Means Clustering
• Time Series Analysis
Machine Learning
• Neural Networks
• Decision Tree Algorithms
Algorithms
• Algorithm Design Techniques
• Algorithm Analysis
• Data Structures
Databases
• Relational Data Model
• SQL
• Association Rule Algorithms
• Data Warehousing
• Scalability Techniques
KDD Issues
• Human Interaction
• Overfitting
• Outliers
• Interpretation
• Visualization
• Large Datasets
• High Dimensionality
KDD Issues (cont’d)
• Multimedia Data
• Missing Data
• Irrelevant Data
• Noisy Data
• Changing Data
• Integration
• Application
Social Implications of DM
• Privacy
• Profiling
• Unauthorized use
Data Mining Metrics
• Usefulness
• Return on Investment (ROI)
• Accuracy
• Space/Time
Database Perspective on Data Mining
• Scalability
• Real World Data
• Updates
• Ease of Use
Visualization Techniques
• Graphical
• Geometric
• Icon-based
• Pixel-based
• Hierarchical
• Hybrid
Related Concepts Outline
• Database/OLTP Systems
• Fuzzy Sets and Logic
• Information Retrieval (Web Search Engines)
• Dimensional Modeling
• Data Warehousing
• OLAP/DSS
• Statistics
• Machine Learning
• Pattern Matching
Goal: Examine some areas which are related to data mining.
DB & OLTP Systems
• Schema
– (ID,Name,Address,Salary,JobNo)
• Data Model
– ER
– Relational
• Transaction
• Query:
SELECT Name
FROM T
WHERE Salary > 100000
DM: Only imprecise queries
Fuzzy Sets and Logic
• Fuzzy Set: Set membership function is a real
valued function with output in the range [0,1].
• f(x): Degree to which x belongs to F.
• 1-f(x): Degree to which x does not belong to F.
• EX:
– T = {x | x is a person and x is tall}
– Let f(x) be the degree to which x is tall
– Here f is the membership function
DM: Prediction and classification are fuzzy.
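A small sketch of a membership function f for the set of tall people, next to a classical yes/no set for comparison. The linear ramp and the 160, 175 and 190 cm breakpoints are illustrative assumptions; the slides do not define f.

```python
def tall_membership(height_cm, lower=160.0, upper=190.0):
    """Value in [0, 1] saying how strongly this height belongs to 'tall'.

    The linear ramp and the 160/190 cm breakpoints are illustrative assumptions.
    """
    if height_cm <= lower:
        return 0.0
    if height_cm >= upper:
        return 1.0
    return (height_cm - lower) / (upper - lower)


def crisp_tall(height_cm, threshold=175.0):
    """Classical (non-fuzzy) set: membership is either 0 or 1."""
    return 1.0 if height_cm >= threshold else 0.0


# Columns: height, crisp membership, f(x), 1 - f(x).
for h in (155, 170, 175, 185, 195):
    f = tall_membership(h)
    print(h, crisp_tall(h), round(f, 2), round(1 - f, 2))
```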
Fuzzy Sets
Classification/Prediction is Fuzzy
[Figure: two panels, "Simple" and "Fuzzy", each showing Accept and Reject regions against loan amount]
Data Warehouse
Data-warehouse-03.pptx
Data Cube Technology
• Data Cube Computation: Preliminary
Concepts
• Data Cube Computation Methods
• Processing Advanced Queries by Exploring
Data Cube Technology
• Multidimensional Data Analysis in Cube
Space
Data Cube : Lattice of Cuboids
[Figure: lattice of cuboids for the dimensions time, item, location, supplier]
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time,item,location,supplier
Data Cube : Lattice of Cuboids
• 0-D cuboid: all (zero dimensions), 1 cuboid
• 1-D cuboids: one dimension, 4 cuboids
– Time
– Item
– Location
– Supplier
Data Cube : Lattice of Cuboids
• 2-D cuboids: two dimensions, 6 cuboids
– Time-Item
– Time-Location
– Time-Supplier
– Item-Location
– Item-Supplier
– Location-Supplier
Data Cube : Lattice of Cuboids
• 3-D cuboids: three dimensions, 4 cuboids
– Time-Item-Location
– Time-Item-Supplier
– Time-Location-supplier
– Item-Location-supplier
• 4-D cuboid: four dimensions, 1 cuboid
• Number of cuboids = 2^n, where n is the number of dimensions (here 2^4 = 16)
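The lattice can be enumerated directly, since every cuboid is just a subset of the dimensions. A minimal sketch using the four dimensions from the slides (the function name `cuboids` is illustrative):

```python
from itertools import combinations

DIMENSIONS = ["time", "item", "location", "supplier"]


def cuboids(dimensions):
    """Yield every cuboid (subset of dimensions), grouped from 0-D up to n-D."""
    for k in range(len(dimensions) + 1):
        for combo in combinations(dimensions, k):
            yield combo


lattice = list(cuboids(DIMENSIONS))
for k in range(len(DIMENSIONS) + 1):
    level = [c for c in lattice if len(c) == k]
    print(f"{k}-D: {len(level)} cuboid(s)", level)
print("total:", len(lattice))
```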
Size of Cubes
• Four dimensions
– Time: daily dates (yyyymmdd) over 10 years (10 × 365 = 3,650 days)
– Item: 1000 items
– Location: 20 locations
– Supplier: 500 suppliers
• Maximum size = 10*365*1000*20*500 = 36,500,000,000 cells
How to improve performance
• Select the right cube
– Time-Item: 10*365*1000 = 3,650,000
– Time-Location: 10*365*20 = 73,000
– Time-Supplier: 10*365*500 = 1,825,000
– Item-Location: 1000*20 = 20,000
– Item-Supplier: 1000*500 = 500,000
– Location-Supplier: 20*500 = 10,000
– Time-Item-Location: 10*365*1000*20 = 73,000,000
– Time-Item-Supplier: 10*365*1000*500 = 1,825,000,000
– Time-Location-Supplier: 10*365*20*500 = 36,500,000
– Item-Location-Supplier: 1000*20*500 = 10,000,000
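The sizes above are products of the dimension cardinalities, so they can be recomputed for every cuboid in a few lines. A sketch using the cardinalities from the slides; the figures are upper bounds, since a real cube is usually sparse.

```python
from itertools import combinations
from math import prod

# Cardinalities taken from the slides: 10 years of daily dates, 1000 items,
# 20 locations, 500 suppliers.
CARDINALITY = {"time": 10 * 365, "item": 1000, "location": 20, "supplier": 500}


def max_cells(cuboid):
    """Upper bound on the number of cells: product of dimension cardinalities."""
    return prod(CARDINALITY[d] for d in cuboid)


sizes = {
    combo: max_cells(combo)
    for k in range(len(CARDINALITY) + 1)
    for combo in combinations(CARDINALITY, k)
}

# Smallest cuboids first: these are the cheapest ones to answer a query from.
for combo, size in sorted(sizes.items(), key=lambda kv: kv[1]):
    print(f"{'-'.join(combo) or 'all':35s} {size:>15,}")
```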
Materialization – Pre-computing
• On-line analytical processing may need to
access different cuboids for different
queries
• Compute some cuboids in advance
– Pre-computation leads to fast response times
– Most products support pre-computation to some degree
Materialization – Pre-computing
• Storage space may explode...
– If there are no hierarchies, the total number of cuboids
for an n-dimensional cube is 2^n
• But....
– Many dimensions may have hierarchies, for example
time
• day < week < month < quarter < year
– Explosion of cuboids
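How fast the count grows can be checked with the usual textbook formula: if dimension i has L_i hierarchy levels, the number of cuboids is the product of (L_i + 1). The sketch below assumes only time has a hierarchy (the five levels listed above) and that the other three dimensions have a single level each.

```python
from math import prod


def total_cuboids(levels_per_dimension):
    """Number of cuboids when dimension i has L_i hierarchy levels.

    Each dimension contributes L_i + 1 choices: one per level, plus "all".
    """
    return prod(levels + 1 for levels in levels_per_dimension)


# No hierarchies: every dimension has one level, giving 2**n cuboids.
flat = total_cuboids([1, 1, 1, 1])
# Assumed example: time has the five levels day < week < month < quarter < year,
# while item, location and supplier keep a single level each.
with_time_hierarchy = total_cuboids([5, 1, 1, 1])
print(flat, with_time_hierarchy)
```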
Efficient computation of Data Cubes
• Smallest-child: computing a cuboid from the smallest,
previously computed cuboid
• Cache-results: caching results of a cuboid from which other
cuboids are computed to reduce disk I/Os
• Amortize-scans: computing as many as possible cuboids at
the same time to amortize disk reads
• Share-sorts: sharing sorting costs across multiple cuboids
when a sort-based method is used
• Share-partitions: sharing the partitioning cost across multiple
cuboids when hash-based algorithms are used
Multi-way Array Aggregation
• Array-based “bottom-up”
algorithm
• Using multi-dimensional chunks
• No direct tuple comparisons
• Simultaneous aggregation on
multiple dimensions
• Intermediate aggregate values
are re-used for computing
ancestor cuboids
[Figure: cuboid lattice for dimensions A, B, C with levels all; A, B, C; AB, AC, BC; ABC]
Multi-way Array Aggregation for Cube
Computation (MOLAP)
• Partition arrays into chunks (a small subcube which fits
in memory).
• Compressed sparse array addressing: (chunk_id, offset)
• Compute aggregates in “multiway” by visiting cube cells
in the order which minimizes the # of times to visit each
cell, and reduces memory access and storage cost.
[Figure: 3-D array with dimensions A (a0 to a3), B (b0 to b3) and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64]
Multi-way Array Aggregation for Cube
Computation (MOLAP)
• Method: the planes should be sorted and
computed according to their size in ascending
order
– Idea: keep the smallest plane in the main memory,
fetch and compute only one chunk at a time for the
largest plane
• Limitation of the method: it works well only
for a small number of dimensions
– If there are a large number of dimensions, “top-down”
computation and iceberg cube computation methods
can be explored
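The core of multiway aggregation, visiting each cell once and updating several cuboids at the same time, can be sketched on a tiny dense array. The array contents, the 4x4x4 size and the chunk edge of 2 are assumptions for illustration; the sketch does not model the memory-saving plane ordering described above.

```python
from itertools import product

# Tiny dense 3-D array indexed as cube[a][b][c]; the sizes and the chunk
# edge length are illustrative assumptions, not values from the slides.
SIZE_A, SIZE_B, SIZE_C, CHUNK = 4, 4, 4, 2
cube = [[[a + b + c for c in range(SIZE_C)] for b in range(SIZE_B)]
        for a in range(SIZE_A)]


def chunk_origins():
    """Yield the corner (a0, b0, c0) of every chunk, one chunk at a time."""
    return product(range(0, SIZE_A, CHUNK),
                   range(0, SIZE_B, CHUNK),
                   range(0, SIZE_C, CHUNK))


def multiway_aggregate():
    """Visit each cell once and update the AB, AC and BC planes together."""
    ab = [[0] * SIZE_B for _ in range(SIZE_A)]
    ac = [[0] * SIZE_C for _ in range(SIZE_A)]
    bc = [[0] * SIZE_C for _ in range(SIZE_B)]
    for a0, b0, c0 in chunk_origins():
        for a, b, c in product(range(a0, a0 + CHUNK),
                               range(b0, b0 + CHUNK),
                               range(c0, c0 + CHUNK)):
            value = cube[a][b][c]
            ab[a][b] += value  # aggregate away C
            ac[a][c] += value  # aggregate away B
            bc[b][c] += value  # aggregate away A
    return ab, ac, bc


ab, ac, bc = multiway_aggregate()
# Ancestor cuboids are derived from an already aggregated plane, not the base.
a_totals = [sum(row) for row in ab]
grand_total = sum(a_totals)
print(a_totals, grand_total)
```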
Iceberg Cube
• An Iceberg-Cube contains only those cells
of the data cube that meet an aggregate
condition
• It is called an Iceberg-Cube because it
contains only some of the cells of the full
cube, like the tip of an iceberg
Iceberg Cube
• The purpose of the Iceberg-Cube is to
identify and compute only those values
that will most likely be required for
decision support queries
Iceberg Cube example
Part  StoreLocation  Customer
P1    Vancouver      Vance
P1    Calgary        Bob
P1    Toronto        Richard
P2    Toronto        Allison
P2    Toronto        Allison
P2    Toronto        Tom
P2    Ottawa         Allison
P3    Montreal       Anne

Combination               Count
{P1, ANY, ANY}            3
{P2, ANY, ANY}            4
{ANY, Toronto, ANY}       4
{ANY, ANY, Allison}       3
{P2, Toronto, ANY}        3
{P2, ANY, Allison}        3
{ANY, Toronto, Allison}   2
{P2, Toronto, Allison}    2
• Minimum support is 25% of tuples, i.e. 2 tuples,
and we want to create an Iceberg-Cube
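The combination table can be reproduced with a brute-force sketch that counts every generalization of every tuple and then applies the minimum support. This is deliberately naive and is not one of the efficient algorithms discussed in this deck; the all-ANY cell is left out to match the table.

```python
from collections import Counter
from itertools import product

ROWS = [
    ("P1", "Vancouver", "Vance"),
    ("P1", "Calgary", "Bob"),
    ("P1", "Toronto", "Richard"),
    ("P2", "Toronto", "Allison"),
    ("P2", "Toronto", "Allison"),
    ("P2", "Toronto", "Tom"),
    ("P2", "Ottawa", "Allison"),
    ("P3", "Montreal", "Anne"),
]
ANY = "ANY"
MIN_SUPPORT = 2  # 25% of the 8 tuples


def iceberg_cube(rows, min_support):
    """Count every generalization of every row, then keep the frequent cells."""
    counts = Counter()
    for row in rows:
        # Each attribute is either kept or replaced by ANY: 2**3 cells per row.
        for mask in product((True, False), repeat=len(row)):
            cell = tuple(v if keep else ANY for v, keep in zip(row, mask))
            counts[cell] += 1
    return {
        cell: n for cell, n in counts.items()
        if n >= min_support and any(v != ANY for v in cell)
    }


cube = iceberg_cube(ROWS, MIN_SUPPORT)
for cell, n in sorted(cube.items(), key=lambda kv: (-kv[1], kv[0])):
    print(cell, n)
```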
APRIORI algorithm
• The APRIORI algorithm uses candidate
combinations to avoid counting every
possible combination of attribute values.
• For a combination of attribute values to
satisfy the minimum support requirement,
all subsets of that combination must also
satisfy minimum support.
APRIORI algorithm
• The candidate combinations are found by
combining only the frequent attribute
value combinations that are already
known
• All other possible combinations are
automatically eliminated because not all
of their subsets would satisfy the
minimum support requirement
APRIORI
A   B   C   D
a1  b1  c3  d1
a1  b5  c1  d2
a1  b2  c5  d2
a2  b2  c2  d2
a2  b2  c2  d4
a2  b2  c4  d2
a2  b3  c2  d3
a3  b4  c6  d2

Combination  Count
{a1}         3
{a2}         4
{b2}         4
{c2}         3
{d2}         5
• On the first pass over the data, the
APRIORI algorithm determines that the
single values shown in the table are frequent
APRIORI
Frequent single values (first pass):
Combination  Count
{a1}         3
{a2}         4
{b2}         4
{c2}         3
{d2}         5
• On the second pass, the APRIORI
algorithm combines the frequent single
values into candidate pairs, counts them,
and keeps the frequent pairs
Candidate combinations:
{a1,b2}
{a1,c2}
{a1,d2}
{a2,b2}
{a2,c2}
{a2,d2}
{b2,c2}
{b2,d2}
{c2,d2}
Frequent combinations:
Combination  Count
{a1,d2}      2
{a2,b2}      3
{a2,c2}      3
{b2,c2}      2
{a2,d2}      2
{b2,d2}      3
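The two passes can be sketched as below. The minimum support of 2 tuples is an assumption inferred from the pair counts kept in the table, and the test that treats the first character of a value as its attribute name is a shortcut that only suits this toy data.

```python
from collections import Counter
from itertools import combinations

ROWS = [
    ("a1", "b1", "c3", "d1"),
    ("a1", "b5", "c1", "d2"),
    ("a1", "b2", "c5", "d2"),
    ("a2", "b2", "c2", "d2"),
    ("a2", "b2", "c2", "d4"),
    ("a2", "b2", "c4", "d2"),
    ("a2", "b3", "c2", "d3"),
    ("a3", "b4", "c6", "d2"),
]
MIN_SUPPORT = 2  # assumed threshold, inferred from the counts on the slides


def frequent(candidates, rows, min_support):
    """Count each candidate combination and keep those meeting min_support."""
    counts = Counter()
    for row in rows:
        values = set(row)
        for cand in candidates:
            if values.issuperset(cand):
                counts[cand] += 1
    return {c: n for c, n in counts.items() if n >= min_support}


# Pass 1: every single attribute value is a candidate.
singles = {frozenset([v]) for row in ROWS for v in row}
f1 = frequent(singles, ROWS, MIN_SUPPORT)

# Pass 2: candidates are pairs of frequent single values only. Two values of
# the same attribute (same first letter here) can never occur together.
items = sorted(v for s in f1 for v in s)
pairs = {frozenset(p) for p in combinations(items, 2) if p[0][0] != p[1][0]}
f2 = frequent(pairs, ROWS, MIN_SUPPORT)

for combo, n in sorted(f2.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(combo), n)
```

Only 9 candidate pairs are counted on the second pass, instead of every pair of values that appears in the data.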
Top-Down Computation
• The algorithm begins by computing the
frequent attribute value combinations for
the attribute set at the top of the tree, in
this case ABCD.
Top-Down Computation
• On the same pass over the data, tdC
counts value combinations for ABCD,
ABC, AB and A, adding the frequent ones
to the Iceberg-Cube
Top-Down Computation
Ordered by A,B,C,D:
A   B   C   D
a1  b1  c3  d1
a1  b2  c5  d2
a1  b5  c1  d2
a2  b2  c2  d2
a2  b2  c2  d4
a2  b2  c4  d2
a2  b3  c2  d3
a3  b4  c6  d2

Iceberg-Cube of ABCD:
Combination  Count
{a1}         3
{a2,b2,c2}   2
{a2,b2}      3
{a2}         4
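Counting all prefixes of one attribute ordering in a single pass can be sketched as follows. This is a simplified illustration, not the tdC algorithm itself: it uses a counter instead of exploiting the sort order, and the minimum support of 2 is assumed from the counts shown.

```python
from collections import Counter

ROWS = [
    ("a1", "b1", "c3", "d1"),
    ("a1", "b5", "c1", "d2"),
    ("a1", "b2", "c5", "d2"),
    ("a2", "b2", "c2", "d2"),
    ("a2", "b2", "c2", "d4"),
    ("a2", "b2", "c4", "d2"),
    ("a2", "b3", "c2", "d3"),
    ("a3", "b4", "c6", "d2"),
]
MIN_SUPPORT = 2  # assumed threshold, inferred from the counts on the slides


def prefix_counts(rows, order, min_support):
    """One pass over rows sorted by `order`, counting every prefix of it.

    For order (A, B, C, D) this counts ABCD, ABC, AB and A together.
    """
    counts = Counter()
    for row in sorted(tuple(r[i] for i in order) for r in rows):
        for length in range(1, len(row) + 1):
            counts[row[:length]] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}


# Column positions: A=0, B=1, C=2, D=3.
abcd = prefix_counts(ROWS, order=(0, 1, 2, 3), min_support=MIN_SUPPORT)
abd = prefix_counts(ROWS, order=(0, 1, 3), min_support=MIN_SUPPORT)
print(abcd)
print(abd)
```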
Top-Down Computation
Ordered by A,B,D:
A   B   D
a1  b1  d1
a1  b2  d2
a1  b5  d2
a2  b2  d2
a2  b2  d2
a2  b2  d4
a2  b3  d3
a3  b4  d2

Iceberg-Cube of ABD:
Combination  Count
{a1}         3
{a2,b2,d2}   2
{a2,b2}      3
{a2}         4
Top-Down Computation
Final Iceberg-Cube:
Combination  Count
{a1}         3
{a2}         4
{a2,b2}      3
{a2,b2,c2}   2
{a2,b2,d2}   2
{a2,c2}      3
{a1,d2}      2
{a2,d2}      2
{b2,c2}      2
{b2}         4
{b2,d2}      3
{c2}         3
{d2}         5
Bit-Map Indexes
• New indexing techniques: bitmap
indexes, join indexes, array
representations, compression,
pre-computation of aggregations, etc.
Cust ID  Name  Sex  Rating  Sex bits (M,F)  Rating bits (1-5)
112      Joe   M    3       10              00100
115      Ram   M    5       10              00001
119      Sue   F    5       01              00001
112      Woo   M    4       10              00010
Bit vector: one bit for each possible value
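A bitmap index over this table can be sketched with Python integers as bit vectors. The sketch stores one bit vector per attribute value with one bit per row, which is the transpose of the per-row layout shown above; the helper names are illustrative.

```python
# Rows from the slide: (cust_id, name, sex, rating).
ROWS = [
    (112, "Joe", "M", 3),
    (115, "Ram", "M", 5),
    (119, "Sue", "F", 5),
    (112, "Woo", "M", 4),
]


def build_bitmap_index(rows, column):
    """Return {value: bitmask}; bit i is set when row i holds that value."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index


def matching_rows(rows, bitmask):
    """Decode a bitmask back into the rows it selects."""
    return [row for i, row in enumerate(rows) if bitmask >> i & 1]


sex_index = build_bitmap_index(ROWS, column=2)
rating_index = build_bitmap_index(ROWS, column=3)

# Males with rating 5: a single bitwise AND of two bit vectors.
hits = sex_index["M"] & rating_index[5]
print(matching_rows(ROWS, hits))
```

Combining predicates becomes a bitwise AND or OR, which is why bitmap indexes suit low-cardinality columns such as Sex and Rating.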
OLAP Vendor list
• IBM
• Infor
• Oracle OLAP
• SAS
• SAP BW
• Microsoft (SQL Server OLAP)
• MicroStrategy

The Role of Histograms in Exploring Data InsightsCIToolkit
 
From Red to Green: Enhancing Decision-Making with Traffic Light Assessment
From Red to Green: Enhancing Decision-Making with Traffic Light AssessmentFrom Red to Green: Enhancing Decision-Making with Traffic Light Assessment
From Red to Green: Enhancing Decision-Making with Traffic Light AssessmentCIToolkit
 
Paired Comparison Analysis: A Practical Tool for Evaluating Options and Prior...
Paired Comparison Analysis: A Practical Tool for Evaluating Options and Prior...Paired Comparison Analysis: A Practical Tool for Evaluating Options and Prior...
Paired Comparison Analysis: A Practical Tool for Evaluating Options and Prior...CIToolkit
 
Management 11th Edition - Chapter 11 - Adaptive Organizational Design
Management 11th Edition - Chapter 11 - Adaptive Organizational DesignManagement 11th Edition - Chapter 11 - Adaptive Organizational Design
Management 11th Edition - Chapter 11 - Adaptive Organizational Designshakkardaddy
 
Better SAFe than sorry - Why scaled agile frameworks do not necessarily impro...
Better SAFe than sorry - Why scaled agile frameworks do not necessarily impro...Better SAFe than sorry - Why scaled agile frameworks do not necessarily impro...
Better SAFe than sorry - Why scaled agile frameworks do not necessarily impro...Conny Dethloff
 
Management 11th Edition - Chapter 9 - Strategic Management
Management 11th Edition - Chapter 9 - Strategic ManagementManagement 11th Edition - Chapter 9 - Strategic Management
Management 11th Edition - Chapter 9 - Strategic Managementshakkardaddy
 
BoSUSA23 | Chris Spiek & Justin Dickow | Autobooks Product & Engineering
BoSUSA23 | Chris Spiek & Justin Dickow | Autobooks Product & EngineeringBoSUSA23 | Chris Spiek & Justin Dickow | Autobooks Product & Engineering
BoSUSA23 | Chris Spiek & Justin Dickow | Autobooks Product & EngineeringBusiness of Software Conference
 
The Final Activity in Project Management
The Final Activity in Project ManagementThe Final Activity in Project Management
The Final Activity in Project ManagementCIToolkit
 

Último (20)

Advancing Enterprise Risk Management Practices- A Strategic Framework by Naga...
Advancing Enterprise Risk Management Practices- A Strategic Framework by Naga...Advancing Enterprise Risk Management Practices- A Strategic Framework by Naga...
Advancing Enterprise Risk Management Practices- A Strategic Framework by Naga...
 
Effective learning in the Age of Hybrid Work - Agile Saturday Tallinn 2024
Effective learning in the Age of Hybrid Work - Agile Saturday Tallinn 2024Effective learning in the Age of Hybrid Work - Agile Saturday Tallinn 2024
Effective learning in the Age of Hybrid Work - Agile Saturday Tallinn 2024
 
Mind Mapping: A Visual Approach to Organize Ideas and Thoughts
Mind Mapping: A Visual Approach to Organize Ideas and ThoughtsMind Mapping: A Visual Approach to Organize Ideas and Thoughts
Mind Mapping: A Visual Approach to Organize Ideas and Thoughts
 
Exploring Variable Relationships with Scatter Diagram Analysis
Exploring Variable Relationships with Scatter Diagram AnalysisExploring Variable Relationships with Scatter Diagram Analysis
Exploring Variable Relationships with Scatter Diagram Analysis
 
Management 11th Edition - Chapter 13 - Managing Teams
Management 11th Edition - Chapter 13 - Managing TeamsManagement 11th Edition - Chapter 13 - Managing Teams
Management 11th Edition - Chapter 13 - Managing Teams
 
The Role of Box Plots in Comparing Multiple Data Sets
The Role of Box Plots in Comparing Multiple Data SetsThe Role of Box Plots in Comparing Multiple Data Sets
The Role of Box Plots in Comparing Multiple Data Sets
 
How Technologies will change the relationship with Human Resources
How Technologies will change the relationship with Human ResourcesHow Technologies will change the relationship with Human Resources
How Technologies will change the relationship with Human Resources
 
Overview PMI Infinity - UK Chapter presentation
Overview PMI Infinity - UK Chapter presentationOverview PMI Infinity - UK Chapter presentation
Overview PMI Infinity - UK Chapter presentation
 
Operations Management -- Sustainability and Supply Chain Management.pdf
Operations Management -- Sustainability and Supply Chain Management.pdfOperations Management -- Sustainability and Supply Chain Management.pdf
Operations Management -- Sustainability and Supply Chain Management.pdf
 
HOTEL MANAGEMENT SYSTEM PPT PRESENTATION
HOTEL MANAGEMENT SYSTEM PPT PRESENTATIONHOTEL MANAGEMENT SYSTEM PPT PRESENTATION
HOTEL MANAGEMENT SYSTEM PPT PRESENTATION
 
Leveraging Gap Analysis for Continuous Improvement
Leveraging Gap Analysis for Continuous ImprovementLeveraging Gap Analysis for Continuous Improvement
Leveraging Gap Analysis for Continuous Improvement
 
Adapting to Change: Using PEST Analysis for Better Decision-Making
Adapting to Change: Using PEST Analysis for Better Decision-MakingAdapting to Change: Using PEST Analysis for Better Decision-Making
Adapting to Change: Using PEST Analysis for Better Decision-Making
 
The Role of Histograms in Exploring Data Insights
The Role of Histograms in Exploring Data InsightsThe Role of Histograms in Exploring Data Insights
The Role of Histograms in Exploring Data Insights
 
From Red to Green: Enhancing Decision-Making with Traffic Light Assessment
From Red to Green: Enhancing Decision-Making with Traffic Light AssessmentFrom Red to Green: Enhancing Decision-Making with Traffic Light Assessment
From Red to Green: Enhancing Decision-Making with Traffic Light Assessment
 
Paired Comparison Analysis: A Practical Tool for Evaluating Options and Prior...
Paired Comparison Analysis: A Practical Tool for Evaluating Options and Prior...Paired Comparison Analysis: A Practical Tool for Evaluating Options and Prior...
Paired Comparison Analysis: A Practical Tool for Evaluating Options and Prior...
 
Management 11th Edition - Chapter 11 - Adaptive Organizational Design
Management 11th Edition - Chapter 11 - Adaptive Organizational DesignManagement 11th Edition - Chapter 11 - Adaptive Organizational Design
Management 11th Edition - Chapter 11 - Adaptive Organizational Design
 
Better SAFe than sorry - Why scaled agile frameworks do not necessarily impro...
Better SAFe than sorry - Why scaled agile frameworks do not necessarily impro...Better SAFe than sorry - Why scaled agile frameworks do not necessarily impro...
Better SAFe than sorry - Why scaled agile frameworks do not necessarily impro...
 
Management 11th Edition - Chapter 9 - Strategic Management
Management 11th Edition - Chapter 9 - Strategic ManagementManagement 11th Edition - Chapter 9 - Strategic Management
Management 11th Edition - Chapter 9 - Strategic Management
 
BoSUSA23 | Chris Spiek & Justin Dickow | Autobooks Product & Engineering
BoSUSA23 | Chris Spiek & Justin Dickow | Autobooks Product & EngineeringBoSUSA23 | Chris Spiek & Justin Dickow | Autobooks Product & Engineering
BoSUSA23 | Chris Spiek & Justin Dickow | Autobooks Product & Engineering
 
The Final Activity in Project Management
The Final Activity in Project ManagementThe Final Activity in Project Management
The Final Activity in Project Management
 

Dw-dm-part-01

  • 2. Introduction Outline • Define data mining • Data mining vs. databases • Basic data mining tasks • Data mining development • Data mining issues
  • 3. Introduction • Data is growing at a phenomenal rate • Users expect more sophisticated information • How? Uncover hidden information: data mining
  • 4. Data Mining Definition • Finding hidden information in a database • Fit data to a model • Similar terms – Exploratory data analysis – Data driven discovery – Deductive learning
  • 5. Data Mining Algorithm • Objective: Fit Data to a Model – Descriptive – Predictive • Preference – Technique to choose the best model • Search – Technique to search the data – “Query”
  • 6. Database Processing vs. Data Mining Processing • Database processing: Query – Well defined – SQL; Data – Operational data; Output – Precise – Subset of database • Data mining processing: Query – Poorly defined – No precise query language; Data – Not operational data; Output – Fuzzy – Not a subset of database
  • 7. Query Examples • Database – Find all credit applicants with last name of Smith – Identify customers who have purchased more than $10,000 in the last month – Find all customers who have purchased milk • Data Mining – Find all credit applicants who are poor credit risks (classification) – Identify customers with similar buying habits (clustering) – Find all items which are frequently purchased with milk (association rules)
  • 8. Data Mining Models and Tasks
  • 9. Basic Data Mining Tasks • Classification maps data into predefined groups or classes – Supervised learning – Pattern recognition – Prediction • Regression is used to map a data item to a real valued prediction variable. • Clustering groups similar data together into clusters. – Unsupervised learning – Segmentation – Partitioning
  • 10. Basic Data Mining Tasks (cont’d) • Summarization maps data into subsets with associated simple descriptions. – Characterization – Generalization • Link Analysis uncovers relationships among data. – Affinity Analysis – Association Rules • Sequential Analysis determines sequential patterns.
  • 11. Ex: Time Series Analysis • Example: Stock Market • Predict future values • Determine similar patterns over time • Classify behavior
  • 12. Data Mining vs. KDD • Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data. • Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.
  • 13. KDD Process (modified from [FPSS96C]) • Selection: Obtain data from various sources. • Preprocessing: Cleanse data. • Transformation: Convert to common format. Transform to new format. • Data Mining: Obtain desired results. • Interpretation/Evaluation: Present results to user in meaningful manner.
  • 14. KDD Process Ex: Web Log • Selection: – Select log data (dates and locations) to use • Preprocessing: – Remove identifying URLs – Remove error logs • Transformation: – Sessionize (sort and group) • Data Mining: – Identify and count patterns – Construct data structure • Interpretation/Evaluation: – Identify and display frequently accessed sequences. • Potential User Applications: – Cache prediction – Personalization
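
The web log steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log records, the 30-minute session gap, and the function names (`preprocess`, `sessionize`, `count_sequences`) are all hypothetical and do not come from the slides.

```python
from collections import Counter, defaultdict

# Hypothetical raw log records: (user, timestamp_seconds, url, http_status).
# Selection is assumed to have already picked these fields.
RAW_LOG = [
    ("u1", 0, "/home", 200),
    ("u1", 30, "/products", 200),
    ("u1", 60, "/cart", 200),
    ("u2", 10, "/home", 200),
    ("u2", 40, "/products", 200),
    ("u2", 50, "/missing", 404),
    ("u1", 4000, "/home", 200),
    ("u1", 4030, "/products", 200),
]

SESSION_GAP = 1800  # assumed inactivity threshold, in seconds


def preprocess(log):
    """Preprocessing: remove error log entries (status >= 400)."""
    return [record for record in log if record[3] < 400]


def sessionize(log, gap=SESSION_GAP):
    """Transformation: sort by user and time, then split on long gaps."""
    by_user = defaultdict(list)
    for user, ts, url, _ in sorted(log, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, url))
    sessions = []
    for visits in by_user.values():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, url) in zip(visits, visits[1:]):
            if ts - prev_ts > gap:
                sessions.append(current)
                current = []
            current.append(url)
        sessions.append(current)
    return sessions


def count_sequences(sessions, length=2):
    """Data mining: count contiguous page sequences of a given length."""
    counts = Counter()
    for session in sessions:
        for i in range(len(session) - length + 1):
            counts[tuple(session[i:i + length])] += 1
    return counts


sessions = sessionize(preprocess(RAW_LOG))
pair_counts = count_sequences(sessions)

# Interpretation/evaluation: display the most frequently accessed sequences.
for sequence, n in pair_counts.most_common(2):
    print(" -> ".join(sequence), n)
```

A cache-prediction application could, for instance, prefetch the page that most often follows the current one in `pair_counts`.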
  • 15. Data Mining Development • Information Retrieval – Similarity Measures – Hierarchical Clustering – IR Systems – Imprecise Queries – Textual Data – Web Search Engines • Statistics – Bayes Theorem – Regression Analysis – EM Algorithm – K-Means Clustering – Time Series Analysis • Machine Learning – Neural Networks – Decision Tree Algorithms • Algorithms – Algorithm Design Techniques – Algorithm Analysis – Data Structures • Databases – Relational Data Model – SQL – Association Rule Algorithms – Data Warehousing – Scalability Techniques
  • 16. KDD Issues • Human Interaction • Overfitting • Outliers • Interpretation • Visualization • Large Datasets • High Dimensionality
  • 17. KDD Issues (cont’d) • Multimedia Data • Missing Data • Irrelevant Data • Noisy Data • Changing Data • Integration • Application
  • 18. Social Implications of DM • Privacy • Profiling • Unauthorized use
  • 19. Data Mining Metrics • Usefulness • Return on Investment (ROI) • Accuracy • Space/Time
  • 20. Database Perspective on Data Mining • Scalability • Real World Data • Updates • Ease of Use
  • 21. Visualization Techniques • Graphical • Geometric • Icon-based • Pixel-based • Hierarchical • Hybrid
  • 22. Related Concepts Outline • Goal: Examine some areas which are related to data mining. • Database/OLTP Systems • Fuzzy Sets and Logic • Information Retrieval (Web Search Engines) • Dimensional Modeling • Data Warehousing • OLAP/DSS • Statistics • Machine Learning • Pattern Matching
  • 23. DB & OLTP Systems • Schema – (ID, Name, Address, Salary, JobNo) • Data Model – ER – Relational • Transaction • Query: SELECT Name FROM T WHERE Salary > 100000 • DM: Only imprecise queries
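
As a small illustration of how precise a database query is, the query on this slide can be run with Python's built-in `sqlite3` module. The three rows are invented for the example, and the `ORDER BY` clause is an addition to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Schema from the slide: (ID, Name, Address, Salary, JobNo).
conn.execute(
    "CREATE TABLE T (ID INTEGER, Name TEXT, Address TEXT, Salary REAL, JobNo INTEGER)"
)
# Hypothetical rows, made up for this sketch.
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Ann", "1 Main St", 120000, 10),
        (2, "Bob", "2 Oak Ave", 90000, 11),
        (3, "Cy", "3 Elm Rd", 150000, 10),
    ],
)

# The slide's query; the answer is an exact subset of the stored data.
names = [
    row[0]
    for row in conn.execute(
        "SELECT Name FROM T WHERE Salary > 100000 ORDER BY Name"
    )
]
print(names)
conn.close()
```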
  • 24. Fuzzy Sets and Logic • Fuzzy Set: Set membership function is a real valued function with output in the range [0,1]. • f(x): Probability x is in F. • 1-f(x): Probability x is not in F. • Ex: – T = {x | x is a person and x is tall} – Let f(x) be the probability that x is tall – Here f is the membership function • DM: Prediction and classification are fuzzy.
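
A minimal sketch of a membership function for the "tall" example follows. The linear ramp between 160 cm and 190 cm and the crisp 175 cm threshold are assumptions chosen for illustration; the slides do not specify a particular function.

```python
def tall_membership(height_cm, low=160.0, high=190.0):
    """Fuzzy membership f(x) in [0, 1] for the set T of tall people.

    Assumed shape: 0 below `low`, 1 above `high`, linear in between.
    """
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    return (height_cm - low) / (high - low)


def crisp_tall(height_cm, threshold=175.0):
    """Ordinary (crisp) set membership: either in the set or not."""
    return 1.0 if height_cm >= threshold else 0.0


for height in (155, 170, 175, 185, 195):
    f = tall_membership(height)
    # Columns: height, f(x), 1 - f(x), crisp membership
    print(height, round(f, 2), round(1 - f, 2), crisp_tall(height))
```

The crisp version jumps from 0 to 1 at the threshold, while the fuzzy version changes gradually, which is the contrast the next slide draws for loan acceptance.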
  • 26. Classification/Prediction is Fuzzy [Figure: two panels, Simple and Fuzzy, each split into Accept and Reject regions; axis label: Loan Amount]
  • 28. Data Cube Technology • Data Cube Computation: Preliminary Concepts • Data Cube Computation Methods • Processing Advanced Queries by Exploring Data Cube Technology • Multidimensional Data Analysis in Cube Space
  • 29. Data Cube: Lattice of Cuboids • 0-D (apex) cuboid: all • 1-D cuboids: time; item; location; supplier • 2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier • 3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier • 4-D (base) cuboid: time,item,location,supplier
  • 30. Data Cube: Lattice of Cuboids • D0 cube: all, zero dimensions, 1 cube • D1 cubes: one dimension, 4 cubes – Time – Item – Location – Supplier
  • 31. Data Cube: Lattice of Cuboids • D2 cubes: two dimensions, 6 cubes – Time-Item – Time-Location – Time-Supplier – Item-Location – Item-Supplier – Location-Supplier
  • 32. Data Cube: Lattice of Cuboids • D3 cubes: three dimensions, 4 cubes – Time-Item-Location – Time-Item-Supplier – Time-Location-Supplier – Item-Location-Supplier • D4 cube: four dimensions, 1 cube • No. of cubes = 2^(no. of dimensions) = 2^4 = 16
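
The lattice can be enumerated directly: every subset of the dimension set is one cuboid. A short sketch using `itertools.combinations` (the function name `cuboid_lattice` is made up for this example):

```python
from itertools import combinations

DIMENSIONS = ("time", "item", "location", "supplier")


def cuboid_lattice(dims):
    """Map each level k to the list of k-dimensional cuboids."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}


lattice = cuboid_lattice(DIMENSIONS)
for k, cuboids in lattice.items():
    print(f"{k}-D: {len(cuboids)} cuboid(s)")

# The level sizes 1 + 4 + 6 + 4 + 1 add up to 2^4.
total = sum(len(cuboids) for cuboids in lattice.values())
print("total:", total)
```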
  • 33. Size of Cubes • Four dimensions – Time – yyyymmdd over 10 years (10 × 365 = 3,650 days) – Item – 1,000 items – Location – 20 locations – Supplier – 500 suppliers • Maximum size = 10*365*1000*20*500 = 36,500,000,000 cells
  • 34. How to improve performance • Select the right cube – Time-Item – 10*365*1000 = 3,650,000 – Time-Location – 10*365*20 = 73,000 – Time-Supplier – 10*365*500 = 1,825,000 – Item-Location – 1000*20 = 20,000 – Item-Supplier – 1000*500 = 500,000 – Location-Supplier – 20*500 = 10,000 – Time-Item-Location – 10*365*1000*20 = 73,000,000 – Time-Item-Supplier – 10*365*1000*500 = 1,825,000,000 – Time-Location-Supplier – 10*365*20*500 = 36,500,000 – Item-Location-Supplier – 1000*20*500 = 10,000,000
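
The sizes listed above are products of the dimension cardinalities, so they can be recomputed mechanically. The sketch below assumes a dense cube (every combination of values occurs), which gives the maximum size of each cuboid; `cuboid_sizes` is a hypothetical helper name.

```python
from itertools import combinations
from math import prod

# Cardinalities from the slide: 10 years of days, 1000 items, 20 locations, 500 suppliers.
CARDINALITY = {"time": 10 * 365, "item": 1000, "location": 20, "supplier": 500}


def cuboid_sizes(cardinality):
    """Maximum number of cells in every cuboid of the lattice."""
    sizes = {}
    for k in range(len(cardinality) + 1):
        for dims in combinations(cardinality, k):
            sizes[dims] = prod(cardinality[d] for d in dims)  # empty product is 1
    return sizes


sizes = cuboid_sizes(CARDINALITY)
# List the cuboids from smallest to largest to see which are cheap to answer from.
for dims, size in sorted(sizes.items(), key=lambda kv: kv[1]):
    print("-".join(dims) or "all", f"{size:,}")
```

A query that only groups by location and supplier can be answered from a 10,000-cell cuboid instead of the 36.5-billion-cell base cuboid.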
  • 35. Materialization – Pre-computing • On-line analytical processing may need to access different cuboids for different queries • Compute some cuboids in advance – Pre-computation leads to fast response times – Most products support pre-computation to some degree
  • 36. Materialization – Pre-computing • Storage space may explode... – If there are no hierarchies, the total number of cuboids for an n-dimensional cube is 2^n • But.... – Many dimensions may have hierarchies, for example time • day < week < month < quarter < year – Explosion of cuboids
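
To see how hierarchies cause the explosion, the usual counting rule is that a dimension with L hierarchy levels contributes L + 1 choices (its levels plus the "all" level), so the total number of cuboids is the product of (L_i + 1) over all dimensions. The sketch below applies that rule to the four dimensions used earlier, assuming only time has a hierarchy, with the five levels named on the slide.

```python
from math import prod


def total_cuboids(levels_per_dimension):
    """Product over dimensions of (number of hierarchy levels + 1 for 'all')."""
    return prod(levels + 1 for levels in levels_per_dimension.values())


# No hierarchies: one level per dimension, giving 2^n cuboids.
flat = {"time": 1, "item": 1, "location": 1, "supplier": 1}

# Assumed for illustration: time has the 5 levels day, week, month, quarter, year.
with_time_hierarchy = dict(flat, time=5)

print(total_cuboids(flat))                 # 2 * 2 * 2 * 2
print(total_cuboids(with_time_hierarchy))  # 6 * 2 * 2 * 2
```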
  • 37. Efficient computation of Data Cubes • Smallest-child: computing a cuboid from the smallest previously computed cuboid • Cache-results: caching results of a cuboid from which other cuboids are computed, to reduce disk I/O • Amortize-scans: computing as many cuboids as possible at the same time to amortize disk reads • Share-sorts: sharing sorting costs across multiple cuboids when a sort-based method is used • Share-partitions: sharing the partitioning cost across multiple cuboids when hash-based algorithms are used
  • 38. Multi-way Array Aggregation • Array-based “bottom-up” algorithm • Using multi-dimensional chunks • No direct tuple comparisons • Simultaneous aggregation on multiple dimensions • Intermediate aggregate values are re-used for computing ancestor cuboids [Figure: lattice of cuboids all, A, B, C, AB, AC, BC, ABC]
  • 39. Multi-way Array Aggregation for Cube Computation (MOLAP) • Partition arrays into chunks (a chunk is a small subcube which fits in memory). • Compressed sparse array addressing: (chunk_id, offset) • Compute aggregates in “multiway” by visiting cube cells in the order which minimizes the number of times each cell is visited, and reduces memory access and storage cost. [Figure: 3-D array over dimensions A (a0–a3), B (b0–b3) and C (c0–c3), partitioned into 64 chunks numbered 1–64]
  • 40. Multi-way Array Aggregation for Cube Computation (MOLAP) [Figure: 3-D array over dimensions A (a0–a3), B (b0–b3) and C (c0–c3), partitioned into 64 chunks numbered 1–64]
  • 41. Multi-way Array Aggregation for Cube Computation (MOLAP) • Method: the planes should be sorted and computed according to their size in ascending order – Idea: keep the smallest plane in main memory; fetch and compute only one chunk at a time for the largest plane • Limitation of the method: it computes well only for a small number of dimensions – If there are a large number of dimensions, “top-down” computation and iceberg cube computation methods can be explored
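
The core idea of multi-way aggregation, that one scan over the chunks updates several 2-D planes at once, can be sketched in plain Python. This is a toy under stated assumptions: a dense 4×4×4 array with made-up values, 2×2×2 chunks (8 chunks rather than the 64 in the figure), and no modelling of the memory-minimizing plane ordering described above.

```python
from itertools import product

SIZE = 4   # assumed number of cells per dimension (a0..a3, b0..b3, c0..c3)
CHUNK = 2  # assumed chunk edge length, so the array splits into 8 chunks

# Hypothetical dense measure array, indexed as cell[a][b][c].
cell = [[[a + 10 * b + 100 * c for c in range(SIZE)]
         for b in range(SIZE)]
        for a in range(SIZE)]


def multiway_aggregate(cell, size=SIZE, chunk=CHUNK):
    """Compute the AB, AC and BC planes in a single scan over the chunks."""
    ab = [[0] * size for _ in range(size)]
    ac = [[0] * size for _ in range(size)]
    bc = [[0] * size for _ in range(size)]
    starts = range(0, size, chunk)
    # Visit chunks with A varying fastest, then B, then C.
    for c0, b0, a0 in product(starts, repeat=3):
        for a, b, c in product(range(a0, a0 + chunk),
                               range(b0, b0 + chunk),
                               range(c0, c0 + chunk)):
            value = cell[a][b][c]
            # Each cell is read once and contributes to all three planes.
            ab[a][b] += value
            ac[a][c] += value
            bc[b][c] += value
    return ab, ac, bc


ab, ac, bc = multiway_aggregate(cell)

# Reuse the AB plane for its ancestors: the A cuboid, then the apex.
a_totals = [sum(row) for row in ab]
grand_total = sum(a_totals)
print(a_totals, grand_total)
```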
  • 42. Iceberg Cube • An Iceberg-Cube contains only those cells of the data cube that meet an aggregate condition • It is called an Iceberg-Cube because it contains only some of the cells of the full cube, like the tip of an iceberg
  • 43. Iceberg Cube • The purpose of the Iceberg-Cube is to identify and compute only those values that will most likely be required for decision support queries
  • 44. Iceberg Cube example • Minimum support is 25% of tuples, i.e. 2 tuples, and we want to create an Iceberg-Cube
      Part  StoreLocation  Customer
      P1    Vancouver      Vance
      P1    Calgary        Bob
      P1    Toronto        Richard
      P2    Toronto        Allison
      P2    Toronto        Allison
      P2    Toronto        Tom
      P2    Ottawa         Allison
      P3    Montreal       Anne

      Combination               Count
      {P1, ANY, ANY}            3
      {P2, ANY, ANY}            4
      {ANY, Toronto, ANY}       4
      {ANY, ANY, Allison}       3
      {P2, Toronto, ANY}        3
      {P2, ANY, Allison}        3
      {ANY, Toronto, Allison}   2
      {P2, Toronto, Allison}    2
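
The example above can be checked with a brute-force sketch: generalize every tuple in all possible ways, count the resulting cells, and keep those that meet minimum support. This is deliberately the naive approach, not one of the pruning algorithms discussed on the following slides, and the all-ANY apex cell is skipped because the slide does not list it.

```python
from collections import Counter
from itertools import product

ROWS = [
    ("P1", "Vancouver", "Vance"),
    ("P1", "Calgary", "Bob"),
    ("P1", "Toronto", "Richard"),
    ("P2", "Toronto", "Allison"),
    ("P2", "Toronto", "Allison"),
    ("P2", "Toronto", "Tom"),
    ("P2", "Ottawa", "Allison"),
    ("P3", "Montreal", "Anne"),
]
ANY = "ANY"


def iceberg_cube(rows, min_support):
    """Count every generalization of every row; keep cells with enough support."""
    counts = Counter()
    width = len(rows[0])
    for row in rows:
        for keep in product((True, False), repeat=width):
            if not any(keep):
                continue  # skip the (ANY, ANY, ANY) apex cell
            cell = tuple(v if k else ANY for v, k in zip(row, keep))
            counts[cell] += 1
    return {cell: n for cell, n in counts.items() if n >= min_support}


min_support = int(0.25 * len(ROWS))  # 25% of 8 tuples
cube = iceberg_cube(ROWS, min_support)
for cell, n in sorted(cube.items(), key=lambda kv: -kv[1]):
    print(cell, n)
```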
  • 45. APRIORI algorithm • The APRIORI algorithm uses candidate combinations to avoid counting every possible combination of attribute values. • For a combination of attribute values to satisfy the minimum support requirement, all subsets of that combination must also satisfy minimum support.
  • 46. APRIORI algorithm • The candidate combinations are found by combining only the frequent attribute value combinations that are already known • All other possible combinations are automatically eliminated because not all of their subsets would satisfy the minimum support requirement
  • 47. APRIORI • On the first pass over the data, the APRIORI algorithm determines that the single values shown in the second table satisfy minimum support.
      A   B   C   D
      a1  b1  c3  d1
      a1  b5  c1  d2
      a1  b2  c5  d2
      a2  b2  c2  d2
      a2  b2  c2  d4
      a2  b2  c4  d2
      a2  b3  c2  d3
      a3  b4  c6  d2

      Combination  Count
      {a1}         3
      {a2}         4
      {b2}         4
      {c2}         3
      {d2}         5
  • 48. APRIORI • On the second pass, the APRIORI algorithm builds candidate pairs from the frequent single values, counts them, and keeps the pairs that satisfy minimum support.
      Frequent single values: {a1} 3, {a2} 4, {b2} 4, {c2} 3, {d2} 5
      Candidate pairs: {a1,b2} {a1,c2} {a1,d2} {a2,b2} {a2,c2} {a2,d2} {b2,c2} {b2,d2} {c2,d2}

      Combination  Count
      {a1,d2}      2
      {a2,b2}      3
      {a2,c2}      3
      {b2,c2}      2
      {a2,d2}      2
      {b2,d2}      3
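
The two passes above can be reproduced with a compact level-wise sketch. It assumes a minimum support of 2 tuples (the slides do not state the threshold for this table, but the counts shown are consistent with it) and represents each item as a hypothetical (column, value) pair so that two values of the same attribute are never combined.

```python
from itertools import combinations

COLUMNS = ("A", "B", "C", "D")
ROWS = [
    ("a1", "b1", "c3", "d1"),
    ("a1", "b5", "c1", "d2"),
    ("a1", "b2", "c5", "d2"),
    ("a2", "b2", "c2", "d2"),
    ("a2", "b2", "c2", "d4"),
    ("a2", "b2", "c4", "d2"),
    ("a2", "b3", "c2", "d3"),
    ("a3", "b4", "c6", "d2"),
]
MIN_SUPPORT = 2  # assumed threshold

# Each tuple becomes a set of (column, value) items.
TRANSACTIONS = [frozenset(zip(COLUMNS, row)) for row in ROWS]


def support(itemset):
    """Number of tuples that contain every item of the itemset."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def apriori(min_support=MIN_SUPPORT):
    items = {item for t in TRANSACTIONS for item in t}
    level = {frozenset([i]): support(frozenset([i])) for i in items}
    level = {s: n for s, n in level.items() if n >= min_support}
    frequent = {}
    k = 1
    while level:
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-sets into k-sets over distinct columns.
        candidates = {
            a | b
            for a, b in combinations(level, 2)
            if len(a | b) == k and len({col for col, _ in a | b}) == k
        }
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = {c: support(c) for c in candidates}
        level = {c: n for c, n in level.items() if n >= min_support}
    return frequent


frequent = apriori()
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("{" + ",".join(v for _, v in sorted(itemset)) + "}", n)
```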
  • 49. Top-Down Computation • The algorithm begins by computing the frequent attribute value combinations for the attribute set at the top of the tree, in this case ABCD.
  • 50. Top-Down Computation • On the same pass over the data, tdC counts value combinations for ABCD, ABC, AB and A, adding the frequent ones to the Iceberg-Cube
  • 51. Top-Down Computation • Data ordered by A,B,C,D, and the resulting Iceberg-Cube of ABCD
      A   B   C   D
      a1  b1  c3  d1
      a1  b2  c5  d2
      a1  b5  c1  d2
      a2  b2  c2  d2
      a2  b2  c2  d4
      a2  b2  c4  d2
      a2  b3  c2  d3
      a3  b4  c6  d2

      Combination  Count
      {a1}         3
      {a2,b2,c2}   2
      {a2,b2}      3
      {a2}         4
  • 52. Top-Down Computation • Data ordered by A,B,D, and the resulting Iceberg-Cube of ABD
      A   B   D
      a1  b1  d1
      a1  b2  d2
      a1  b5  d2
      a2  b2  d2
      a2  b2  d2
      a2  b2  d4
      a2  b3  d3
      a3  b4  d2

      Combination  Count
      {a1}         3
      {a2,b2,d2}   2
      {a2,b2}      3
      {a2}         4
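
The single-pass prefix counting that tdC performs for one attribute ordering can be sketched as follows. This covers only the counting step for a given ordering (ABCD or ABD), not the traversal of the full tree of orderings, and it assumes a minimum support of 2; the helper name `prefix_pass` is made up.

```python
from collections import Counter

COLUMNS = "ABCD"
ROWS = [
    ("a1", "b1", "c3", "d1"),
    ("a1", "b5", "c1", "d2"),
    ("a1", "b2", "c5", "d2"),
    ("a2", "b2", "c2", "d2"),
    ("a2", "b2", "c2", "d4"),
    ("a2", "b2", "c4", "d2"),
    ("a2", "b3", "c2", "d3"),
    ("a3", "b4", "c6", "d2"),
]
MIN_SUPPORT = 2  # assumed threshold


def prefix_pass(rows, order, min_support=MIN_SUPPORT):
    """One pass for an ordering such as 'ABCD': count all of its prefixes."""
    idx = [COLUMNS.index(c) for c in order]
    counts = Counter()
    # Sorting mirrors the "ordered by" step on the slides; the Counter used
    # here would give the same result on unsorted rows.
    for row in sorted(rows, key=lambda r: [r[i] for i in idx]):
        projected = tuple(row[i] for i in idx)
        for k in range(1, len(projected) + 1):
            counts[projected[:k]] += 1  # prefixes for A, AB, ABC, ...
    return {combo: n for combo, n in counts.items() if n >= min_support}


abcd = prefix_pass(ROWS, "ABCD")
abd = prefix_pass(ROWS, "ABD")
print(abcd)
print(abd)
```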
  • 53. Top-Down Computation • Final Iceberg-Cube
      Combination  Count
      {a1}         3
      {a2}         4
      {a2,b2}      3
      {a2,b2,c2}   2
      {a2,b2,d2}   2
      {a2,c2}      3
      {a1,d2}      2
      {a2,d2}      2
      {b2,c2}      2
      {b2}         4
      {b2,d2}      3
      {c2}         3
      {d2}         5
  • 54. Bit-Map Indexes • New indexing techniques: bitmap indexes, join indexes, array representations, compression, pre-computation of aggregations, etc. • A bit vector is possible for each value
      Cust ID  Name  Sex  Rating  Sex bits  Rating bits
      112      Joe   M    3       10        00100
      115      Ram   M    5       10        00001
      119      Sue   F    5       01        00001
      112      Woo   M    4       10        00010
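
A minimal sketch of how such an index answers a query: keep one bitmap per distinct value, with one bit per row, and combine bitmaps with bitwise AND. This is the column-wise view of the row-wise bit strings in the table above; Python integers stand in for the bit vectors, and `build_bitmap_index` is a hypothetical helper.

```python
# Rows from the slide: (Cust ID, Name, Sex, Rating).
CUSTOMERS = [
    (112, "Joe", "M", 3),
    (115, "Ram", "M", 5),
    (119, "Sue", "F", 5),
    (112, "Woo", "M", 4),
]


def build_bitmap_index(rows, column):
    """Map each distinct value of a column to a bitmap; bit i is set for row i."""
    index = {}
    for position, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << position)
    return index


sex_index = build_bitmap_index(CUSTOMERS, 2)
rating_index = build_bitmap_index(CUSTOMERS, 3)

# Query: male customers with rating 5, answered with one bitwise AND.
matches = sex_index["M"] & rating_index[5]
names = [row[1] for position, row in enumerate(CUSTOMERS) if (matches >> position) & 1]
print(names)
```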
  • 55. OLAP Vendor list • IBM • Infor • Oracle OLAP • SAS • SAP BW • Microsoft (SQL Server OLAP) • MicroStrategy