Dw-dm-part-01

Data Warehousing and Data Mining
©ArunPhadke 2015-16

Introduction Outline
• Define data mining
• Data mining vs. databases
• Basic data mining tasks
• Data mining development
• Data mining issues

Introduction
• Data is growing at a phenomenal rate
• Users expect more sophisticated
information
• How?
– Uncover hidden information: data mining

Data Mining Definition
• Finding hidden information in a database
• Fit data to a model
• Similar terms
– Exploratory data analysis
– Data driven discovery
– Deductive learning

Data Mining Algorithm
• Objective: Fit Data to a Model
– Descriptive
– Predictive
• Preference – Technique to choose the
best model
• Search – Technique to search the data
– “Query”

Database Processing vs. Data Mining
Database Processing
• Query
– Well defined
– SQL
• Data
– Operational data
• Output
– Precise
– Subset of database
Data Mining
• Query
– Poorly defined
– No precise query language
• Data
– Not operational data
• Output
– Fuzzy
– Not a subset of database

Query Examples
• Database
– Find all credit applicants with last name of Smith
– Identify customers who have purchased more than
$10,000 in the last month
– Find all customers who have purchased milk
• Data Mining
– Find all credit applicants who are poor credit risks.
(classification)
– Identify customers with similar buying habits.
(Clustering)
– Find all items which are frequently purchased with
milk. (association rules)

Data Mining Models and Tasks

Basic Data Mining Tasks
• Classification maps data into predefined
groups or classes
– Supervised learning
– Pattern recognition
– Prediction
• Regression is used to map a data item to a
real valued prediction variable.
• Clustering groups similar data together into
clusters.
– Unsupervised learning
– Segmentation
– Partitioning

Basic Data Mining Tasks (cont’d)
• Summarization maps data into subsets with
associated simple descriptions.
– Characterization
– Generalization
• Link Analysis uncovers relationships among
data.
– Affinity Analysis
– Association Rules
– Sequential Analysis determines sequential
patterns.

Ex: Time Series Analysis
• Example: Stock Market
• Predict future values
• Determine similar patterns over time
• Classify behavior

Data Mining vs. KDD
• Knowledge Discovery in Databases
(KDD): process of finding useful
information and patterns in data.
• Data Mining: Use of algorithms to extract
the information and patterns derived by
the KDD process.

KDD Process
• Selection: Obtain data from various
sources.
• Preprocessing: Cleanse data.
• Transformation: Convert to common
format. Transform to new format.
• Data Mining: Obtain desired results.
• Interpretation/Evaluation: Present results
to user in meaningful manner.
Modified from [FPSS96C]

KDD Process Ex: Web Log
• Selection:
– Select log data (dates and locations) to use
• Preprocessing:
– Remove identifying URLs
– Remove error logs
• Transformation:
– Sessionize (sort and group)
• Data Mining:
– Identify and count patterns
– Construct data structure
• Interpretation/Evaluation:
– Identify and display frequently accessed sequences.
• Potential User Applications:
– Cache prediction
– Personalization
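The web log steps above can be sketched in a few lines of Python. This is only an illustration: the log records, the `/error` page name and the 30-minute session timeout are assumptions made for the example, not part of the slides.

```python
from collections import Counter

# Hypothetical, hand-made log records: (user, timestamp_in_seconds, page).
LOG = [
    ("u1", 0, "/home"), ("u1", 30, "/products"), ("u1", 60, "/cart"),
    ("u2", 10, "/home"), ("u2", 50, "/products"),
    ("u1", 5000, "/home"), ("u1", 5040, "/products"),
    ("u2", 70, "/error"),
]

SESSION_GAP = 1800  # assumed inactivity timeout between sessions, in seconds


def preprocess(records):
    """Preprocessing: drop error entries."""
    return [r for r in records if r[2] != "/error"]


def sessionize(records, gap=SESSION_GAP):
    """Transformation: sort by user and time, then group into sessions."""
    sessions = []
    last_user, last_time = None, None
    for user, ts, page in sorted(records):
        if user != last_user or ts - last_time > gap:
            sessions.append([])  # start a new session
        sessions[-1].append(page)
        last_user, last_time = user, ts
    return sessions


def count_sequences(sessions, length=2):
    """Data mining: count contiguous page sequences of a given length."""
    counts = Counter()
    for pages in sessions:
        for i in range(len(pages) - length + 1):
            counts[tuple(pages[i:i + length])] += 1
    return counts


sessions = sessionize(preprocess(LOG))
frequent = count_sequences(sessions).most_common(1)
print(sessions)
print(frequent)
```

The most frequent sequence is the kind of result a cache-prediction or personalization application would consume.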

Data Mining Development
• Information Retrieval
– Similarity Measures
– Hierarchical Clustering
– IR Systems
– Imprecise Queries
– Textual Data
– Web Search Engines
• Statistics
– Bayes Theorem
– Regression Analysis
– EM Algorithm
– K-Means Clustering
– Time Series Analysis
• Machine Learning
– Neural Networks
– Decision Tree Algorithms
• Algorithms
– Algorithm Design Techniques
– Algorithm Analysis
– Data Structures
• Databases
– Relational Data Model
– SQL
– Association Rule Algorithms
– Data Warehousing
– Scalability Techniques

KDD Issues
• Human Interaction
• Overfitting
• Outliers
• Interpretation
• Visualization
• Large Datasets
• High Dimensionality

KDD Issues (cont’d)
• Multimedia Data
• Missing Data
• Irrelevant Data
• Noisy Data
• Changing Data
• Integration
• Application

Social Implications of DM
• Privacy
• Profiling
• Unauthorized use

Data Mining Metrics
• Usefulness
• Return on Investment (ROI)
• Accuracy
• Space/Time

Database Perspective on Data Mining
• Scalability
• Real World Data
• Updates
• Ease of Use

Visualization Techniques
• Graphical
• Geometric
• Icon-based
• Pixel-based
• Hierarchical
• Hybrid

Related Concepts Outline
• Database/OLTP Systems
• Fuzzy Sets and Logic
• Information Retrieval (Web Search Engines)
• Dimensional Modeling
• Data Warehousing
• OLAP/DSS
• Statistics
• Machine Learning
• Pattern Matching
Goal: Examine some areas which are related to
data mining.

DB & OLTP Systems
• Schema
– (ID,Name,Address,Salary,JobNo)
• Data Model
– ER
– Relational
• Transaction
• Query:
SELECT Name
FROM T
WHERE Salary > 100000
DM: Only imprecise queries

Fuzzy Sets and Logic
• Fuzzy Set: Set membership function is a real
valued function with output in the range [0,1].
• f(x): Probability x is in F.
• 1-f(x): Probability x is not in F.
• EX:
– T = {x | x is a person and x is tall}
– Let f(x) be the probability that x is tall
– Here f is the membership function
DM: Prediction and classification are fuzzy.
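A membership function for the "tall" example might look like the sketch below. The linear ramp and the 160 cm, 175 cm and 190 cm thresholds are illustrative assumptions; the slides do not give a specific function.

```python
def tall_membership(height_cm, low=160.0, high=190.0):
    """Membership value in [0, 1] for the fuzzy set 'tall'.

    The linear ramp and the 160/190 cm thresholds are illustrative
    assumptions, not values from the slides.
    """
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    return (height_cm - low) / (high - low)


def crisp_tall(height_cm, threshold=175.0):
    """Ordinary (crisp) set: membership is either 0 or 1."""
    return 1.0 if height_cm >= threshold else 0.0


# Columns: height, crisp membership, f(x), 1 - f(x)
for h in (150, 170, 175, 185, 195):
    f = tall_membership(h)
    print(h, crisp_tall(h), round(f, 2), round(1 - f, 2))
```

The crisp version jumps from 0 to 1 at the threshold, while the fuzzy version changes gradually.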

Fuzzy Sets

Classification/Prediction is Fuzzy
[Figure: accept/reject decision versus loan amount, shown for a simple (crisp) boundary and for a fuzzy boundary]

Data Warehouse
Data-warehouse-03.pptx

Data Cube Technology
• Data Cube Computation: Preliminary
Concepts
• Data Cube Computation Methods
• Processing Advanced Queries by Exploring
Data Cube Technology
• Multidimensional Data Analysis in Cube
Space

Data Cube : Lattice of Cuboids
• 0-D (apex) cuboid: all
• 1-D cuboids: time; item; location; supplier
• 2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
• 3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
• 4-D (base) cuboid: time,item,location,supplier

• D0 cube: all, zero dimensions, 1 cube
• D1 cubes: one dimension, 4 cubes
– Time
– Item
– Location
– Supplier

• D2 cubes: two dimensions, 6 cubes
– Time-Item
– Time-Location
– Time-Supplier
– Item-Location
– Item-Supplier
– Location-Supplier

• D3 cubes: three dimensions, 4 cubes
– Time-Item-Location
– Time-Item-Supplier
– Time-Location-Supplier
– Item-Location-Supplier
• D4 cube: four dimensions, 1 cube
• No. of cubes = 2^(no. of dimensions) = 2^4 = 16
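The lattice can be enumerated mechanically, which also confirms the counts 1, 4, 6, 4, 1. This is a minimal sketch using the four dimension names from the slides; the `lattice` helper is a name introduced here for illustration.

```python
from itertools import combinations

DIMENSIONS = ("time", "item", "location", "supplier")


def lattice(dimensions):
    """Return {k: list of k-D cuboids}, each cuboid a tuple of dimension names."""
    return {k: list(combinations(dimensions, k))
            for k in range(len(dimensions) + 1)}


cuboids = lattice(DIMENSIONS)
for k, group in cuboids.items():
    names = [",".join(c) if c else "all" for c in group]
    print(f"{k}-D: {len(group)} cuboid(s): {names}")

total = sum(len(group) for group in cuboids.values())
print("total:", total)
```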

Size of Cubes
• Four dimensions
– Time – yyyymmdd with 10 years ( 10x365 )
– Item – 1000 items
– Location – 20 locations
– Supplier – 500 suppliers
• Maximum size = 10*365*1000*20*500 =
36,500,000,000

How to improve performance
• Select the right cube
– Time-Item: 10*365*1000 = 3,650,000
– Time-Location: 10*365*20 = 73,000
– Time-Supplier: 10*365*500 = 1,825,000
– Item-Location: 1000*20 = 20,000
– Item-Supplier: 1000*500 = 500,000
– Location-Supplier: 20*500 = 10,000
– Time-Item-Location: 10*365*1000*20 = 73,000,000
– Time-Item-Supplier: 10*365*1000*500 = 1,825,000,000
– Time-Location-Supplier: 10*365*20*500 = 36,500,000
– Item-Location-Supplier: 1000*20*500 = 10,000,000
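The sizes listed above are products of the dimension cardinalities, so they can be reproduced with a short script. The cardinalities come from the "Size of Cubes" slide; the `cuboid_sizes` helper is a hypothetical name used only for this sketch.

```python
from itertools import combinations
from math import prod

# Cardinalities taken from the "Size of Cubes" slide.
CARDINALITY = {
    "time": 10 * 365,
    "item": 1000,
    "location": 20,
    "supplier": 500,
}


def cuboid_sizes(cardinality):
    """Map each non-empty dimension subset to its maximum cell count."""
    dims = list(cardinality)
    sizes = {}
    for k in range(1, len(dims) + 1):
        for cuboid in combinations(dims, k):
            sizes[cuboid] = prod(cardinality[d] for d in cuboid)
    return sizes


sizes = cuboid_sizes(CARDINALITY)
# Smallest cuboids first: these are the cheapest ones to answer a query from.
for cuboid, size in sorted(sizes.items(), key=lambda kv: kv[1]):
    print(f"{'-'.join(cuboid):35s} {size:>15,}")
```

These are upper bounds: a real cuboid holds only the combinations that actually occur in the data.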

Materialization – Pre-computing
• On-line analytical processing may need to
access different cuboids for different
queries
• Compute some cuboids in advance
– Pre-computation leads to fast response times
– Most products support pre-computation to some
degree

Materialization – Pre-computing
• Storage space may explode...
– If there are no hierarchies, the total number of
cuboids for an n-dimensional cube is 2^n
• But....
– Many dimensions may have hierarchies, for example
time
• day < week < month < quarter < year
– Explosion of cuboids
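The growth can be made concrete with the usual counting argument: a dimension with L hierarchy levels contributes L + 1 choices (one of its levels, or "all"), so the number of cuboids is the product of (L_i + 1) over all dimensions. The sketch below assumes that only time has a hierarchy, with the five levels listed above, and that the other three dimensions have a single level each.

```python
from math import prod


def cuboid_count(levels_per_dimension):
    """Number of cuboids when dimension i has L_i hierarchy levels.

    Each dimension contributes L_i + 1 choices: one of its levels,
    or being aggregated away entirely ("all").
    """
    return prod(levels + 1 for levels in levels_per_dimension)


# Four dimensions without hierarchies: 2 * 2 * 2 * 2
no_hierarchies = cuboid_count([1, 1, 1, 1])
# Same cube, but time has 5 levels (day, week, month, quarter, year)
with_time_hierarchy = cuboid_count([5, 1, 1, 1])
print(no_hierarchies, with_time_hierarchy)
```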

Efficient computation of Data Cubes
• Smallest-child: computing a cuboid from the smallest,
previously computed cuboid
• Cache-results: caching results of a cuboid from which other
cuboids are computed to reduce disk I/Os
• Amortize-scans: computing as many cuboids as possible at
the same time to amortize disk reads
• Share-sorts: sharing sorting costs across multiple cuboids
when a sort-based method is used
• Share-partitions: sharing the partitioning cost across multiple
cuboids when hash-based algorithms are used

Multi-way Array Aggregation
• Array-based “bottom-up”
algorithm
• Using multi-dimensional chunks
• No direct tuple comparisons
• Simultaneous aggregation on
multiple dimensions
• Intermediate aggregate values
are re-used for computing
ancestor cuboids
[Figure: lattice of cuboids for dimensions A, B, C: all; A, B, C; AB, AC, BC; ABC]

Multi-way Array Aggregation for Cube
Computation (MOLAP)
• Partition arrays into chunks (a small subcube which fits
in memory).
• Compressed sparse array addressing: (chunk_id, offset)
• Compute aggregates in “multiway” by visiting cube cells
in the order which minimizes the # of times to visit each
cell, and reduces memory access and storage cost.
[Figure: 3-D array over dimensions A (a0 to a3), B (b0 to b3) and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64]

Multi-way Array Aggregation for Cube
Computation (MOLAP) (cont’d)
• Method: the planes should be sorted and
computed according to their size in ascending
order
– Idea: keep the smallest plane in the main memory,
fetch and compute only one chunk at a time for the
largest plane
• Limitation of the method: it works well only
for a small number of dimensions
– If there are a large number of dimensions, “top-down”
computation and iceberg cube computation methods
can be explored
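The core idea of multiway aggregation, reading each cell once and feeding several aggregates at the same time, can be sketched as follows. The 4x4x4 array, the 2x2x2 chunk size and the cell values are toy assumptions; the sketch does not model the memory-minimizing chunk ordering or the compressed sparse addressing.

```python
from itertools import product

SIZE, CHUNK = 4, 2  # assumed toy sizes: a 4x4x4 array split into 2x2x2 chunks

# Toy base cuboid ABC: the cell value is an arbitrary function of its indexes.
cube = [[[a + 2 * b + 3 * c for c in range(SIZE)]
         for b in range(SIZE)] for a in range(SIZE)]


def zeros():
    """A SIZE x SIZE plane of zeros for one 2-D cuboid."""
    return [[0] * SIZE for _ in range(SIZE)]


ab, ac, bc = zeros(), zeros(), zeros()  # the three 2-D cuboids
visits = 0

# Visit the chunks one at a time; only one chunk needs to be "in memory".
starts = range(0, SIZE, CHUNK)
for a0, b0, c0 in product(starts, starts, starts):
    for a, b, c in product(range(a0, a0 + CHUNK),
                           range(b0, b0 + CHUNK),
                           range(c0, c0 + CHUNK)):
        value = cube[a][b][c]
        # One read of the cell feeds all three 2-D aggregates at once.
        ab[a][b] += value
        ac[a][c] += value
        bc[b][c] += value
        visits += 1

# Ancestor cuboids are derived from the 2-D results, not from the base cells.
a_total = [sum(row) for row in ab]
grand_total = sum(a_total)
print(visits, a_total, grand_total)
```

Each of the 64 cells is visited exactly once, and the 1-D cuboid A and the apex are computed from the AB plane rather than from the base array.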

Iceberg Cube
• An Iceberg-Cube contains only those cells
of the data cube that meet an aggregate
condition
• It is called an Iceberg-Cube because it
contains only some of the cells of the full
cube, like the tip of an iceberg

Iceberg Cube
• The purpose of the Iceberg-Cube is to
identify and compute only those values
that will most likely be required for
decision support queries

Iceberg Cube example
Part StoreLocation Customer
P1 Vancouver Vance
P1 Calgary Bob
P1 Toronto Richard
P2 Toronto Allison
P2 Toronto Allison
P2 Toronto Tom
P2 Ottawa Allison
P3 Montreal Anne
Combination Count
{P1, ANY, ANY} 3
{P2, ANY, ANY} 4
{ANY, Toronto, ANY} 4
{ANY, ANY, Allison} 3
{P2, Toronto, ANY} 3
{P2, ANY, Allison} 3
{ANY, Toronto, Allison} 2
{P2, Toronto, Allison} 2
• Minimum support is 25% of the 8 tuples, i.e. 2 tuples;
the combinations that meet it form the Iceberg-Cube
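The table of combinations can be reproduced with a brute-force sketch: generalize every tuple in all possible ways, count, and keep the cells that reach minimum support. This is the naive approach that the algorithms on the following slides try to avoid; the `iceberg_cube` helper is a hypothetical name, and the all-ANY apex cell is left out to match the table.

```python
from collections import Counter
from itertools import product

ROWS = [
    ("P1", "Vancouver", "Vance"),
    ("P1", "Calgary", "Bob"),
    ("P1", "Toronto", "Richard"),
    ("P2", "Toronto", "Allison"),
    ("P2", "Toronto", "Allison"),
    ("P2", "Toronto", "Tom"),
    ("P2", "Ottawa", "Allison"),
    ("P3", "Montreal", "Anne"),
]
ANY = "ANY"


def iceberg_cube(rows, min_support):
    """Count every generalization of every row, keep those >= min_support."""
    counts = Counter()
    for row in rows:
        # Each attribute either keeps its value or is generalized to ANY.
        for cell in product(*[(value, ANY) for value in row]):
            counts[cell] += 1
    apex = (ANY,) * len(rows[0])
    return {cell: n for cell, n in counts.items()
            if n >= min_support and cell != apex}


min_support = len(ROWS) * 25 // 100  # 25% of 8 tuples = 2
cube = iceberg_cube(ROWS, min_support)
for cell, n in sorted(cube.items(), key=lambda kv: (-kv[1], kv[0])):
    print(cell, n)
```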

APRIORI algorithm
• The APRIORI algorithm uses candidate
combinations to avoid counting every
possible combination of attribute values.
• For a combination of attribute values to
satisfy the minimum support requirement,
all subsets of that combination must also
satisfy minimum support.

APRIORI algorithm
• The candidate combinations are found by
combining only the frequent attribute
value combinations that are already
known
• All other possible combinations are
automatically eliminated because not all
of their subsets would satisfy the
minimum support requirement

APRIORI
A B C D
a1 b1 c3 d1
a1 b5 c1 d2
a1 b2 c5 d2
a2 b2 c2 d2
a2 b2 c2 d4
a2 b2 c4 d2
a2 b3 c2 d3
a3 b4 c6 d2
Combination Count
{a1} 3
{a2} 4
{b2} 4
{c2} 3
{d2} 5
• On the first pass over the data, the
APRIORI algorithm determines that the
single values shown in the table are frequent

APRIORI
Combination Count
{a1} 3
{a2} 4
{b2} 4
{c2} 3
{d2} 5
• On the second pass, the APRIORI
algorithm combines these frequent single
values into candidate pairs, counts them,
and keeps the pairs that meet minimum support
Candidate combination
{a1,b2}
{a1,c2}
{a1,d2}
{a2, b2}
{a2,c2}
{a2,d2}
{b2,c2}
{b2,d2}
{c2,d2}
Combination Count
{a1,d2} 2
{a2,b2} 3
{a2,c2} 3
{b2,c2} 2
{a2,d2} 2
{b2,d2} 3
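The two passes shown above, and the passes that follow, can be sketched as a level-wise loop. This is a minimal illustration on the slide's table with a minimum support of 2 tuples, not a tuned implementation; identifying a value's column by its first letter is an assumption that only works for this toy data.

```python
from itertools import combinations

ROWS = [
    ("a1", "b1", "c3", "d1"),
    ("a1", "b5", "c1", "d2"),
    ("a1", "b2", "c5", "d2"),
    ("a2", "b2", "c2", "d2"),
    ("a2", "b2", "c2", "d4"),
    ("a2", "b2", "c4", "d2"),
    ("a2", "b3", "c2", "d3"),
    ("a3", "b4", "c6", "d2"),
]
MIN_SUPPORT = 2


def count(rows, candidates):
    """One pass over the data: count the rows containing each candidate."""
    row_sets = [set(r) for r in rows]
    return {c: sum(1 for r in row_sets if set(c) <= r) for c in candidates}


def apriori(rows, min_support):
    """Level-wise search; returns {combination: count} for frequent ones."""
    singles = sorted({(v,) for row in rows for v in row})
    frequent = {c: n for c, n in count(rows, singles).items()
                if n >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        values = sorted({v for c in frequent for v in c})
        candidates = [
            c for c in combinations(values, k)
            # at most one value per attribute (first letter names the column)
            if len({v[0] for v in c}) == k
            # prune: every (k-1)-subset must already be frequent
            and all(s in frequent for s in combinations(c, k - 1))
        ]
        frequent = {c: n for c, n in count(rows, candidates).items()
                    if n >= min_support}
        result.update(frequent)
        k += 1
    return result


result = apriori(ROWS, MIN_SUPPORT)
for combo, n in sorted(result.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(combo, n)
```

On the third pass only {a2,b2,c2} and {a2,b2,d2} survive pruning, and no four-value candidate remains, so the loop stops.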

Top-Down Computation
• The algorithm begins by computing the
frequent attribute value combinations for
the attribute set at the top of the tree, in
this case ABCD.

• On the same pass over the data, the top-down
computation algorithm (tdC) counts value combinations
for ABCD, ABC, AB and A, adding the frequent ones
to the Iceberg-Cube

A B C D
a1 b1 c3 d1
a1 b2 c5 d2
a1 b5 c1 d2
a2 b2 c2 d2
a2 b2 c2 d4
a2 b2 c4 d2
a2 b3 c2 d3
a3 b4 c6 d2
Combination Count
{a1} 3
{a2,b2,c2} 2
{a2,b2} 3
{a2} 4
Ordered by A,B,C,D
Iceberg-Cube of ABCD

A B D
a1 b1 d1
a1 b2 d2
a1 b5 d2
a2 b2 d2
a2 b2 d2
a2 b2 d4
a2 b3 d3
a3 b4 d2
Combination Count
{a1} 3
{a2,b2,d2} 2
{a2,b2} 3
{a2} 4
Ordered by A,B,D
Iceberg-Cube of ABD
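The per-ordering pass can be sketched as prefix counting: project each row onto the chosen attribute order and count every prefix of it. This sketch uses a hash-based counter in place of the sorted scan that the tables above suggest, and the `count_prefixes` helper is a hypothetical name, so treat it as an illustration of the idea only.

```python
from collections import Counter

ROWS = [
    ("a1", "b1", "c3", "d1"),
    ("a1", "b5", "c1", "d2"),
    ("a1", "b2", "c5", "d2"),
    ("a2", "b2", "c2", "d2"),
    ("a2", "b2", "c2", "d4"),
    ("a2", "b2", "c4", "d2"),
    ("a2", "b3", "c2", "d3"),
    ("a3", "b4", "c6", "d2"),
]
COLUMNS = "ABCD"


def count_prefixes(rows, order, min_support):
    """One pass for one attribute ordering, e.g. "ABD".

    Projects every row onto `order` and counts each prefix
    (A, AB, ABD), keeping the combinations that meet min_support.
    """
    idx = [COLUMNS.index(col) for col in order]
    counts = Counter()
    for row in rows:
        projected = tuple(row[i] for i in idx)
        for k in range(1, len(projected) + 1):
            counts[projected[:k]] += 1
    return {c: n for c, n in counts.items() if n >= min_support}


abcd = count_prefixes(ROWS, "ABCD", 2)
abd = count_prefixes(ROWS, "ABD", 2)
print(abcd)
print(abd)
```

Further orderings (for example ACD, BCD, CD) would be needed to cover the remaining attribute subsets.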

Final Iceberg-Cube
Combination Count
{a1} 3
{a2} 4
{a2,b2} 3
{a2,b2,c2} 2
{a2,b2,d2} 2
{a2,c2} 3
{a1,d2} 2
{a2,d2} 2
{b2,c2} 2
{b2} 4
{b2,d2} 3
{c2} 3
{d2} 5

Bit-Map Indexes
• New indexing techniques: bitmap indexes, join indexes,
array representations, compression,
pre-computation of aggregations, etc.
Cust ID  Name  Sex  Rating  Sex bits (M F)  Rating bits (1 to 5)
112      Joe   M    3       10              00100
115      Ram   M    5       10              00001
119      Sue   F    5       01              00001
112      Woo   M    4       10              00010
• One bit vector is possible for each value
OLAP Vendor list
• IBM
• Infor
• Oracle OLAP
• SAS
• SAP BW
• Microsoft (SQL Server OLAP)
• MicroStrategy
