1. Data Warehouse and Data Mining
By Dr. Anupam Ghosh, Date: 17.01.23
Email: anupam.ghosh@rediffmail.com
https://vidwan.inflibnet.ac.in/profile/319457
Academic Profile: https://www.nsec.ac.in/fps/faculty.php?id=138
Research Profile: https://www.researchgate.net/profile/Anupam-Ghosh-5
Professional Profile: https://www.linkedin.com/in/anupam-ghosh-1504273b/?originalSubdomain=in
9. Data Mining: A KDD Process
– Data mining: the core of the knowledge discovery process.
[Figure: the KDD process — Databases → Data Cleaning / Data Integration → Data Warehouse → Data Selection → Task-relevant Data → Data Preprocessing → Data Mining → Pattern Evaluation]
10. Data Mining and Business Intelligence
[Figure: business intelligence pyramid — increasing potential to support business decisions, bottom to top:
Data Sources (paper, files, information providers, database systems, OLTP);
Data Warehouses / Data Marts (OLAP, MDA) — DBA;
Data Exploration (statistical analysis, querying and reporting) — Data Analyst;
Data Mining (information discovery) — Data Analyst;
Data Presentation (visualization techniques) — Business Analyst;
Making Decisions — End User]
11. Data Mining: Confluence of Multiple Disciplines
[Figure: data mining draws on database technology, statistics, machine learning, information science, visualization, and other disciplines]
13. Clustering
• Clustering: Intuitively, finding clusters of points in the given data such that
similar points lie in the same cluster
• Can be formalized using distance metrics in several ways
– Group points into k sets (for a given k) such that the average distance of
points from the centroid of their assigned group is minimized
• Centroid: point defined by taking average of coordinates in each
dimension.
– Another metric: minimize average distance between every pair of points
in a cluster
• Has been studied extensively in statistics, but on small data sets
– Data mining systems aim at clustering techniques that can handle very
large data sets
– E.g., the BIRCH clustering algorithm (discussed shortly)
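The first formalization above (group points into k sets so that the average distance to the assigned centroid is minimized) is what the classic k-means algorithm approximates. A minimal sketch, assuming points are plain Python tuples (BIRCH itself is more involved and builds a CF-tree to handle very large data sets):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: group points by nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments stable: converged
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated groups of 2-D points:
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, k=2)
```

On well-separated data like this, the algorithm recovers the two groups regardless of which points the random initialization picks; on harder data, k-means is sensitive to initialization and is usually restarted several times.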
16. Classification
• Data mining is the process of semi-automatically analyzing large databases to
find useful patterns
• Prediction based on past history
• Predict if a credit card applicant poses a good credit risk, based on some
attributes (income, job type, age, ..) and past history
• Predict if a pattern of phone calling card usage is likely to be fraudulent
• Some examples of prediction mechanisms:
• Classification
• Given a new item whose class is unknown, predict to which class it belongs
• Regression formulae
• Given a set of mappings for an unknown function, predict the function result for a new parameter value
17. Linear Regression
❑ Linear regression and modelling problems are presented along with their solutions.
❑ If the plot of n pairs of data (x, y) for an experiment appears to indicate a "linear relationship" between y and x, then the method of least squares may be used to write a linear relationship between x and y.
❑ Linear regression is a linear model, i.e. a model that assumes a linear relationship between the input variables (x) and the single output variable (y). More specifically, y can be calculated from a linear combination of the input variables (x).
18. ▶ The least squares regression line for a set of n data points is given by the equation of a line in slope-intercept form:
▶ y = ax + b
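The slide leaves the coefficients implicit. For n points (xi, yi), the least-squares slope and intercept are a = (n Σxy − Σx Σy) / (n Σx² − (Σx)²) and b = (Σy − a Σx) / n. A minimal sketch of this computation, checked against the data of Problem 1 below:

```python
def least_squares_line(points):
    """Return slope a and intercept b of the least-squares line y = a*x + b."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Problem 1 data: {(-2, -1), (1, 1), (3, 2)}
a, b = least_squares_line([(-2, -1), (1, 1), (3, 2)])
# a = 23/38 ≈ 0.605 and b = 5/19 ≈ 0.263, i.e. y ≈ 0.605x + 0.263
```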
19. Problem 1
Consider the following set of points: {(-2, -1), (1, 1), (3, 2)}
a) Find the least square regression line for the given data points.
b) Plot the given points and the regression line in the same rectangular system of axes.
21. Problem 2
a) Find the least square regression line for the following set of data:
{(-1, 0), (0, 2), (1, 4), (2, 5)}
b) Plot the given points and the regression line in the same rectangular system of axes.
23. Problem 3
▶ The values of x and their corresponding values of y are shown in the table below:
x: 0  1  2  3  4
y: 2  3  5  4  6
a) Find the least square regression line y = ax + b.
b) Estimate the value of y when x = 10.
25. Problem 4
▶ The sales of a company (in million dollars) for each year are shown in the table below:
x (year):  2005  2006  2007  2008  2009
y (sales):   12    19    29    37    45
▶ a) Find the least square regression line y = ax + b.
▶ b) Use the least squares regression line as a model to estimate the sales of the company in 2012.
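A sketch of how Problem 4 can be checked numerically with the standard least-squares formulas. Re-indexing years as t = year − 2005 keeps the sums small and makes the intercept the fitted 2005 sales level:

```python
def least_squares_line(points):
    """Slope a and intercept b of the least-squares line y = a*x + b."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

years = [2005, 2006, 2007, 2008, 2009]
sales = [12, 19, 29, 37, 45]

# Work with t = year - 2005 so the numbers stay small.
data = [(year - 2005, y) for year, y in zip(years, sales)]
a, b = least_squares_line(data)          # a = 8.4, b = 11.6

estimate_2012 = a * (2012 - 2005) + b    # 8.4 * 7 + 11.6 = 70.4 million dollars
```

Fitting against the raw years instead of the offset t gives the same slope but a very large negative intercept; the offset is purely a numerical convenience.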
28. Which Attribute is "best"?
We would like to select the attribute that is most useful for classifying examples.
• Information gain measures how well a given attribute separates the training examples according to their
target classification.
• ID3 uses this information gain measure to select among the candidate attributes at each step while
growing the tree.
• In order to define information gain precisely, we use a measure commonly used in information theory,
called entropy
• Entropy characterizes the (im)purity of an arbitrary collection of examples.
29. Information Theory –ID3 (Iterative Dichotomiser 3)
❖ The ID3 algorithm was invented by Ross Quinlan and uses information gain as its attribute selection measure
❖ This measure is based on pioneering work by Claude Shannon on information theory, which studied the
value or “information content” of messages
❖ Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is
chosen as the splitting attribute for node N
❖ This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects
the least randomness or “impurity” in these partitions
❖ The expected information needed to classify a tuple in D is given by
Info(D) = -Σi=1..m pi log2(pi)
Let D, the data partition, be a training set of class-labeled tuples. Suppose the class label attribute has m distinct values defining m distinct classes, Ci (i = 1 to m); pi = si/s, where s = total no. of samples and si = no. of samples in class Ci. Info(D) is also known as the entropy of D.
30. ID3--Continued
Suppose we were to partition the tuples in D on some attribute A having v distinct values, {a1, a2, ..., av}, as observed from the training data. If A is discrete-valued, these values correspond directly to the v outcomes of a test on A. Attribute A can be used to split D into v partitions or subsets, {D1, D2, ..., Dv}, where Dj contains those tuples in D that have outcome aj of A.
InfoA(D) = Σj=1..v (|Dj|/|D|) × Info(Dj)
Here, |Dj|/|D| acts as the weight of the j-th partition; InfoA(D) is the expected information required to classify a tuple from D based on the partitioning by A.
Info(Dj) = -Σi=1..m pij log2(pij), where pij = sij/|Dj| and sij = no. of samples in Dj belonging to class Ci.
31. ID3--Continued
Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A):
Gain(A) = Info(D) - InfoA(D)
In other words, Gain(A) tells us how much would be gained by branching on A. It is the expected reduction in the information requirement caused by knowing the value of A. The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N.
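The three quantities on these slides can be sketched directly from their definitions. The data set below is a hypothetical toy example (class labels plus one discrete attribute), not from the slides; Info, InfoA, and Gain follow the formulas above:

```python
from collections import Counter
from math import log2

def info(labels):
    """Entropy Info(D) = -sum(p_i * log2(p_i)) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_attr(rows):
    """InfoA(D): weighted entropy after partitioning (value, label) rows on A."""
    n = len(rows)
    partitions = {}
    for value, label in rows:
        partitions.setdefault(value, []).append(label)
    return sum(len(d) / n * info(d) for d in partitions.values())

def gain(rows):
    """Gain(A) = Info(D) - InfoA(D): expected reduction from splitting on A."""
    labels = [label for _, label in rows]
    return info(labels) - info_attr(rows)

# Hypothetical toy data: (attribute value, class label).
rows = [("sunny", "no"), ("sunny", "no"), ("rain", "yes"),
        ("rain", "yes"), ("overcast", "yes"), ("overcast", "yes")]

# D has 4 "yes" and 2 "no": Info(D) = -(4/6)log2(4/6) - (2/6)log2(2/6) ≈ 0.918.
# Every partition here is pure, so InfoA(D) = 0 and Gain(A) = Info(D).
```

ID3 would compute this gain for every candidate attribute at a node and split on the one with the largest value.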