Data Warehouse and Data Mining
By Dr. Anupam Ghosh
Date: 17.01.23
Email: anupam.ghosh@rediffmail.com
https://vidwan.inflibnet.ac.in/profile/319457
Academic Profile: https://www.nsec.ac.in/fps/faculty.php?id=138
Research Profile: https://www.researchgate.net/profile/Anupam-Ghosh-5
Professional Profile: https://www.linkedin.com/in/anupam-ghosh-1504273b/?originalSubdomain=in
Data Mining: A KDD Process
– Data mining: the core of the knowledge discovery process.
[Figure: KDD process diagram. Labels: Databases, Data Cleaning, Data Integration, Data Warehouse, Data Selection, Task-relevant Data, Data Preprocessing, Data Mining, Pattern Evaluation]
Data Mining and Business Intelligence
[Figure: layered diagram; the potential to support business decisions increases toward the top. Layers, top to bottom:]
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles shown beside the layers: End User, Business Analyst, Data Analyst, DBA
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the centre, drawing on Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]
Clustering
• Clustering: Intuitively, finding clusters of points in the given data such that
similar points lie in the same cluster
• Can be formalized using distance metrics in several ways
– Group points into k sets (for a given k) such that the average distance of
points from the centroid of their assigned group is minimized
• Centroid: point defined by taking average of coordinates in each
dimension.
– Another metric: minimize average distance between every pair of points
in a cluster
• Has been studied extensively in statistics, but on small data sets
– Data mining systems aim at clustering techniques that can handle very
large data sets
– E.g., the Birch clustering algorithm (more shortly)
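The two distance-based criteria above can be sketched for a single cluster as follows. This is a minimal illustration, assuming Euclidean distance and a hypothetical four-point cluster; the function names are not from the slides, and the slides do not prescribe a particular distance metric.

```python
import math

def centroid(points):
    """Average of the coordinates in each dimension."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def avg_distance_to_centroid(points):
    """First criterion: average distance of points from their centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)

def avg_pairwise_distance(points):
    """Second criterion: average distance between every pair of points."""
    pairs = [(p, q) for i, p in enumerate(points) for q in points[i + 1:]]
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

# Hypothetical cluster: the corners of a 2 x 2 square
cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
print(centroid(cluster))
print(avg_distance_to_centroid(cluster))
print(avg_pairwise_distance(cluster))
```

A clustering algorithm would evaluate one of these quantities over all k clusters and search for the assignment that minimizes it.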
Classification
Supervised Classification
Training samples are labeled
Classification
• Data mining is the process of semi-automatically analyzing large databases to
find useful patterns
• Prediction based on past history
• Predict if a credit card applicant poses a good credit risk, based on some
attributes (income, job type, age, ...) and past history
• Predict if a pattern of phone calling card usage is likely to be fraudulent
• Some examples of prediction mechanisms:
• Classification
• Given a new item whose class is unknown, predict to which class it belongs
• Regression formulae
• Given a set of mappings for an unknown function, predict the function result for a new parameter
value
Linear Regression
❑ Linear regression and modelling problems
are presented along with their solutions.
❑ If the plot of n pairs of data (x, y) for an
experiment appears to indicate a "linear
relationship" between y and x, then the
method of least squares may be used to
write a linear relationship between x and y.
❑ Linear regression is a linear model, i.e. a
model that assumes a linear relationship
between the input variables (x) and the
single output variable (y). More
specifically, y can be calculated from
a linear combination of the input variables
(x).
▶ The least squares regression line for a set of n data points is given by the equation of a line in
slope-intercept form:
▶ y = a x + b
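The slide gives only the form of the line. For reference, the standard closed-form least squares coefficients for n points (x_i, y_i) are sketched below; they are the textbook result and not a formula recovered from the slides, and they assume the x values are not all equal.

```latex
a = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}
         {n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2},
\qquad
b = \frac{\sum_{i=1}^{n} y_i - a \sum_{i=1}^{n} x_i}{n}
```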
Troubleshooting -- Problem 1
Consider the following set of points: {(-2, -1), (1, 1), (3, 2)}
a) Find the least squares regression line for the given data points.
b) Plot the given points and the regression line in the same rectangular system of axes.
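Part (a) can be checked numerically with a short sketch. The function name `least_squares_line` is an illustrative choice, and the computation uses the standard closed-form least squares coefficients rather than any solution shown on the slides.

```python
def least_squares_line(points):
    """Return (a, b) of the least squares line y = a*x + b."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    # Closed-form slope and intercept (assumes the x values are not all equal)
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    b = (sum_y - a * sum_x) / n
    return a, b

a, b = least_squares_line([(-2, -1), (1, 1), (3, 2)])
print(f"y = {a:.4f} x + {b:.4f}")  # y = 0.6053 x + 0.2632
```

The exact values are a = 23/38 and b = 5/19.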
Problem 2
a) Find the least squares regression line for the following set of data:
{(-1, 0), (0, 2), (1, 4), (2, 5)}
b) Plot the given points and the regression line in the same rectangular system of axes.
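For Problem 2 the coefficients come out as simple fractions, so a sketch with exact rational arithmetic may be clearer than floating point. The variable names are illustrative, and the formulas are the standard closed-form least squares coefficients.

```python
from fractions import Fraction

points = [(-1, 0), (0, 2), (1, 4), (2, 5)]
n = len(points)
sum_x = sum(x for x, _ in points)       # 2
sum_y = sum(y for _, y in points)       # 11
sum_xy = sum(x * y for x, y in points)  # 14
sum_xx = sum(x * x for x, _ in points)  # 6

# Exact slope and intercept as fractions
a = Fraction(n * sum_xy - sum_x * sum_y, n * sum_xx - sum_x ** 2)
b = (sum_y - a * sum_x) / n
print(a, b)  # 17/10 19/10
```

So the fitted line is y = 1.7 x + 1.9.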
Problem 3
▶ The values of x and their corresponding values of y are shown in the table below
x   0   1   2   3   4
y   2   3   5   4   6
a) Find the least squares regression line y = a x + b.
b) Estimate the value of y when x = 10.
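A sketch of both parts, using the equivalent mean-deviation form of the least squares coefficients (slope = covariance sum divided by the sum of squared x deviations). The helper name `predict` is hypothetical.

```python
xs = [0, 1, 2, 3, 4]
ys = [2, 3, 5, 4, 6]

n = len(xs)
mean_x = sum(xs) / n  # 2.0
mean_y = sum(ys) / n  # 4.0

# Sum of cross deviations and sum of squared x deviations
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))  # 9.0
s_xx = sum((x - mean_x) ** 2 for x in xs)                        # 10.0

a = s_xy / s_xx          # slope
b = mean_y - a * mean_x  # intercept

def predict(x):
    """Evaluate the fitted line y = a*x + b."""
    return a * x + b

print(f"y = {a:.1f} x + {b:.1f}")    # y = 0.9 x + 2.2
print(f"y(10) = {predict(10):.1f}")  # y(10) = 11.2
```

Note that x = 10 lies well outside the observed range 0 to 4, so the estimate is an extrapolation.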
Problem 4
▶ The sales of a company (in million dollars) for each year are shown in the table below.
x (year)    2005   2006   2007   2008   2009
y (sales)   12     19     29     37     45
▶ a) Find the least squares regression line y = a x + b.
▶ b) Use the least squares regression line as a model to estimate the sales of the company in 2012.
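One way to work Problem 4 is to shift the years so that 2005 becomes t = 0, which keeps the arithmetic small; this shift is a choice made for this sketch and is not stated on the slide. The helper name `estimate_sales` is hypothetical.

```python
years = [2005, 2006, 2007, 2008, 2009]
sales = [12, 19, 29, 37, 45]  # million dollars, from the table

# Shift the years so that t = 0 corresponds to 2005
t = [year - 2005 for year in years]

n = len(t)
mean_t = sum(t) / n
mean_y = sum(sales) / n

# Slope and intercept of the line: sales = a * t + b
a = (sum((ti - mean_t) * (yi - mean_y) for ti, yi in zip(t, sales))
     / sum((ti - mean_t) ** 2 for ti in t))
b = mean_y - a * mean_t

def estimate_sales(year):
    """Estimated sales (million dollars) for a calendar year."""
    return a * (year - 2005) + b

print(f"sales = {a:.1f} t + {b:.1f}, with t = year - 2005")
print(f"estimate for 2012: {estimate_sales(2012):.1f}")
```

With the shifted variable the line is sales = 8.4 t + 11.6, giving an estimate of about 70.4 million dollars for 2012 (t = 7). In terms of the unshifted year x, the slope is unchanged and only the intercept differs.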
Decision Theory
Supervised Learning
Which Attribute is "best"?
• We would like to select the attribute that is most useful for classifying examples.
• Information gain measures how well a given attribute separates the training examples according to their
target classification.
• ID3 uses this information gain measure to select among the candidate attributes at each step while
growing the tree.
• In order to define information gain precisely, we use a measure commonly used in information theory,
called entropy.
• Entropy characterizes the (im)purity of an arbitrary collection of examples.
Information Theory – ID3 (Iterative Dichotomiser 3)
❖ The ID3 algorithm was invented by Ross Quinlan and uses information gain as its attribute selection measure
❖ This measure is based on pioneering work by Claude Shannon on information theory, which studied the
value or "information content" of messages
❖ Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is
chosen as the splitting attribute for node N
❖ This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects
the least randomness or "impurity" in these partitions
❖ The expected information needed to classify a tuple in D is given by
$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
• Let D, the data partition, be a training set of class-labeled tuples. Suppose the class label attribute has m
distinct values defining m distinct classes, C_i (i = 1 to m). Here p_i = s_i/s, where s is the number of samples
in D and s_i is the number of samples with class label C_i. Info(D) is also known as the entropy of D.
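The entropy of a set of class-labeled tuples can be sketched directly from the definitions p_i = s_i/s. The function name `info` and the example labels are hypothetical, chosen only to illustrate the two extreme cases.

```python
import math
from collections import Counter

def info(labels):
    """Entropy of a collection of class labels: -sum_i p_i * log2(p_i)."""
    s = len(labels)                  # s: total number of samples
    counts = Counter(labels)         # s_i: number of samples in each class
    return -sum((s_i / s) * math.log2(s_i / s) for s_i in counts.values())

# Hypothetical label sets
mixed = ["yes", "yes", "no", "no"]   # evenly split between two classes
pure = ["yes", "yes", "yes", "yes"]  # a single class only

print(info(mixed))  # 1.0, the maximum for two classes
```

An evenly split two-class set has entropy 1 bit, while a pure set has entropy 0, which matches the reading of entropy as a measure of impurity.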
ID3--Continued
• Suppose we were to partition the tuples in D on some attribute A having v distinct
values, [a_1, a_2, …, a_v], as observed from the training data. If A is discrete-valued, these
values correspond directly to the v outcomes of a test on A. Attribute A can be used to
split D into v partitions or subsets, [D_1, D_2, …, D_v], where D_j contains those tuples in D that
have outcome a_j of A.
$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j)$$
• Here, |D_j|/|D| acts as the weight of the jth partition; Info_A(D) is the expected
information required to classify a tuple from D based on the partitioning by A.
• $\mathrm{Info}(D_j) = -\sum_{i=1}^{m} p_{ij} \log_2(p_{ij})$, where p_ij = s_ij/|D_j| and s_ij is the number of samples
that belong to class C_i and have the attribute value a_j.
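The weighted sum over partitions can be sketched from per-partition class counts. The function names and the example counts are hypothetical; zero counts are skipped, which applies the usual convention that 0·log2(0) contributes 0.

```python
import math

def info_from_counts(counts):
    """Info(D_j) from the class counts s_1j, ..., s_mj of one partition."""
    size = sum(counts)
    return -sum((c / size) * math.log2(c / size) for c in counts if c > 0)

def info_after_split(partitions):
    """Info_A(D): entropy of each partition D_j weighted by |D_j| / |D|."""
    size_d = sum(sum(counts) for counts in partitions)
    return sum(sum(counts) / size_d * info_from_counts(counts)
               for counts in partitions)

# Hypothetical split of 8 tuples into two partitions of 4:
# the first holds 2 tuples of each class, the second is pure.
partitions = [[2, 2], [4, 0]]
print(info_after_split(partitions))  # 0.5
```

Here each partition has weight 4/8, the first has entropy 1 and the second has entropy 0, so the expected information after the split is 0.5.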
ID3--Continued
• Information gain is defined as the difference between the original information requirement (i.e., based on
just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A):
Gain(A) = Info(D) − Info_A(D)
• In other words, Gain(A) tells us how much would be gained by branching on A. It is the expected reduction
in the information requirement caused by knowing the value of A. The attribute A with the highest
information gain, Gain(A), is chosen as the splitting attribute at node N.
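Selecting the splitting attribute by information gain can be sketched as below. The four rows are hypothetical toy data invented for this illustration (they are not the training table from the slides), and the names `info`, `gain`, and `best` are illustrative.

```python
import math
from collections import Counter, defaultdict

def info(labels):
    """Entropy of a list of class labels."""
    s = len(labels)
    return -sum((c / s) * math.log2(c / s) for c in Counter(labels).values())

def gain(rows, attribute, target):
    """Gain(A) = Info(D) - Info_A(D) for one candidate attribute A."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])  # partition D by value of A
    info_a = sum(len(g) / len(rows) * info(g) for g in groups.values())
    return info([row[target] for row in rows]) - info_a

# Hypothetical toy data, not taken from the slides
rows = [
    {"student": "yes", "credit": "fair", "buys": "yes"},
    {"student": "yes", "credit": "excellent", "buys": "yes"},
    {"student": "no", "credit": "fair", "buys": "no"},
    {"student": "no", "credit": "excellent", "buys": "no"},
]

gains = {a: gain(rows, a, "buys") for a in ("student", "credit")}
best = max(gains, key=gains.get)
print(gains, best)
```

In this toy data `student` separates the classes perfectly (gain 1.0) while `credit` carries no information (gain 0.0), so `student` would be chosen as the splitting attribute.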
Problem statement: Find out Test Attribute
Solution:
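$$\begin{aligned}
\mathrm{Entropy}(S_{\text{youth}}) &= -\sum_{i=1}^{2} p_{i1} \log_2(p_{i1}) \\
&= -p_{11}\log_2(p_{11}) - p_{21}\log_2(p_{21}) \\
&= -\tfrac{2}{5}\log_2\left(\tfrac{2}{5}\right) - \tfrac{3}{5}\log_2\left(\tfrac{3}{5}\right) \\
&= 0.971
\end{aligned}$$
Here, p_11 = s_11/|D_1| = 2/5 and p_21 = s_21/|D_1| = 3/5.
Base-2 logarithms can be computed as log2 X = log10 X / log10 2.
$$\begin{aligned}
\mathrm{Entropy}(S_{\text{middle}}) &= -\sum_{i=1}^{2} p_{i2} \log_2(p_{i2}) \\
&= -p_{12}\log_2(p_{12}) - p_{22}\log_2(p_{22}) \\
&= -\tfrac{4}{4}\log_2\left(\tfrac{4}{4}\right) - \tfrac{0}{4}\log_2\left(\tfrac{0}{4}\right) \\
&= 0
\end{aligned}$$
Here, p_12 = s_12/|D_2| = 4/4 and p_22 = s_22/|D_2| = 0/4; the term 0 · log2(0) is taken to be 0 by convention.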
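The two entropy values in the solution can be checked numerically from the class counts given above (2 and 3 for youth, 4 and 0 for middle). The function name `entropy` is illustrative; zero counts are skipped to apply the convention 0 · log2(0) = 0.

```python
import math

def entropy(counts):
    """Entropy from class counts, skipping empty classes (0 * log2(0) = 0)."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

youth = entropy([2, 3])   # p11 = 2/5, p21 = 3/5
middle = entropy([4, 0])  # p12 = 4/4, p22 = 0/4

print(f"Entropy(S_youth)  = {youth:.3f}")   # Entropy(S_youth)  = 0.971
print(f"Entropy(S_middle) = {middle:.3f}")  # Entropy(S_middle) = 0.000
```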
Decision Tree
X = (age = youth, income = medium, student = yes, credit = fair) Class label=?