DATA MINING
Submitted by
K.Lalithambiga.,Msc(cs),
Nadar Saraswathi College of Arts & Science.,
Theni.
Introduction
▪ Three types of attributes:
• Nominal: values from an unordered set.
• Ordinal: values from an ordered set.
• Continuous: real numbers.
▪ Discretization:
• Divides the range of a continuous attribute into intervals.
• Is needed because some classification algorithms accept only categorical attributes.
• Reduces data size.
• Prepares the data for further analysis.
▪ Discretization
• Reduces the number of values of a continuous attribute by dividing its range into intervals.
• Interval labels can then replace the actual data values.
▪ Concept hierarchies
• Reduce the data by collecting low-level concepts (such as numeric values for the attribute age) and replacing them with higher-level concepts (such as young, middle-aged, or senior).
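As a small illustration of replacing numeric values with interval labels, the sketch below maps ages to the three concepts named above. The cut points 30 and 60 and the names AGE_CUTS, AGE_LABELS and age_to_concept are assumptions made for the example; the slides give the labels but not the age boundaries.

```python
from bisect import bisect_right

# Hypothetical cut points: the slides name the labels but not the boundaries.
AGE_CUTS = [30, 60]
AGE_LABELS = ["young", "middle-aged", "senior"]

def age_to_concept(age):
    """Replace a numeric age with the label of the interval it falls in."""
    # bisect_right counts how many cut points are <= age,
    # which is exactly the index of the matching label.
    return AGE_LABELS[bisect_right(AGE_CUTS, age)]

ages = [23, 35, 61, 30, 47]
labels = [age_to_concept(a) for a in ages]
print(labels)
```

With these assumed cut points an age equal to a boundary falls into the upper interval; a different convention would only change which bisect function is used.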
Binning.
Histogram analysis.
Clustering analysis.
Entropy-based discretization.
Segmentation by natural partitioning.
▪ Attribute values can be discretized by distributing them into bins and replacing each value with its bin mean or bin median.
▪ This technique can be applied recursively to the resulting partitions to generate concept hierarchies.
▪ Binning does not use class information, so it is an unsupervised discretization technique.
▪ It is sensitive to the user-specified number of bins.
Equal-width (distance) partitioning:
• Divides the range into N intervals of equal size (a uniform grid).
• If A and B are the lowest and highest values of the attribute, the interval width is W = (B - A)/N.
• It is the most straightforward method.
• But outliers may dominate the presentation.
• Skewed data is not handled well.
Equal-depth (frequency) partitioning:
• Divides the range into N intervals, each containing approximately the same number of samples.
• Gives good data scaling.
• Managing categorical attributes can be tricky.
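The two partitioning schemes can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names equal_width_edges and equal_depth_bins are made up for the example, and the data is the sorted price list from the deck's worked example.

```python
def equal_width_edges(values, n):
    """Return n + 1 edges splitting [A, B] into n intervals of width W = (B - A) / n."""
    a, b = min(values), max(values)
    w = (b - a) / n
    return [a + i * w for i in range(n + 1)]

def equal_depth_bins(values, n):
    """Split the sorted values into n bins holding roughly the same number of samples."""
    data = sorted(values)
    size, extra = divmod(len(data), n)
    bins, start = [], 0
    for i in range(n):
        # Spread any remainder over the first bins so sizes differ by at most one.
        end = start + size + (1 if i < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
edges = equal_width_edges(prices, 3)
depth_bins = equal_depth_bins(prices, 3)
print(edges)
print(depth_bins)
```

Equal-width only needs the minimum and maximum, which is why a single outlier can stretch every interval; equal-depth needs the sorted data but adapts to skew.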
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-width) bins:
- Bin 1 (4-14): 4, 8, 9
- Bin 2 (15-24): 15, 21, 21, 24
- Bin 3 (25-34): 25, 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest integer):
- Bin 1: 7, 7, 7
- Bin 2: 20, 20, 20, 20
- Bin 3: 28, 28, 28, 28, 28
* Smoothing by bin boundaries (each value is replaced by the closer of the bin's minimum and maximum):
- Bin 1: 4, 9, 9
- Bin 2: 15, 24, 24, 24
- Bin 3: 25, 25, 25, 25, 34
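The two smoothing steps can be checked with a short sketch. The function names are hypothetical, the means are rounded to the nearest integer as in the example, and a value exactly halfway between the two boundaries is sent to the lower one (an assumption; the slides do not say how ties are handled).

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean, rounded to the nearest integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    out = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Ties go to the lower boundary (assumed convention).
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

bins = [[4, 8, 9], [15, 21, 21, 24], [25, 26, 28, 29, 34]]
means = smooth_by_means(bins)
bounds = smooth_by_boundaries(bins)
print(means)
print(bounds)
```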
▪ Histograms can also be used for discretization.
▪ Partitioning rules can be applied to define the ranges of values.
▪ The histogram analysis algorithm can be applied recursively to each partition to automatically generate a multilevel concept hierarchy, with the procedure terminating once a prespecified number of concept levels has been reached.
▪ A minimum interval size can be used per level to control the recursive procedure.
▪ This specifies the minimum width of a partition, or the minimum number of values per partition, at each level.
▪ A popular data reduction technique.
▪ Divide the data into buckets and store the average (or sum) for each bucket.
▪ Can be constructed optimally in one dimension using dynamic programming.
▪ Related to quantization problems.
[Figure: histogram; horizontal axis ticks from 10000 to 90000, vertical axis ticks from 0 to 40]
▪ Several techniques for determining buckets:
• Equiwidth: the width of each bucket range is uniform.
• Equidepth: each bucket contains roughly the same number of contiguous samples.
• V-Optimal: the histogram with the least variance, where the variance is a weighted sum of the original values that each bucket represents and the bucket weight is the number of values in the bucket.
• MaxDiff: a bucket boundary is placed between each pair of adjacent values for the B - 1 pairs with the largest differences, where B is user defined.
▪ V-Optimal and MaxDiff tend to be the most accurate and practical.
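The MaxDiff idea, as described on the slide, can be sketched as follows: sort the values, measure the gap between each adjacent pair, and cut at the B - 1 largest gaps. The function name maxdiff_buckets and the tie-breaking by position are assumptions for the example; published MaxDiff variants may measure the difference on other quantities than the raw values.

```python
def maxdiff_buckets(values, b):
    """Split sorted values into b buckets at the b - 1 largest adjacent gaps."""
    data = sorted(values)
    # Gap i sits between data[i] and data[i + 1].
    gaps = [(data[i + 1] - data[i], i) for i in range(len(data) - 1)]
    # Take the b - 1 largest gaps (ties broken by position), then restore left-to-right order.
    largest = sorted(gaps, key=lambda g: (-g[0], g[1]))[: b - 1]
    cut_after = sorted(i for _, i in largest)
    buckets, start = [], 0
    for i in cut_after:
        buckets.append(data[start : i + 1])
        start = i + 1
    buckets.append(data[start:])
    return buckets

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
buckets = maxdiff_buckets(prices, 3)
print(buckets)
```

On this data the two largest gaps (9 to 15 and 15 to 21) are adjacent, so the value 15 ends up in a bucket of its own, which shows how MaxDiff isolates values that sit far from their neighbours.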
▪ A clustering algorithm can be applied to partition the data into clusters or groups.
▪ Each cluster forms a node of a concept hierarchy, where all nodes are at the same conceptual level.
▪ Each cluster may be further decomposed into sub-clusters, forming a lower level in the hierarchy.
▪ Clusters may also be grouped together to form a higher level of the concept hierarchy.
▪ Hierarchical clusterings can be stored in multi-dimensional index tree structures.
▪ Clustering also allows detection and removal of outliers.
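One way to discretize a single numeric attribute by clustering is plain one-dimensional k-means, sketched below. The slides do not name a specific clustering algorithm, so k-means, the function name kmeans_1d and the evenly spread starting centroids are all assumptions chosen to keep the example short and deterministic.

```python
def kmeans_1d(values, k, iters=100):
    """Plain 1-D k-means (Lloyd's algorithm); returns the clusters in ascending order."""
    data = sorted(values)
    # Deterministic start: centroids spread evenly over the sorted data.
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # Recompute each centroid as its cluster mean; keep the old one if the cluster is empty.
        updated = [sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return [c for c in clusters if c]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
clusters = kmeans_1d(prices, 3)
print(clusters)
```

Each returned cluster can serve as one interval of the discretized attribute; running the same function again on a single cluster gives the sub-clusters of the next lower level.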
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T on attribute A, the entropy after partitioning is

$$E(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2)$$

• S1 and S2 correspond to the samples in S satisfying the conditions A < T and A >= T, respectively.
The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
The process is applied recursively to the partitions obtained until some stopping criterion is met, e.g., splitting continues only while the information gain exceeds a threshold δ:

$$\mathrm{Ent}(S) - E(S, T) > \delta$$

Experiments show that it may reduce data size and improve classification accuracy.
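The procedure can be sketched as follows, assuming Ent is the usual class entropy (base 2), candidate boundaries are midpoints between adjacent distinct values, and splitting continues only while the gain Ent(S) - E(S, T) exceeds a threshold delta. The function names, the default delta of 0.1 and the toy data are all hypothetical.

```python
from collections import Counter
from math import log2

def ent(labels):
    """Class entropy Ent(S) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_boundary(samples):
    """Return (T, E(S, T)) with minimal weighted entropy; samples are (value, label) pairs."""
    values = sorted({v for v, _ in samples})
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate boundary: midpoint of adjacent distinct values
        s1 = [c for v, c in samples if v < t]
        s2 = [c for v, c in samples if v >= t]
        e = len(s1) / len(samples) * ent(s1) + len(s2) / len(samples) * ent(s2)
        if best is None or e < best[1]:
            best = (t, e)
    return best

def discretize(samples, delta=0.1):
    """Recursively split while Ent(S) - E(S, T) > delta; returns the sorted boundaries."""
    split = best_boundary(samples)
    if split is None:  # fewer than two distinct values: nothing to split
        return []
    t, e = split
    if ent([c for _, c in samples]) - e <= delta:
        return []
    left = [s for s in samples if s[0] < t]
    right = [s for s in samples if s[0] >= t]
    return discretize(left, delta) + [t] + discretize(right, delta)

# Hypothetical toy data: (attribute value, class label).
data = [(1, "a"), (2, "a"), (3, "a"), (10, "b"), (11, "b"), (12, "b")]
boundaries = discretize(data)
print(boundaries)
```

Unlike binning, this method is supervised: the class labels decide where the boundaries go, so a pure interval is never split further.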
 3-4-5 rule can be used to segment numeric data into
relatively uniform, “natural” intervals.
 If an interval covers 3, 6, 7 or 9 distinct values at the
most significant digit, partition the range into 3 equi-
width intervals.
 If it covers 2, 4, or 8 distinct values at the most
significant digit, partition the range into 4 intervals.
 If it covers 1, 5, or 10 distinct values at the most
significant digit, partition the range into 5 intervals.
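A compact way to read the rule is as a lookup from the number of distinct most-significant-digit values to the number of intervals. The sketch below assumes the range has already been rounded to multiples of the msd unit; the function names are hypothetical, and the 2-3-2 grouping for 7 distinct values follows the common textbook formulation rather than anything stated on the slide.

```python
def parts_345(distinct):
    """Number of intervals the 3-4-5 rule assigns for a count of distinct msd values."""
    if distinct in (3, 6, 7, 9):
        return 3
    if distinct in (2, 4, 8):
        return 4
    if distinct in (1, 5, 10):
        return 5
    raise ValueError("the 3-4-5 rule covers 1 to 10 distinct values")

def split_345(low, high, msd):
    """Split [low, high], both multiples of msd, according to the 3-4-5 rule."""
    distinct = (high - low) // msd
    n = parts_345(distinct)
    if distinct == 7:
        # Assumed textbook special case: group the 7 units as 2-3-2.
        cuts = [0, 2, 5, 7]
        return [(low + a * msd, low + b * msd) for a, b in zip(cuts, cuts[1:])]
    width = (high - low) / n
    return [(low + i * width, low + (i + 1) * width) for i in range(n)]

top = split_345(-1_000_000, 2_000_000, 1_000_000)
print(top)
print(split_345(0, 1_000_000, 1_000_000))
```

Applying split_345 again to each returned interval, with the msd unit of that interval, gives the next level of the hierarchy.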
Figure: the 3-4-5 rule applied to profit data (values in thousands of dollars)
Step 1: Min = -$351, Low (5th percentile) = -$159, High (95th percentile) = $1,838, Max = $4,700
Step 2: msd = 1,000; Low rounded to -$1,000; High rounded to $2,000
Step 3: (-$1,000 to $2,000) is split into (-$1,000 to 0), (0 to $1,000), ($1,000 to $2,000)
Step 4: (-$400 to $5,000) is split into:
• (-$400 to 0): (-$400 to -$300), (-$300 to -$200), (-$200 to -$100), (-$100 to 0)
• (0 to $1,000): (0 to $200), ($200 to $400), ($400 to $600), ($600 to $800), ($800 to $1,000)
• ($1,000 to $2,000): ($1,000 to $1,200), ($1,200 to $1,400), ($1,400 to $1,600), ($1,600 to $1,800), ($1,800 to $2,000)
• ($2,000 to $5,000): ($2,000 to $3,000), ($3,000 to $4,000), ($4,000 to $5,000)
▪ Step 1: Min = -$351,976, Max = $4,700,896, low (5th percentile) = -$159,876, high (95th percentile) = $1,838,761.
▪ Step 2: For low and high, the most significant digit is at $1,000,000; rounding low down gives -$1,000,000 and rounding high up gives $2,000,000.
▪ Step 3: The interval ranges over 3 distinct values at the most significant digit, so the 3-4-5 rule partitions it into 3 intervals: -$1,000,000 to $0, $0 to $1,000,000, and $1,000,000 to $2,000,000.
▪ Step 4: Examine the Min and Max values to see how they “fit” into the first-level partitions. The first partition covers the Min value, so move its left boundary to -$400,000 (Min rounded down at its own most significant digit) to make the partition smaller. The last partition does not cover the Max value, so create a new partition, $2,000,000 to $5,000,000 (Max rounded up at its most significant digit).
▪ Step 5: Recursively, each interval can be further partitioned using the 3-4-5 rule to form the next lower level of the hierarchy.
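The rounding in Steps 2 and 4 can be reproduced with integer arithmetic. This is a sketch under the assumption that values are whole dollars; the helper names msd_unit, round_down and round_up are made up for the example.

```python
def msd_unit(x):
    """Place value of the most significant digit of |x|, e.g. 1_838_761 -> 1_000_000."""
    return 10 ** (len(str(abs(int(x)))) - 1)

def round_down(x, unit):
    """Largest multiple of unit that is <= x (floor division also handles negative x)."""
    return (x // unit) * unit

def round_up(x, unit):
    """Smallest multiple of unit that is >= x (ceiling via negated floor division)."""
    return -((-x) // unit) * unit

# Values from Step 1 of the example.
vmin, low, high, vmax = -351_976, -159_876, 1_838_761, 4_700_896

# Step 2: round low down and high up at the larger msd unit of the two.
msd = max(msd_unit(low), msd_unit(high))
low_r, high_r = round_down(low, msd), round_up(high, msd)
distinct = (high_r - low_r) // msd  # number of distinct msd values covered

# Step 4: tighten the left edge towards Min and extend the right edge to cover Max.
left = round_down(vmin, msd_unit(vmin))
right = round_up(vmax, msd_unit(vmax))

print(msd, low_r, high_r, distinct)
print(left, right)
```

Using the 5th and 95th percentiles in Step 2 keeps extreme values from dictating the top-level intervals; Min and Max only adjust the outermost boundaries afterwards.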
▪ Specification of a partial ordering of attributes explicitly at the schema level by users or experts
• Example: a relational database may contain the attributes street, city, province_or_state, country.
• An expert defines the ordering of the hierarchy, such as street < city < province_or_state < country.
▪ Specification of a portion of a hierarchy by explicit data grouping
• Example for province_or_state and country: {Alberta, Saskatchewan, Manitoba} ⊂ prairies_Canada and {British Columbia, prairies_Canada} ⊂ Western Canada.
▪ Specification of a set of attributes, but not of their partial ordering
• The attribute ordering is generated automatically, based on the observation that an attribute defining a high-level concept has fewer distinct values than an attribute defining a lower-level concept.
• Example: country (15), province_or_state (365), city (3567), street (674,339).
▪ Specification of only a partial set of attributes
• Try to parse the database schema to determine the complete hierarchy.
▪ A concept hierarchy can be generated automatically based on the number of distinct values per attribute in the given attribute set.
▪ The attribute with the most distinct values is placed at the lowest level of the hierarchy.
country: 15 distinct values
province_or_state: 365 distinct values
city: 3567 distinct values
street: 674,339 distinct values
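The automatic ordering amounts to sorting the attributes by their distinct-value counts. A minimal sketch, using the counts from the example and a hypothetical function name order_by_distinct:

```python
def order_by_distinct(counts):
    """Order attributes from the highest level (fewest distinct values) to the lowest."""
    return sorted(counts, key=counts.get)

# Distinct-value counts taken from the example.
counts = {"street": 674_339, "country": 15, "city": 3567, "province_or_state": 365}
hierarchy = order_by_distinct(counts)
print(" > ".join(hierarchy))
```

The heuristic is not foolproof. A time dimension with 20 distinct years, 12 months and 7 days of the week would be ordered with day of the week at the top, so the generated hierarchy should be reviewed by someone who knows the data.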
Datamining

Mais conteúdo relacionado

Mais procurados

Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clusteringguest0edcaf
 
Frequency distribution
Frequency distributionFrequency distribution
Frequency distributionAishwarya PT
 
CLASSIFICATION AND TABULATION in Biostatic
CLASSIFICATION AND TABULATION in BiostaticCLASSIFICATION AND TABULATION in Biostatic
CLASSIFICATION AND TABULATION in BiostaticMuhammad Amir Sohail
 
Chapter 2 250110 083240
Chapter 2 250110 083240Chapter 2 250110 083240
Chapter 2 250110 083240guest25d353
 
Presentation of Data and Frequency Distribution
Presentation of Data and Frequency DistributionPresentation of Data and Frequency Distribution
Presentation of Data and Frequency DistributionElain Cruz
 
Chapter 02
Chapter 02Chapter 02
Chapter 02bmcfad01
 
QT1 - 02 - Frequency Distribution
QT1 - 02 - Frequency DistributionQT1 - 02 - Frequency Distribution
QT1 - 02 - Frequency DistributionPrithwis Mukerjee
 
Frequency Distributions and Graphs
Frequency Distributions and GraphsFrequency Distributions and Graphs
Frequency Distributions and Graphsmonritche
 
Frequency Tables, Frequency Distributions, and Graphic Presentation
Frequency Tables, Frequency Distributions, and Graphic PresentationFrequency Tables, Frequency Distributions, and Graphic Presentation
Frequency Tables, Frequency Distributions, and Graphic PresentationConflagratioNal Jahid
 
Probability and statistics (frequency distributions)
Probability and statistics (frequency distributions)Probability and statistics (frequency distributions)
Probability and statistics (frequency distributions)Don Bosco BSIT
 
Data pre processing
Data pre processingData pre processing
Data pre processingjunnubabu
 
Data array and frequency distribution
Data array and frequency distributionData array and frequency distribution
Data array and frequency distributionraboz
 
Chapter 2: Frequency Distribution and Graphs
Chapter 2: Frequency Distribution and GraphsChapter 2: Frequency Distribution and Graphs
Chapter 2: Frequency Distribution and GraphsMong Mara
 

Mais procurados (20)

Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clustering
 
Datamining
DataminingDatamining
Datamining
 
Frequency distribution
Frequency distributionFrequency distribution
Frequency distribution
 
CLASSIFICATION AND TABULATION in Biostatic
CLASSIFICATION AND TABULATION in BiostaticCLASSIFICATION AND TABULATION in Biostatic
CLASSIFICATION AND TABULATION in Biostatic
 
Data For Datamining
Data For DataminingData For Datamining
Data For Datamining
 
Chapter 2 250110 083240
Chapter 2 250110 083240Chapter 2 250110 083240
Chapter 2 250110 083240
 
Presentation of Data and Frequency Distribution
Presentation of Data and Frequency DistributionPresentation of Data and Frequency Distribution
Presentation of Data and Frequency Distribution
 
Chapter 02
Chapter 02Chapter 02
Chapter 02
 
QT1 - 02 - Frequency Distribution
QT1 - 02 - Frequency DistributionQT1 - 02 - Frequency Distribution
QT1 - 02 - Frequency Distribution
 
Frequency Distributions and Graphs
Frequency Distributions and GraphsFrequency Distributions and Graphs
Frequency Distributions and Graphs
 
Data reduction
Data reductionData reduction
Data reduction
 
Frequency Tables, Frequency Distributions, and Graphic Presentation
Frequency Tables, Frequency Distributions, and Graphic PresentationFrequency Tables, Frequency Distributions, and Graphic Presentation
Frequency Tables, Frequency Distributions, and Graphic Presentation
 
Classification & tabulation of data
Classification & tabulation of dataClassification & tabulation of data
Classification & tabulation of data
 
Probability and statistics (frequency distributions)
Probability and statistics (frequency distributions)Probability and statistics (frequency distributions)
Probability and statistics (frequency distributions)
 
Data discretization
Data discretizationData discretization
Data discretization
 
Chapter 2
Chapter 2Chapter 2
Chapter 2
 
Data pre processing
Data pre processingData pre processing
Data pre processing
 
Data mining primitives
Data mining primitivesData mining primitives
Data mining primitives
 
Data array and frequency distribution
Data array and frequency distributionData array and frequency distribution
Data array and frequency distribution
 
Chapter 2: Frequency Distribution and Graphs
Chapter 2: Frequency Distribution and GraphsChapter 2: Frequency Distribution and Graphs
Chapter 2: Frequency Distribution and Graphs
 

Semelhante a Datamining

CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxShwetapadmaBabu1
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data MiningDHIVYADEVAKI
 
CLuster analysis presentation.pptx
CLuster analysis presentation.pptxCLuster analysis presentation.pptx
CLuster analysis presentation.pptxSAJANVERMA4
 
data clean.ppt
data clean.pptdata clean.ppt
data clean.pptchatbot9
 
Classification Systems
Classification SystemsClassification Systems
Classification SystemsJohn Reiser
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfSowmyaJyothi3
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in RSudhakar Chavan
 
1) Chapter#02 Presentation of Data.ppt
1) Chapter#02 Presentation of Data.ppt1) Chapter#02 Presentation of Data.ppt
1) Chapter#02 Presentation of Data.pptMuntazirMehdi43
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learningAmAn Singh
 
UNIT 3: Data Warehousing and Data Mining
UNIT 3: Data Warehousing and Data MiningUNIT 3: Data Warehousing and Data Mining
UNIT 3: Data Warehousing and Data MiningNandakumar P
 
Cluster Analysis
Cluster Analysis Cluster Analysis
Cluster Analysis Baivab Nag
 
Summarizing Data : Listing and Grouping pdf
Summarizing Data : Listing and Grouping pdfSummarizing Data : Listing and Grouping pdf
Summarizing Data : Listing and Grouping pdfJustynOwen
 
04 Classification in Data Mining
04 Classification in Data Mining04 Classification in Data Mining
04 Classification in Data MiningValerii Klymchuk
 
1. chapter i(pasw)
1. chapter i(pasw)1. chapter i(pasw)
1. chapter i(pasw)Chhom Karath
 
Unsupervised Learning in Machine Learning
Unsupervised Learning in Machine LearningUnsupervised Learning in Machine Learning
Unsupervised Learning in Machine LearningPyingkodi Maran
 
MODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptxMODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptxnikshaikh786
 

Semelhante a Datamining (20)

CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptx
 
3 module 2
3 module 23 module 2
3 module 2
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data Mining
 
CLuster analysis presentation.pptx
CLuster analysis presentation.pptxCLuster analysis presentation.pptx
CLuster analysis presentation.pptx
 
data clean.ppt
data clean.pptdata clean.ppt
data clean.ppt
 
Classification Systems
Classification SystemsClassification Systems
Classification Systems
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in R
 
Cluster Validation
Cluster ValidationCluster Validation
Cluster Validation
 
1) Chapter#02 Presentation of Data.ppt
1) Chapter#02 Presentation of Data.ppt1) Chapter#02 Presentation of Data.ppt
1) Chapter#02 Presentation of Data.ppt
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learning
 
PPT s10-machine vision-s2
PPT s10-machine vision-s2PPT s10-machine vision-s2
PPT s10-machine vision-s2
 
UNIT 3: Data Warehousing and Data Mining
UNIT 3: Data Warehousing and Data MiningUNIT 3: Data Warehousing and Data Mining
UNIT 3: Data Warehousing and Data Mining
 
Cluster Analysis
Cluster Analysis Cluster Analysis
Cluster Analysis
 
Summarizing Data : Listing and Grouping pdf
Summarizing Data : Listing and Grouping pdfSummarizing Data : Listing and Grouping pdf
Summarizing Data : Listing and Grouping pdf
 
UNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptxUNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptx
 
04 Classification in Data Mining
04 Classification in Data Mining04 Classification in Data Mining
04 Classification in Data Mining
 
1. chapter i(pasw)
1. chapter i(pasw)1. chapter i(pasw)
1. chapter i(pasw)
 
Unsupervised Learning in Machine Learning
Unsupervised Learning in Machine LearningUnsupervised Learning in Machine Learning
Unsupervised Learning in Machine Learning
 
MODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptxMODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptx
 

Mais de lalithambiga kamaraj (20)

Firewall in Network Security
Firewall in Network SecurityFirewall in Network Security
Firewall in Network Security
 
Data Compression in Multimedia
Data Compression in MultimediaData Compression in Multimedia
Data Compression in Multimedia
 
Data CompressionMultimedia
Data CompressionMultimediaData CompressionMultimedia
Data CompressionMultimedia
 
Digital Audio in Multimedia
Digital Audio in MultimediaDigital Audio in Multimedia
Digital Audio in Multimedia
 
Network Security: Physical security
Network Security: Physical security Network Security: Physical security
Network Security: Physical security
 
Graphs in Data Structure
Graphs in Data StructureGraphs in Data Structure
Graphs in Data Structure
 
Package in Java
Package in JavaPackage in Java
Package in Java
 
Exception Handling in Java
Exception Handling in JavaException Handling in Java
Exception Handling in Java
 
Data structure
Data structureData structure
Data structure
 
Digital Image Processing
Digital Image ProcessingDigital Image Processing
Digital Image Processing
 
Digital Image Processing
Digital Image ProcessingDigital Image Processing
Digital Image Processing
 
Estimating Software Maintenance Costs
Estimating Software Maintenance CostsEstimating Software Maintenance Costs
Estimating Software Maintenance Costs
 
Digital Components
Digital ComponentsDigital Components
Digital Components
 
Deadlocks in operating system
Deadlocks in operating systemDeadlocks in operating system
Deadlocks in operating system
 
Io management disk scheduling algorithm
Io management disk scheduling algorithmIo management disk scheduling algorithm
Io management disk scheduling algorithm
 
Recovery system
Recovery systemRecovery system
Recovery system
 
File management
File managementFile management
File management
 
Preprocessor
PreprocessorPreprocessor
Preprocessor
 
Inheritance
InheritanceInheritance
Inheritance
 
Managing console of I/o operations & working with files
Managing console of I/o operations & working with filesManaging console of I/o operations & working with files
Managing console of I/o operations & working with files
 

Último

Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 

Último (20)

Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 

Datamining

  • 1. DATA MINING Submitted by K.Lalithambiga.,Msc(cs), Nadar Saraswathi College of Arts & Science., Theni.
  • 2. Introduction  Three types of attributes: • Nominal — values from an unordered set. • Ordinal — values from an ordered set. • Continuous — real numbers. Discretization: • Divide the range of a continuous attribute into intervals • Some classification algorithms only accept categorical attributes. • Reduce data size by Discretization. • Prepare for further analysis.
  • 3.  Discretization • Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. • Interval labels can then be used to replace actual data values.  Concept hierarchies • Reduce the data by collecting and replacing low level concepts (such as numeric values for the attribute age) by higher level concepts (such as young, middle-aged, or senior).
  • 4. Binning. Histogram analysis. Clustering analysis. Entropy-based discretization. Segmentation by natural partitioning.
  • 5.  Attribute values can be discretized by distributing the values into bin and replacing each bin by the mean bin value or bin median value.  These technique can be applied recursively to the resulting partitions in order to generate concept hierarchies.  Binning does not use class information and unsupervised discretization technique.  It is sensitive to the user-specified number of bins.
  • 6. Equal-width (distance) partitioning: • It divides the range into N intervals of equal size: uniform grid • if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B-A)/N. The most straightforward • But outliers may dominate presentation • Skewed data is not handled well. Equal-depth (frequency) partitioning: • It divides the range into N intervals, each containing approximately same number of samples • Good data scaling • Managing categorical attributes can be tricky.
  • 7. * Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 * Partition into (equi-width) bins: - Bin 1 (4-14): 4, 8, 9 - Bin 2(15-24): 15, 21, 21, 24 - Bin 3(25-34): 25, 26, 28, 29, 34 * Smoothing by bin means: - Bin 1: 7, 7, 7 - Bin 2: 20, 20, 20, 20 - Bin 3: 28, 28, 28, 28, 28 * Smoothing by bin boundaries: - Bin 1: 4, 4, 4 - Bin 2: 15, 24, 24, 24 - Bin 3: 25, 25, 25, 25, 34
  • 8.  Histograms can also be used for discretization.  Partitioning rules can be applied to define range of values.  The histogram analyses algorithm can be applied recursively to each partition in order to automatically generate a multilevel concept hierarchy, with the procedure terminating once a prespecified number of concept levels have been reached.  A minimum interval size can be used per level to control the recursive procedure.  This specifies the minimum width of the partition, or the minimum member of partitions at each level.
  • 9.  A popular data reduction technique  Divide data into buckets and store average (sum) for each bucket  Can be constructed optimally in one dimension using dynamic programming  Related to quantization problems. 0 5 10 15 20 25 30 35 40 10000 30000 50000 70000 90000
  • 10. Several techniques for determining buckets:
      • Equiwidth: the width of each bucket range is uniform.
      • Equidepth: each bucket contains roughly the same number of contiguous samples.
      • V-Optimal: the histogram with the least variance, where histogram variance is a weighted sum of the original values that each bucket represents and bucket weight = number of values in the bucket.
      • MaxDiff: a bucket boundary is established between each pair of adjacent values for the pairs having the B - 1 largest differences, where B is the user-defined number of buckets.
    V-Optimal and MaxDiff tend to be the most accurate and practical.
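The MaxDiff rule, as described on this slide, can be sketched as follows. The function name `maxdiff_buckets`, the tie-breaking rule, and the sample data are assumptions chosen for illustration only.

```python
def maxdiff_buckets(values, n_buckets):
    """Split sorted values at the n_buckets - 1 largest adjacent gaps."""
    ordered = sorted(values)
    # Gap i is the difference between ordered[i] and ordered[i + 1].
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    # Keep the positions of the largest gaps; ties go to the earlier
    # position (an assumption, the slide does not say how to break ties).
    largest = sorted(gaps, key=lambda g: (-g[0], g[1]))[: n_buckets - 1]
    cut_after = sorted(i for _, i in largest)
    buckets, start = [], 0
    for i in cut_after:
        buckets.append(ordered[start : i + 1])
        start = i + 1
    buckets.append(ordered[start:])
    return buckets


data = [1, 2, 3, 10, 11, 12, 30, 31]  # made-up values with two obvious gaps
print(maxdiff_buckets(data, 3))  # [[1, 2, 3], [10, 11, 12], [30, 31]]
```

With B = 3, the two largest adjacent differences (12 to 30 and 3 to 10) become the bucket boundaries.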
  • 11. Clustering analysis: a clustering algorithm can be applied to partition data into clusters or groups. Each cluster forms a node of a concept hierarchy, where all nodes are at the same conceptual level. Each cluster may be further decomposed into sub-clusters, forming a lower level in the hierarchy. Clusters may also be grouped together to form a higher level of the concept hierarchy. Clusterings can be hierarchical and can be stored in multi-dimensional index tree structures.
  • 12. Clustering also allows detection and removal of outliers.
  • 13. Entropy-based discretization: given a set of samples S, if S is partitioned into two intervals $S_1$ and $S_2$ using boundary T, the entropy after partitioning is
      $E(S, T) = \frac{|S_1|}{|S|} Ent(S_1) + \frac{|S_2|}{|S|} Ent(S_2)$
    where $S_1$ and $S_2$ are the samples in S satisfying A < T and A >= T, respectively, for the attribute A being discretized. The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
  • 14. The process is applied recursively to the partitions obtained until some stopping criterion is met, e.g., when the information gain $Ent(S) - E(S, T)$ falls below a chosen threshold. Experiments show that it may reduce data size and improve classification accuracy.
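A single step of entropy-based discretization can be sketched in Python as below. This is an illustration under stated assumptions: the helper names, the use of midpoints between consecutive distinct values as candidate boundaries, and the toy labelled samples are not taken from the slides.

```python
import math
from collections import Counter


def entropy(labels):
    """Class entropy Ent(S), in bits, for a list of class labels."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())


def best_split(samples):
    """Return (T, E(S, T)) for the boundary T that minimizes E(S, T).

    samples is a list of (value, class_label) pairs. Returns None when
    there are fewer than two distinct values, so no boundary exists.
    """
    values = sorted({v for v, _ in samples})
    best = None
    # Assumption: candidate boundaries are midpoints between consecutive
    # distinct attribute values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        s1 = [c for v, c in samples if v < t]   # samples with A < T
        s2 = [c for v, c in samples if v >= t]  # samples with A >= T
        e = (len(s1) * entropy(s1) + len(s2) * entropy(s2)) / len(samples)
        if best is None or e < best[1]:
            best = (t, e)
    return best


# Toy data (hypothetical class labels attached to a few prices).
samples = [(4, "low"), (8, "low"), (9, "low"), (15, "high"), (21, "high"), (24, "high")]
print(best_split(samples))  # (12.0, 0.0)
```

To build a full discretization, the same search would be repeated on each of the two resulting partitions until the stopping criterion is met.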
  • 15. Segmentation by natural partitioning: the 3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals.
      • If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 intervals (equi-width for 3, 6, and 9; grouped 2-3-2 for 7).
      • If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 equi-width intervals.
      • If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 equi-width intervals.
  • 16. Example: the 3-4-5 rule applied to the attribute profit (amounts in thousands of dollars, matching the values on slide 17).
      Step 1: Min = -$351, Low (i.e., 5th percentile) = -$159, High (i.e., 95th percentile) = $1,838, Max = $4,700.
      Step 2: msd = 1,000, Low = -$1,000, High = $2,000.
      Step 3: (-$1,000 to $2,000) is partitioned into (-$1,000 to 0), (0 to $1,000), ($1,000 to $2,000).
      Step 4: after adjusting for Min and Max, the range becomes (-$400 to $5,000).
      Resulting hierarchy:
      • (-$400 to 0): (-$400 to -$300), (-$300 to -$200), (-$200 to -$100), (-$100 to 0)
      • (0 to $1,000): (0 to $200), ($200 to $400), ($400 to $600), ($600 to $800), ($800 to $1,000)
      • ($1,000 to $2,000): ($1,000 to $1,200), ($1,200 to $1,400), ($1,400 to $1,600), ($1,600 to $1,800), ($1,800 to $2,000)
      • ($2,000 to $5,000): ($2,000 to $3,000), ($3,000 to $4,000), ($4,000 to $5,000)
  • 17. Steps of the example:
      • Step 1: Min = -$351,976, Max = $4,700,896, Low (5th percentile) = -$159,876, High (95th percentile) = $1,838,761.
      • Step 2: for Low and High, the most significant digit is at the $1,000,000 position. Rounding Low down gives -$1,000,000; rounding High up gives $2,000,000.
      • Step 3: the interval ranges over 3 distinct values at the most significant digit, since ($2,000,000 - (-$1,000,000)) / $1,000,000 = 3, so the 3-4-5 rule partitions it into 3 intervals: -$1,000,000 to $0, $0 to $1,000,000, and $1,000,000 to $2,000,000.
      • Step 4: examine the Min and Max values to see how they “fit” into the first-level partitions. The first partition covers the Min value, so its left boundary is adjusted to make the partition smaller (Min rounded down at its own most significant digit gives -$400,000). The last partition does not cover the Max value, so a new partition is created by rounding Max up to its next significant digit: $2,000,000 to $5,000,000.
      • Step 5: recursively, each interval can be further partitioned using the 3-4-5 rule to form the next lower level of the hierarchy.
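One level of the 3-4-5 rule can be sketched in Python as follows. The function name `partition_345` and its integer-only arithmetic are assumptions for illustration, and the 2-3-2 grouping for 7 distinct values follows the rule as it is commonly stated rather than anything shown on these slides.

```python
def partition_345(low, high, msd):
    """Split [low, high] by the 3-4-5 rule; msd is the most-significant-digit unit.

    Assumes integer inputs where (high - low) is a multiple of msd.
    """
    distinct = (high - low) // msd
    if distinct == 7:
        # Assumption: 7 distinct values are grouped 2-3-2, as the rule is
        # commonly stated.
        widths = [2 * msd, 3 * msd, 2 * msd]
    else:
        if distinct in (3, 6, 9):
            parts = 3
        elif distinct in (2, 4, 8):
            parts = 4
        elif distinct in (1, 5, 10):
            parts = 5
        else:
            raise ValueError(f"3-4-5 rule does not cover {distinct} distinct values")
        widths = [(high - low) // parts] * parts
    edges = [low]
    for w in widths:
        edges.append(edges[-1] + w)
    return list(zip(edges, edges[1:]))


# Step 3 of the profit example: 3 distinct values at the $1,000,000 digit.
print(partition_345(-1_000_000, 2_000_000, 1_000_000))
# [(-1000000, 0), (0, 1000000), (1000000, 2000000)]

# Step 5, applied to one first-level interval: 10 distinct values at $100,000.
print(partition_345(0, 1_000_000, 100_000))
# [(0, 200000), (200000, 400000), (400000, 600000), (600000, 800000), (800000, 1000000)]
```

The sketch covers only the partitioning step; rounding Low and High and adjusting for Min and Max (Steps 2 and 4) would still be done beforehand.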
  • 18. Concept hierarchy generation for categorical data:
      • Specification of a partial ordering of attributes explicitly at the schema level by users or experts.
        Example: a relational database may contain the attributes street, city, province_or_state, country. An expert defines the ordering of the hierarchy, such as street < city < province_or_state < country.
      • Specification of a portion of a hierarchy by explicit data grouping.
        Example, for province_or_state and country: {Alberta, Saskatchewan, Manitoba} ⊂ prairies_Canada and {British Columbia, prairies_Canada} ⊂ Western_Canada.
  • 19. Concept hierarchy generation for categorical data (continued):
      • Specification of a set of attributes, but not of their partial ordering.
        The attribute ordering is generated automatically, based on the observation that an attribute defining a high-level concept has a smaller number of distinct values than an attribute defining a lower-level concept.
        Example: country (15), province_or_state (365), city (3567), street (674,339).
      • Specification of only a partial set of attributes.
        Try to parse the database schema to determine the complete hierarchy.
  • 20. A concept hierarchy can be generated automatically based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.
      • country: 15 distinct values
      • province_or_state: 365 distinct values
      • city: 3567 distinct values
      • street: 674,339 distinct values
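The distinct-value heuristic amounts to a sort. A minimal Python sketch, using the counts from the example on slide 19 (the function name `order_hierarchy` is an assumption):

```python
def order_hierarchy(distinct_counts):
    """Order attributes from the top of the hierarchy (fewest distinct values) down."""
    return sorted(distinct_counts, key=distinct_counts.get)


counts = {
    "street": 674_339,
    "country": 15,
    "city": 3_567,
    "province_or_state": 365,
}
levels = order_hierarchy(counts)
print(" < ".join(reversed(levels)))  # street < city < province_or_state < country
```

Because this is only a heuristic, the generated ordering would normally be reviewed by a user or expert before it is relied on.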