Data Reduction:
Attribute Subset Selection
and Data Cube Aggregation
PREPARED BY: RAJAN SHAH
DMBI
SVIT, VASAD
Data Reduction
Data Reduction techniques can be applied to obtain
a reduced representation of the data set that is
much smaller in volume, yet closely maintains the
integrity of the original data.
That is, mining on the reduced data set should be
more efficient yet produce the same (or almost the
same) analytical results.
Data Reduction Strategies
1. Dimensionality Reduction
2. Numerosity Reduction
3. Data Compression
Dimensionality Reduction
Dimensionality Reduction is the process of
reducing the number of random variables or
attributes under consideration.
Attribute Subset Selection is a method of
dimensionality reduction in which irrelevant,
weakly relevant, or redundant attributes are
detected and removed.
Numerosity Reduction
These techniques replace the original data volume
by alternative, smaller forms of data
representation. They may be parametric or
non-parametric.
Parametric Methods: A model is used to estimate
the data, so that only the model parameters need to
be stored and not the actual data. These methods
assume that the data fits some model and estimate
the model parameters.
Examples: Regression and Log-Linear Models.
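As a rough illustration of the parametric idea, the sketch below fits a straight line to a small, made-up data set and keeps only the two fitted parameters. The function name `fit_line` and the sample values are assumptions introduced here, not part of the original slides.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w * x + b; returns the two parameters (w, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


# Hypothetical data: five (x, y) pairs that lie roughly on a line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

# Only w and b are stored; the y values are estimated from them on demand.
w, b = fit_line(xs, ys)
estimated_ys = [w * x + b for x in xs]
```

The reduction comes from storing two numbers instead of every y value; the estimates are only as good as the assumption that the data follows a line.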
Cont…
Non-Parametric Methods: Do not assume a model for
the data; they store reduced representations of
the data instead. Examples include Histograms,
Clustering, Sampling and Data Cube Aggregation.
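Two of the non-parametric methods listed above can be sketched in a few lines. The example assumes an equal-width histogram and a simple random sample without replacement; the `prices` list and the function name are made up for illustration.

```python
import random
from collections import Counter


def equal_width_histogram(values, bucket_width):
    """Count values per bucket; each bucket is keyed by its lower edge."""
    counts = Counter((v // bucket_width) * bucket_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical attribute values (20 prices).
prices = [1, 1, 5, 5, 5, 8, 10, 10, 12, 14,
          14, 15, 18, 20, 21, 25, 25, 28, 30, 30]

# Histogram: 20 values are replaced by 4 (bucket, count) pairs.
hist = equal_width_histogram(prices, 10)

# Sampling: keep 5 of the 20 values, chosen at random without replacement.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = rng.sample(prices, 5)
```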
Data Compression
Transformations are applied so as to obtain a
“COMPRESSED” representation of the original data.
If the original data can be reconstructed from the
compressed one without loss of any information, it
is called Lossless Data Reduction; otherwise, when
only an approximation of the original can be
reconstructed, it is called Lossy Data Reduction.
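The difference between the two cases can be shown with a minimal sketch. It uses Python's standard `zlib` module for the lossless case and plain rounding as a stand-in for a lossy transformation; the sample records and readings are invented for the example.

```python
import zlib

# Lossless: the exact original bytes come back after decompression.
original = b"2024-01-01,dept_10,5000\n" * 200  # hypothetical repetitive records
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# Lossy (illustrative stand-in): rounding discards the fractional part,
# so the original readings cannot be recovered from the rounded values.
readings = [20.13, 20.17, 21.48, 21.52]
rounded = [round(r) for r in readings]
```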
Attribute Subset Selection
Also known as Feature Selection, this is a
procedure to find a subset of features (relevant to
the mining task) that produces a “better” model for
the given dataset, i.e. it removes irrelevant or
redundant attributes that can slow down the mining
process.
AIM: To find a minimum set of attributes such that
the resulting probability distribution of the data
classes is as close as possible to the original
distribution obtained using all attributes.
Advantages
Mining on a reduced set of attributes reduces the
number of attributes appearing in the discovered
patterns, which helps to make the patterns easier
to detect and understand.
How To Find a GOOD Subset?
For n attributes, there are 2^n possible subsets, so
an exhaustive search is usually too expensive. The
methods applied are therefore “greedy”: while
searching through attribute space, they always make
what looks to be the locally best choice, in the
hope that it will lead to a globally optimal result.
The BEST and WORST attributes are typically
determined using tests of statistical significance,
which assume that the attributes are independent of
each other.
Information Gain can be used to evaluate attributes.
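A minimal sketch of how information gain could be computed for a categorical attribute is given below. The toy rows and the attribute names `outlook`, `windy` and `play` are assumptions made up for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical data: "outlook" fully determines "play", "windy" tells nothing.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]

gain_outlook = information_gain(rows, "outlook", "play")  # 1 bit
gain_windy = information_gain(rows, "windy", "play")      # 0 bits
```

An attribute with a higher gain is a better candidate to keep; one with a gain near zero is a candidate for removal.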
Methods: Stepwise Forward
Selection
It starts with no variables in
the model, tests the addition
of each variable using a chosen
model fit criterion, adds the
variable (if any) whose
inclusion gives the most
statistically significant
improvement of the fit, and
repeats this process until no
remaining variable improves
the model to a statistically
significant extent.
Example:
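The procedure can be sketched as follows. This is a simplified illustration, not the slide's original figure: the statistical significance test is replaced by a plain improvement threshold `min_gain`, and `toy_score` with its `CONTRIBUTION` table is a hypothetical stand-in for a real model fit criterion.

```python
def forward_selection(attributes, score, min_gain=0.01):
    """Greedily add the attribute that improves score() most, until none helps."""
    selected = []
    remaining = list(attributes)
    best_score = score(selected)
    while remaining:
        # Try each remaining attribute and keep the best candidate.
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score - best_score < min_gain:
            break  # no remaining attribute improves the fit enough
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected


# Hypothetical fit criterion: each attribute adds a fixed amount to the score.
CONTRIBUTION = {"A1": 0.5, "A2": -0.03, "A3": -0.01,
                "A4": 0.3, "A5": -0.02, "A6": 0.15}


def toy_score(subset):
    return sum(CONTRIBUTION[a] for a in subset)


reduced = forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score)
```

With this toy criterion the attributes are added in the order A1, A4, A6, after which no remaining attribute improves the score.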
Stepwise Backward
Elimination
It involves starting with all
candidate variables, testing
the deletion of each variable
using a chosen model fit
criterion, deleting the
variable (if any) whose loss
gives the most statistically
insignificant deterioration of
the model fit, and repeating
this process until no further
variables can be deleted
without a statistically
significant loss of fit.
Example:
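A matching sketch for backward elimination is shown below. As before, this is an illustration under assumptions: the significance test is approximated by a tolerance `max_loss` on the drop in score, and `toy_score` is a hypothetical fit criterion.

```python
def backward_elimination(attributes, score, max_loss=0.01):
    """Start with all attributes; repeatedly drop the one whose removal hurts least."""
    selected = list(attributes)
    current = score(selected)
    while selected:
        # Score the model with each attribute left out in turn.
        trials = {a: score([x for x in selected if x != a]) for a in selected}
        weakest = max(trials, key=trials.get)  # its removal hurts the fit least
        if current - trials[weakest] > max_loss:
            break  # every further removal costs too much
        selected.remove(weakest)
        current = trials[weakest]
    return selected


# Hypothetical fit criterion: each attribute adds a fixed amount to the score.
CONTRIBUTION = {"A1": 0.5, "A2": -0.03, "A3": -0.01,
                "A4": 0.3, "A5": -0.02, "A6": 0.15}


def toy_score(subset):
    return sum(CONTRIBUTION[a] for a in subset)


reduced = backward_elimination(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score)
```

With this toy criterion A2, A5 and A3 are removed in that order, and the procedure stops when removing any of A1, A4 or A6 would cost more than `max_loss`.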
Bi-Directional Selection and
Elimination
The stepwise forward
selection and backward
elimination methods can
be combined so that, at
each step, the procedure
selects the best attribute
and removes the worst
from among the
remaining attributes.
Example: Suppose that
when A1 (best) is
selected, A2 (worst) is
eliminated at the same
time. Similarly, when A4
is selected, A5 is
eliminated, and when A6
is selected, A3 is
eliminated, thus
forming the reduced set
{A1, A4, A6}.
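The combined procedure can be sketched in the same style. The scores below are hypothetical values chosen so that the run reproduces the selection and elimination order of the example above; the stopping threshold `min_gain` is likewise an assumption.

```python
def bidirectional_selection(attributes, score, min_gain=0.01):
    """At each step, select the best remaining attribute and drop the worst one."""
    selected, eliminated = [], []
    remaining = list(attributes)
    while remaining:
        best = max(remaining, key=lambda a: score(selected + [a]))
        if score(selected + [best]) - score(selected) < min_gain:
            break  # nothing left that is worth adding
        selected.append(best)
        remaining.remove(best)
        if remaining:
            # The worst attribute is the one that would help the fit least.
            worst = min(remaining, key=lambda a: score(selected + [a]))
            remaining.remove(worst)
            eliminated.append(worst)
    return selected, eliminated


# Hypothetical fit criterion: each attribute adds a fixed amount to the score.
CONTRIBUTION = {"A1": 0.5, "A2": -0.03, "A3": -0.01,
                "A4": 0.3, "A5": -0.02, "A6": 0.15}


def toy_score(subset):
    return sum(CONTRIBUTION[a] for a in subset)


selected, eliminated = bidirectional_selection(
    ["A1", "A2", "A3", "A4", "A5", "A6"], toy_score
)
```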
Decision Tree Induction
Decision Tree Induction constructs a flowchart-like
structure where each internal (non-leaf) node denotes
a test on an attribute, each branch corresponds to an
outcome of the test, and each external (leaf) node
denotes a class prediction.
At each node, the algorithm chooses the “best”
attribute to partition the data into individual classes.
All the attributes that do not appear in the tree are
assumed to be irrelevant, while the attributes that
appear in the tree form the reduced attribute subset.
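The idea can be sketched with a small ID3-style procedure that grows a tree and returns only the set of attributes it tested. Using information gain as the "best" attribute measure, the toy rows, and the attribute names `age`, `student`, `color` and `buys` are all assumptions made for this illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def gain(rows, attr, target):
    """Information gain of splitting rows on attr."""
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        part = [r[target] for r in rows if r[attr] == value]
        remainder += len(part) / len(rows) * entropy(part)
    return entropy([r[target] for r in rows]) - remainder


def attributes_used_by_tree(rows, attrs, target):
    """Grow an ID3-style tree and return the attributes tested at its nodes."""
    labels = [r[target] for r in rows]
    if len(set(labels)) <= 1 or not attrs:
        return set()  # pure node or nothing left to test: this is a leaf
    best = max(attrs, key=lambda a: gain(rows, a, target))
    if gain(rows, best, target) <= 1e-12:
        return set()  # no attribute separates the classes any further
    used = {best}
    rest = [a for a in attrs if a != best]
    for value in {r[best] for r in rows}:
        branch = [r for r in rows if r[best] == value]
        used |= attributes_used_by_tree(branch, rest, target)
    return used


# Hypothetical data: "buys" depends on age and student, never on color.
data = [
    {"age": "young", "student": "no", "color": "red", "buys": "no"},
    {"age": "young", "student": "no", "color": "blue", "buys": "no"},
    {"age": "young", "student": "yes", "color": "red", "buys": "yes"},
    {"age": "young", "student": "yes", "color": "blue", "buys": "yes"},
    {"age": "old", "student": "no", "color": "red", "buys": "yes"},
    {"age": "old", "student": "yes", "color": "blue", "buys": "yes"},
    {"age": "old", "student": "no", "color": "blue", "buys": "yes"},
    {"age": "old", "student": "yes", "color": "red", "buys": "yes"},
]

reduced_attributes = attributes_used_by_tree(data, ["age", "student", "color"], "buys")
```

Here `color` never appears in the tree, so it is treated as irrelevant and left out of the reduced attribute subset.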
Data Cube Aggregation
A data cube is generally used to make data easier to
interpret. It represents measures of business
interest organized along a set of dimensions. Each
dimension of a cube represents a certain
characteristic of the data in the database.
Data Cubes store multidimensional aggregated
information.
Data cubes provide fast access to precomputed,
summarized data, thereby benefiting online
analytical processing (OLAP) as well as data mining.
Categories of Data Cube
Dimensions: Represent
categories of data such
as time or location.
Each dimension includes
different levels of
categories.
Example:
Cont…
Measures: These are the
actual data values that
occupy the cells as
defined by the
dimensions selected.
Measures include facts or
variables typically stored
as numerical fields.
Example:
Cont…
Example: For a data set of employees with the attributes
dept_id and salary, a data cube can be used to aggregate
the data so that the result summarizes the total
salary for each dept_id.
The resulting data is smaller in volume, without loss
of the information necessary for the analysis task.
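The aggregation in this example can be sketched as a simple group-by. The employee records and the `emp_id` field are made up for illustration; only `dept_id` and `salary` come from the example above.

```python
from collections import defaultdict

# Hypothetical employee records.
employees = [
    {"emp_id": 1, "dept_id": "D1", "salary": 50000},
    {"emp_id": 2, "dept_id": "D1", "salary": 60000},
    {"emp_id": 3, "dept_id": "D2", "salary": 45000},
    {"emp_id": 4, "dept_id": "D3", "salary": 70000},
    {"emp_id": 5, "dept_id": "D3", "salary": 30000},
]


def total_salary_by_dept(rows):
    """Aggregate individual salaries into one total per dept_id."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["dept_id"]] += row["salary"]
    return dict(totals)


# Five detail rows are reduced to three summary rows.
summary = total_salary_by_dept(employees)
```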
Cont…
Concept Hierarchies may exist for each attribute,
allowing the analysis of data at multiple abstraction
levels.
The cube created at the lowest abstraction level is
called the Base Cuboid.
The cube created at the highest abstraction level is
called the Apex Cuboid.
A data cube can be 2-D, 3-D or of higher dimension.
When replying to data mining requests, the smallest
available cuboid relevant to the given task should be
used.
Example
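As a rough sketch of these ideas, the code below builds every cuboid of a small three-dimensional cube by grouping on each subset of the dimensions. The `sales` records, the dimension names and the helper functions are assumptions introduced for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows with three dimensions and one measure ("amount").
sales = [
    {"year": 2023, "city": "C1", "item": "TV", "amount": 100},
    {"year": 2023, "city": "C1", "item": "Phone", "amount": 50},
    {"year": 2023, "city": "C2", "item": "TV", "amount": 70},
    {"year": 2024, "city": "C1", "item": "TV", "amount": 120},
    {"year": 2024, "city": "C2", "item": "Phone", "amount": 60},
]
DIMENSIONS = ("year", "city", "item")


def cuboid(rows, dims, measure="amount"):
    """Sum the measure for each combination of values of the given dimensions."""
    cells = defaultdict(int)
    for row in rows:
        cells[tuple(row[d] for d in dims)] += row[measure]
    return dict(cells)


def all_cuboids(rows, dims):
    """One cuboid per subset of dimensions: 2^n cuboids for n dimensions."""
    return {
        subset: cuboid(rows, subset)
        for k in range(len(dims), -1, -1)
        for subset in combinations(dims, k)
    }


cube = all_cuboids(sales, DIMENSIONS)
base_cuboid = cube[DIMENSIONS]  # lowest abstraction level: all dimensions kept
apex_cuboid = cube[()]          # highest abstraction level: one grand total
by_year = cube[("year",)]       # smallest cuboid that answers "total per year"
```

A request for total sales per year only needs the 1-D `by_year` cuboid with two cells, rather than the base cuboid with five.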