A Review of Distributed Decision Tree Induction




Gregory Gay
West Virginia University
greg@greggay.com
Data Mining 101
●   Data mining = finding patterns in heaps of data.
●   Classification = given data and a series of possible labels, gather evidence and assign a label to each piece of data.
Decision Tree Induction
●   These algorithms greedily and recursively select the attribute that provides the most information gain.
●   This attribute is used to partition the data set into subsets around a particular value.
●   Recursion stops when a leaf node contains only data with a single class label.
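The greedy recursion above can be sketched in Python (a minimal ID3-style sketch over discrete attributes; the function and variable names are illustrative, and real systems such as C4.5 add pruning and continuous-attribute handling on top of this skeleton):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr):
    """Entropy reduction from partitioning rows on attr."""
    base = entropy([label for _, label in rows])
    n = len(rows)
    for value in {feats[attr] for feats, _ in rows}:
        subset = [label for feats, label in rows if feats[attr] == value]
        base -= (len(subset) / n) * entropy(subset)
    return base

def induce_tree(rows, attributes):
    """Greedy, recursive tree induction; rows are (feature_dict, label) pairs."""
    labels = [label for _, label in rows]
    if len(set(labels)) == 1:           # node is pure: stop and emit a leaf
        return labels[0]
    if not attributes:                  # out of attributes: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a))
    branches = {}
    for value in {feats[best] for feats, _ in rows}:
        subset = [(f, l) for f, l in rows if f[best] == value]
        branches[value] = induce_tree(subset, [a for a in attributes if a != best])
    return (best, branches)

def classify(tree, feats):
    """Walk the tree until a leaf (a bare label) is reached."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[feats[attr]]
    return tree
```

On any small discrete data set, the induced tree reproduces the training labels exactly, since the recursion only stops at pure (or exhausted) nodes.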
Splitting the Data
●   Split data based on the (attribute, value) pair that maximizes the impurity gain:

        Δ(a, v) = I(T) − Σᵢ (|Tᵢ| / |T|) · I(Tᵢ)

●   I(T) is an impurity measure. This is usually entropy (C4.5) or gini (SMP):

        entropy(T) = − Σⱼ pⱼ log₂ pⱼ          gini(T) = 1 − Σⱼ pⱼ²

    where pⱼ is the fraction of examples in T belonging to class j.
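The two impurity measures and the gain computation can be sketched directly from the definitions above (helper names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy impurity, as used by C4.5."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity, as used by the SMP classifier."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_gain(parent, subsets, measure):
    """I(T) minus the size-weighted impurity of the subsets."""
    n = len(parent)
    return measure(parent) - sum(len(s) / n * measure(s) for s in subsets)
```

For a perfectly separating split of a balanced two-class node, entropy gain is 1.0 and gini gain is 0.5; both measures rank candidate splits similarly, which is why either works in the Δ(a, v) formula.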
Climbing Mount Everest
●   The success of data mining is its downfall.
●   Fast, accurate results are expected on huge data sets.
●   The answer: parallelism of the data and/or the algorithm.
Paradigm A: Data Parallelism
●   Given the large size of these data sets,
    gathering all of the data in a centralized location
    or in a single file is neither desirable nor always
    feasible because of the bandwidth and storage
    requirements.
●   Also, data may be fragmented in order to
    address privacy and security constraints.
●   In these cases, there is a need for algorithms
    that can learn from fragmented data.
●   Two methods: horizontal and vertical fragmentation.
Horizontal Fragmentation
●   The entries are equally divided into a number of
    subsets equal to the number of different storage
    sites.
●   The learner uses frequency counts to find the
    attribute that yields the most information gain to
    further partition the set of examples.
●   Given: |E| = #examples in the data set T, |A| = #attributes, V = max(values per attribute), M = #sites, N = #classes, size(D) = #nodes in tree D
●   Time complexity = |E||A|·size(D)
●   Communication complexity = N|A|VM·size(D)
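The frequency-count exchange can be sketched as follows (an assumed sketch of the idea, not the exact protocol from the paper: each site summarizes its horizontal fragment as (attribute value, class) counts, and a coordinator merges the summaries to score the split without moving any raw rows):

```python
import math
from collections import Counter

def local_counts(rows, attr):
    """Per-site summary: (attribute value, class label) -> count."""
    return Counter((feats[attr], label) for feats, label in rows)

def entropy_from_counts(class_counts):
    """Entropy computed from class frequency counts alone."""
    n = sum(class_counts.values())
    return -sum((c / n) * math.log2(c / n) for c in class_counts.values())

def global_gain(site_summaries):
    """Information gain computed only from the merged count summaries."""
    merged = sum(site_summaries, Counter())
    total = Counter()            # class -> count over all sites
    by_value = {}                # attribute value -> class counts
    for (value, label), c in merged.items():
        total[label] += c
        by_value.setdefault(value, Counter())[label] += c
    n = sum(total.values())
    gain = entropy_from_counts(total)
    for counts in by_value.values():
        gain -= (sum(counts.values()) / n) * entropy_from_counts(counts)
    return gain
```

Only the count tables (size ≤ N·V per attribute per site) cross the network, which is where the N|A|VM·size(D) communication term comes from.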
Vertical Fragmentation
●   Individual attributes or subsets of attributes,
    along with a list of (value, identifier) pairs for
    each are distributed to sites.
●   Can suffer from load imbalance and poor
    scalability.
●   Given: |E| = #examples in the data set T, |A| = #attributes, V = max(values per attribute), M = #sites, N = #classes, size(D) = #nodes in tree D
●   Time complexity = |E||A|M·size(D)
●   Communication complexity = (|E| + N|A|V)·M·size(D)
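Under this scheme, each site holds one attribute's (value, identifier) list. A minimal sketch (names and data are illustrative) of building such a list and of why the identifiers matter: a split decided at one site can be communicated to the others as a set of record ids, rather than as data:

```python
def attribute_list(rows, attr):
    """The (value, record id) pairs for one attribute, shipped to one site."""
    return [(feats[attr], rid) for rid, (feats, _) in enumerate(rows)]

def ids_matching(att_list, test):
    """Record ids passing a split test; only ids need to cross the network."""
    return {rid for value, rid in att_list if test(value)}

rows = [({"age": 25, "income": 40}, "no"),
        ({"age": 52, "income": 90}, "yes")]
site_a = attribute_list(rows, "age")       # held at site A
site_b = attribute_list(rows, "income")    # held at site B
left_ids = ids_matching(site_a, lambda v: v < 40)
```

The |E| term in the communication complexity reflects exactly these id lists being exchanged at every split.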
When Distributed Data Wins
●   It would be possible to collect the fragments and reassemble them in a centralized location. So why should we leave the data at distributed sites?
●   Privacy or security – the learner must operate solely on statistical summaries.
●   Distributed versions of algorithms compare favorably with the corresponding centralized technique whenever their communication cost is less than the cost of collecting all of the data in one place.
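That break-even rule can be made concrete with the complexity expressions from the earlier slides (a sketch only; the cost model and its unit constant are illustrative, not measured):

```python
def prefer_distributed(E, A, V, M, N, size_D, bytes_per_entry=1):
    """True when shipping count summaries beats shipping the raw data.

    Distributed comm cost ~ N*A*V*M*size(D) counts exchanged as the
    tree grows (horizontal scheme); centralizing costs ~ E*A raw
    values moved once."""
    comm_distributed = N * A * V * M * size_D
    comm_centralize = E * A * bytes_per_entry
    return comm_distributed < comm_centralize
```

With ten million examples and a modest tree, the summaries win easily; with a small data set, just shipping the rows is cheaper, which matches the slide's criterion.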
Paradigm B: Task Parallelism
●   The construction of the decision tree itself is parallelized.
●   A single process begins construction. When the #child nodes = #available processors, the nodes are split among the processors, and each proceeds with the tree construction algorithm on its own subtree.
●   Implementing parallel algorithms for decision tree classification is a difficult task for multiple reasons.
Paradigm C (B.5?): Hybrid Parallelism
●   Most parallel algorithms don't practice strict task parallelism.
●   Instead – a parallelized algorithm over split data.
●   Some (parallelized C4.5) switch between data/task parallelism when communication cost is too high.
●   Others always operate in both modes (SMP, INDUS).
Algorithm: C4.5 Parallel
●   A parallel implementation of Ross Quinlan's C4.5 decision tree algorithm.
●   Data parallelism at the beginning, task parallelism at the lower nodes of the tree.
●   Horizontal fragmentation scheme. Uniform load balance is achieved by equally distributing data among the processors and using a breadth-first strategy to build the actual decision tree. When a processor finishes exploring its nodes, it sends a request for more nodes.
●   Each processor is responsible for building its own attribute lists and class list.
●   The continuous attribute lists are globally sorted; each processor has a list of sorted values.
●   Before evaluating the possible split points in their entries, the distributions must reflect the data entries assigned to the other processors.
    –   Discrete: store the data everywhere.
    –   Continuous: the gain is calculated based on the distributions before and after the split point.
●   After evaluating all of the local data, the processors communicate among themselves in order to find the best split.
●   Horizontal data fragmentation is used as long as the #examples covered by a node > a pre-defined threshold (the point at which the communication cost of building a node exceeds the cost of building it locally plus the cost of moving the set of examples between processors).
●   When #data entries < threshold, they are evenly distributed among the processors.
Algorithm: SMP Tree Classifier
●   SMP clusters = shared-memory nodes with a
    small number of processors (usually 2-8)
    connected together with a high-speed
    interconnect.
●   Two-tiered architecture where a combination of
    shared-memory and distributed-memory
    algorithms can be deployed.
●   The training dataset is horizontally fragmented across the SMP nodes so each node carries out tree construction using an equal subset of the overall data set.
●   Attribute lists are dynamically scheduled to take advantage of parallel light-weight threads.
●   One of these threads is designated as the master thread. It is responsible for processing attribute lists as well as exchanging data and information between SMP nodes.
●   The numerical attribute lists are globally sorted, such that the first node has the first portion of the data set.
●   When a tree node is split into its children, the local attribute lists are partitioned among those children. Thus, no communication costs are incurred due to the attribute lists as the tree is grown in each SMP node.
●   Finding the global best split causes synchronization.
●   The algorithm proceeds in a breadth-first manner, processing all of the leaf nodes before processing any of the new child nodes.
●   Before finding the new best split point, attribute lists are inserted into a queue shared by all of the threads running on the SMP node. This queue is used to schedule attribute lists to threads.
●   Each leaf node calculates its split points and compares them with those of all the other leaf nodes.
●   SMP node 0 can then compute the overall best split point for all attributes and broadcast it to all of the SMP nodes.
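The shared work queue can be sketched with Python threads (a toy sketch: the per-attribute candidate-split gains are precomputed constants here, standing in for the real split-point evaluation each thread would perform):

```python
import queue
import threading

def best_local_split(work, results):
    """Worker thread: pull attribute lists off the shared queue and score them."""
    while True:
        try:
            attr, gains = work.get_nowait()
        except queue.Empty:
            return
        # gains stands in for evaluating every candidate split point of attr
        results.append((max(gains), attr))

work = queue.Queue()
for item in [("age", [0.2, 0.5]), ("income", [0.4]), ("zip", [0.1])]:
    work.put(item)

results = []
threads = [threading.Thread(target=best_local_split, args=(work, results))
           for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

best = max(results)  # reduction step: the overall best split across attributes
```

Whichever thread grabs which attribute list, every list is scored exactly once and the final `max` is the synchronization point the slide describes.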
INDUS
●   INDUS is a multi-agent system used for knowledge acquisition from data sources.
●   It can deal with horizontal or vertical data and only needs statistical summaries of the data.
●   Designed to provide a unified query interface over a set of distributed, heterogeneous, and autonomous data sources, as if the dataset were a single table.
●   These queries are driven by an internal tree learner that issues them whenever a new node needs to be added to the tree.
Summary
●   We want to classify data. Decision tree classifiers are a good way to do it.
●   We need to classify BIG data sets.
●   We can improve performance by parallelizing the data (horizontally or vertically) or the algorithm – often both.
●   We looked at a few algorithms that do this.
●   They all perform well and show a promising improvement over standard methods.
Now you can climb the mountain!
Questions? Comments?
●   Paper: http://greggay.com/pdf/dsys1.pdf
●   greg@greggay.com
●   http://twitter.com/Greg4cr
●   http://facebook.com/greg.gay

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 

Distributed Decision Tree Induction

  • 1. A Review of Distributed Decision Tree Induction Gregory Gay West Virginia University greg@greggay.com
  • 2. Data Mining 101 ● Data mining = finding patterns in heaps of data ● Classification = Given data and a series of possible labels, gather evidence and assign a label to each piece of data.
  • 3. Decision Tree Induction ● These algorithms greedily and recursively select the attribute that provides the most information gain. ● The chosen attribute is used to partition the data set into subsets around a particular value. ● Recursion stops when a leaf node contains only data with a single class label.
  • 4. Splitting the Data ● Split the data on the (attribute, value) pair that maximizes the gain: Gain(T, A) = I(T) − Σi (|Ti|/|T|) · I(Ti), where the Ti are the subsets produced by the split. ● I(T) is an impurity measure, usually entropy (C4.5): I(T) = −Σc pc log2 pc, or gini (SMP): I(T) = 1 − Σc pc², where pc is the fraction of examples in T with class c.
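The impurity measures named above can be sketched in a few lines of Python (a minimal illustration, not from the slides; the toy labels and function names are our own):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy impurity, the measure used by C4.5."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity, the measure used by the SMP classifier."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def info_gain(labels, partitions, impurity=entropy):
    """Gain from splitting `labels` into the given partitions."""
    n = len(labels)
    return impurity(labels) - sum(len(p) / n * impurity(p) for p in partitions)

# A perfectly separating split recovers all of the entropy (gain = 1 bit here).
labels = ["yes", "yes", "no", "no"]
gain = info_gain(labels, [["yes", "yes"], ["no", "no"]])
```

Swapping `impurity=gini` into `info_gain` gives the SMP-style criterion with no other changes.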
  • 5. Climbing Mount Everest ● The success of data mining is its downfall. ● Fast, accurate results are expected on huge data sets. ● Parallelism of data and/or the algorithm.
  • 6. Paradigm A: Data Parallelism ● Given the large size of these data sets, gathering all of the data in a centralized location or in a single file is neither desirable nor always feasible, because of the bandwidth and storage requirements. ● Also, data may be fragmented in order to address privacy and security constraints. ● In these cases, there is a need for algorithms that can learn from fragmented data. ● Two methods: horizontal and vertical fragmentation.
  • 7. Horizontal Fragmentation ● The entries are equally divided into a number of subsets equal to the number of storage sites. ● The learner uses frequency counts to find the attribute that yields the most information gain to further partition the set of examples. ● Given: |E| = #examples in the data set T, |A| = #attributes, V = max(values per attribute), M = #sites, N = #classes, size(D) = #nodes in tree D ● Time complexity = |E||A|size(D) ● Communication complexity = N|A|VM·size(D)
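Under horizontal fragmentation, each site can report only local frequency counts, and a coordinator sums them to evaluate splits without moving raw rows. A hypothetical sketch (the site data and names are invented for illustration):

```python
from collections import Counter

# Hypothetical horizontal fragments: each site holds a subset of the rows,
# shown here as (attribute_value, class_label) pairs for one candidate attribute.
site_a = [("sunny", "no"), ("rainy", "yes"), ("sunny", "yes")]
site_b = [("sunny", "no"), ("overcast", "yes")]

def local_counts(rows):
    """A site reports only (value, class) frequency counts, not raw rows."""
    return Counter(rows)

def merge_counts(per_site_counts):
    """The coordinator sums per-site counts; gain is computed from the totals."""
    total = Counter()
    for counts in per_site_counts:
        total.update(counts)
    return total

merged = merge_counts([local_counts(site_a), local_counts(site_b)])
```

Only the count messages cross the network, which is where the N|A|VM·size(D) communication term comes from.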
  • 8. Vertical Fragmentation ● Individual attributes, or subsets of attributes, along with a list of (value, identifier) pairs for each, are distributed to sites. ● Can suffer from load imbalance and poor scalability. ● Given: |E| = #examples in the data set T, |A| = #attributes, V = max(values per attribute), M = #sites, N = #classes, size(D) = #nodes in tree D ● Time complexity = |E||A|M·size(D) ● Communication complexity = (|E| + N|A|V)M·size(D)
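Vertical fragmentation can be pictured as each site holding one attribute's (value, identifier) list, with shared record ids tying the fragments back together. A hypothetical sketch (attribute names and values are invented):

```python
# Hypothetical vertical fragments: each site stores one attribute's
# (value, record_id) pairs; the record ids are shared across all sites.
site_outlook = {"outlook": [("sunny", 0), ("sunny", 1), ("rainy", 2)]}
site_wind = {"wind": [("high", 0), ("low", 1), ("low", 2)]}

def ids_with(site, attribute, value):
    """Record ids at this site whose attribute equals the given value."""
    return {rid for v, rid in site[attribute] if v == value}

# Combining answers from two sites by id, without moving whole columns:
sunny_and_high = ids_with(site_outlook, "outlook", "sunny") & ids_with(site_wind, "wind", "high")
```

Because split evaluation for one attribute stays on one site, an attribute with a long list can leave other sites idle, which is the load-imbalance problem the slide mentions.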
  • 9. When Distributed Data Wins ● It would be possible to collect the fragments and reassemble them in a centralized location. So why should we leave the data at distributed sites? ● Privacy or security: the learner must operate solely on statistical summaries. ● Distributed versions of algorithms compare favorably with the corresponding centralized technique whenever their communication cost is less than the cost of collecting all of the data in one place.
  • 10. Paradigm B: Task Parallelism ● The construction of decision trees in parallel. ● A single process begins the construction. When the #child nodes = #available processors, the nodes are split among them; each processor then proceeds with the tree construction algorithm on its own subtree. ● Implementing parallel algorithms for decision tree classification is a difficult task for multiple reasons.
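The hand-off policy described above (grow the tree in one process until there are as many child nodes as workers, then give each child to its own worker) can be sketched as follows; threads stand in for processors, and `build_subtree` is a placeholder for the real recursive induction step:

```python
from concurrent.futures import ThreadPoolExecutor

def build_subtree(node):
    # Placeholder for the recursive tree-construction algorithm run on one node.
    return f"subtree rooted at {node}"

def parallel_build(children, n_workers):
    """Once #child nodes >= #workers, each worker grows its own subtree."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(build_subtree, children))

subtrees = parallel_build(["left", "right"], n_workers=2)
```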
  • 11. Paradigm C (B.5?): Hybrid Parallelism ● Most parallel algorithms don't practice strict task parallelism. ● Instead, they run a parallelized algorithm over split data. ● Some (parallel C4.5) switch between data/task parallelism when the communication cost grows too high. ● Others always operate in both modes (SMP, INDUS).
  • 12. Algorithm: C4.5 Parallel ● A parallel implementation of Ross Quinlan's C4.5 decision tree algorithm. ● Data parallelism at the beginning, task parallelism at the lower nodes of the tree. ● Horizontal fragmentation scheme. Uniform load balance is achieved by equally distributing data among the processors and using a breadth-first strategy to build the actual decision tree. When a processor finishes exploring its nodes, it sends a request for more.
  • 13. Each processor is responsible for building its own attribute lists and class list. ● The continuous attribute lists are globally sorted; each processor holds a list of sorted values. ● Before evaluating the possible split points in its entries, a processor's distributions must reflect the data entries assigned to the other processors. – Discrete: the counts are stored everywhere. – Continuous: the gain is calculated from the distributions before and after the split point. ● After evaluating all of the local data, the processors communicate among themselves to find the best split.
  • 14. Horizontal data fragmentation is used as long as the #examples covered by a node exceeds a pre-defined threshold (the threshold marks the point where the communication cost of building a node exceeds the cost of building it locally plus the cost of moving the set of examples between processors). ● When the #data entries falls below the threshold, the entries are evenly distributed among the processors.
  • 15. Algorithm: SMP Tree Classifier ● SMP clusters = shared-memory nodes with a small number of processors (usually 2–8) connected by a high-speed interconnect. ● A two-tiered architecture where a combination of shared-memory and distributed-memory algorithms can be deployed. ● The training data set is horizontally fragmented across the SMP nodes, so each node carries out tree construction using an equal subset of the overall data set.
  • 16. Attribute lists are dynamically scheduled to take advantage of parallel light-weight threads. ● One of these threads is designated the master thread; it is responsible for processing attribute lists as well as exchanging data and information between SMP nodes. ● The numerical attribute lists are globally sorted, such that the first node holds the first portion of the data set. ● When a tree node is split into its children, the local attribute lists are partitioned among those children. Thus, no communication costs are incurred for the attribute lists as the tree is grown in each SMP node.
  • 17. Finding the global best split requires synchronization. ● The algorithm proceeds in a breadth-first manner, processing all of the current leaf nodes before processing any of the new child nodes. ● Before finding the new best split point, attribute lists are inserted into a queue shared by all of the threads running on the SMP node. This queue is used to schedule attribute lists to threads. ● Each leaf node calculates its split points and compares them with those of the other leaf nodes. ● SMP node 0 can then compute the overall best split point over all attributes and broadcast it to all of the SMP nodes.
  • 18. INDUS ● INDUS is a multi-agent system used for knowledge acquisition from data sources. ● It can handle horizontally or vertically fragmented data and needs only statistical summaries of the data. ● Designed to provide a unified query interface over a set of distributed, heterogeneous, and autonomous data sources, as if the data set were a single table. ● These queries are driven by an internal tree learner that issues them whenever a new node needs to be added to the tree.
  • 19. Summary ● We want to classify data; decision tree classifiers are a good way to do it. ● We need to classify BIG data sets. ● We can improve performance by parallelizing the data (horizontally or vertically) or the algorithm, and often both. ● We looked at a few algorithms that do this. ● They all perform well and show a promising improvement over standard methods.
  • 20. Now you can climb the mountain!
  • 21. Questions? Comments? ● Paper: http://greggay.com/pdf/dsys1.pdf ● greg@greggay.com ● http://twitter.com/Greg4cr ● http://facebook.com/greg.gay