Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
1. Yggdrasil: Faster Decision Trees Using Column Partitioning in Spark
Firas Abuzaid, Joseph Bradley, Feynman Liang, Andrew Feng, Lee Yang, Matei Zaharia, Ameet Talwalkar
MIT, UCLA, Databricks, Yahoo!
3. Why Use Decision Trees?
• Arbitrarily expressive, but easy to interpret
• Simple to tune
• Natural support for different feature types
• Can easily extend to ensembles and boosting techniques
4. Why Use Deep Decision Trees?
• Today’s datasets are growing both in size and dimension
– More rows and more columns
• To model high-dimensional data, we need to consider more splits
– More splits ⇒ deeper trees
5. Decision Trees in Spark
• Training set is partitioned by row
• Histograms are used to compute candidate splits
• Workers compute partial histograms on their subsets of rows
• Master aggregates the partial histograms and picks the best feature to split on
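The row-partitioned scheme above can be sketched as follows. This is illustrative code, not MLlib's internals: each worker bins its rows' feature values and counts labels per bin, and the master element-wise sums the partial histograms (all names here are made up for the sketch).

```scala
// Sketch of row-partitioned histogram aggregation (binary labels).
object HistogramSketch {
  // Each worker bins its rows' (featureValue, label) pairs into numBins
  // equal-width bins over [min, max] and counts labels per bin.
  def partialHistogram(rows: Seq[(Double, Int)], numBins: Int,
                       min: Double, max: Double): Array[Array[Int]] = {
    val hist = Array.fill(numBins, 2)(0) // 2 label counts per bin
    for ((value, label) <- rows) {
      val bin = math.min(numBins - 1,
        (((value - min) / (max - min)) * numBins).toInt)
      hist(bin)(label) += 1
    }
    hist
  }

  // The master element-wise sums the workers' partial histograms,
  // then evaluates candidate splits on the merged counts.
  def merge(a: Array[Array[Int]], b: Array[Array[Int]]): Array[Array[Int]] =
    a.zip(b).map { case (x, y) => x.zip(y).map { case (u, v) => u + v } }
}
```

Note that a full histogram (numBins × numClasses counts) must be shipped to the master for every feature and every tree node being split, which is where the communication cost on the next slide comes from.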
6. What’s wrong with row-partitioning?
1. High communication costs, especially for deep trees and many features
– Cost is exponential in the depth of the tree and linear in the number of features
2. To reduce communication, only a small number of thresholds is considered
– Approximate split-finding using histograms
– User specifies the number of thresholds
– NB: the optimal split may not always be found
10. Yggdrasil: The New Approach
• Workers compute sufficient stats on entire columns
• Master has one job: pick the best global split
• No approximation; all thresholds are considered
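Because each worker holds whole columns, it can consider every threshold exactly rather than a histogram approximation. The sketch below (illustrative, not Yggdrasil's actual API) sorts one column once, then scans every boundary between distinct values while maintaining running class counts, the "sufficient stats", using Gini impurity to score splits.

```scala
// Sketch: exact best-split search on a single column.
object ExactSplit {
  def gini(counts: Array[Int]): Double = {
    val n = counts.sum.toDouble
    if (n == 0) 0.0 else 1.0 - counts.map(c => (c / n) * (c / n)).sum
  }

  // Returns (bestThreshold, weightedImpurity) over ALL candidate splits
  // of (featureValue, label) pairs -- no user-specified bin count needed.
  def bestSplit(column: Seq[(Double, Int)],
                numClasses: Int): (Double, Double) = {
    val sorted = column.sortBy(_._1).toIndexedSeq
    val right  = Array.fill(numClasses)(0)
    sorted.foreach { case (_, y) => right(y) += 1 }
    val left = Array.fill(numClasses)(0)
    var best = (Double.NaN, Double.MaxValue)
    val n = sorted.size.toDouble
    for (i <- 0 until sorted.size - 1) {
      val (v, y) = sorted(i)
      left(y) += 1; right(y) -= 1
      if (v != sorted(i + 1)._1) { // split only between distinct values
        val impurity =
          ((i + 1) / n) * gini(left) + ((n - i - 1) / n) * gini(right)
        if (impurity < best._2)
          best = ((v + sorted(i + 1)._1) / 2, impurity)
      }
    }
    best
  }
}
```

After the master picks the winning (feature, threshold) pair, only a bit vector describing the resulting left/right partition needs to be communicated, not per-feature histograms.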
11. Yggdrasil: The New Approach
Communication cost is much lower for deep trees and many features
13. Optimizations in Yggdrasil
• Columnar compression using RLE
– Train directly on compressed columns without decompressing
• Label encoding for fewer cache misses
• Sparse bit vectors to reduce communication overhead
16. Results
• 2 million rows
• For Yggdrasil, communication cost is independent of the number of features
17. Results
• 2 million rows
• Yggdrasil empirically outperforms Spark MLlib for deep trees and many features
• 8.5x speedup
18. Using Yggdrasil
• Available as a Spark package
• Direct integration with Spark ML Pipeline API
• Support for various inputs: DataFrame, RDD[LabeledPoint], or Parquet files
19. Using Yggdrasil
Before:
val dt = new DecisionTreeClassifier()
.setFeaturesCol("indexedFeatures")
.setLabelCol(labelColName)
.setMaxDepth(params.maxDepth)
.setMaxBins(params.maxBins)
.setMinInstancesPerNode(params.minInstancesPerNode)
.setMinInfoGain(params.minInfoGain)
.setCacheNodeIds(params.cacheNodeIds)
.setCheckpointInterval(params.checkpointInterval)
20. Using Yggdrasil
After:
val dt = new YggdrasilClassifier()
.setFeaturesCol("indexedFeatures")
.setLabelCol(labelColName)
.setMaxDepth(params.maxDepth)
.setMaxBins(params.maxBins)
.setMinInstancesPerNode(params.minInstancesPerNode)
.setMinInfoGain(params.minInfoGain)
.setCacheNodeIds(params.cacheNodeIds)
.setCheckpointInterval(params.checkpointInterval)
21. Why should I have to choose?
• If Spark MLlib is better for shallow trees and few features…
• And Yggdrasil is better for deeper trees and many features…
• Why can’t I have both?
22. Future Work
• You should be able to have both!
• Next steps:
– Merge Yggdrasil into Spark MLlib v2.x
– Add a decision rule that automatically chooses the best partitioning strategy for you:
• by row (MLlib v1.6); or
• by column (Yggdrasil)
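A decision rule like the one proposed could compare rough communication-cost estimates, following the trends stated earlier: row-partitioning cost grows exponentially with tree depth and linearly with the number of features, while column-partitioning cost is dominated by the number of rows. The formulas and constants below are purely hypothetical, not the actual model planned for MLlib.

```scala
// Hypothetical partitioning chooser based on rough cost estimates.
object PartitioningChooser {
  sealed trait Strategy
  case object ByRow    extends Strategy // MLlib v1.6 style
  case object ByColumn extends Strategy // Yggdrasil style

  def choose(numRows: Long, numFeatures: Long, maxDepth: Int,
             numBins: Long = 32): Strategy = {
    // Rough per-training communication, in "units sent to the master":
    // row-partitioning ships per-node, per-feature histograms...
    val rowCost = (1L << maxDepth) * numFeatures * numBins
    // ...while column-partitioning ships a row-length bit vector per level.
    val colCost = numRows * maxDepth.toLong
    if (rowCost <= colCost) ByRow else ByColumn
  }
}
```

Under this sketch, a shallow tree on few features picks row partitioning, while a deep tree on thousands of features picks column partitioning, matching the tradeoff described on the previous slide.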