SlideShare uma empresa Scribd logo
1 de 31
Baixar para ler offline
Apache Spark MLlib's
past trajectory and new directions
Joseph K. Bradley
Spark Summit 2017
2
About me
Software engineer at Databricks
Apache Spark committer & PMC member
Ph.D. Carnegie Mellon in Machine Learning
3
TEAM
About Databricks
Started Spark project (now Apache Spark) at UC Berkeleyin 2009
33
PRODUCT
Unified Analytics Platform
MISSION
Making Big Data Simple
4
UNIFIED ANALYTICS PLATFORM
Try Apache Spark in Databricks!
• Collaborative cloud environment
• Free version (community edition)
44
DATABRICKS RUNTIME 3.0
• Apache Spark - optimized for the cloud
• Caching and optimization layer - DBIO
• Enterprise security - DBES
Try for free today.
databricks.com
5
5 years ago…
Zaharia	et	al.		
NSDI,	2012
Scalable ML
using Spark!
6
3 ½ years ago until today
People added some algorithms…
Classification
• Logisticregression & linear SVMs
• L1, L2, or elastic net regularization
• Decision trees
• Random forests
• Gradient-boosted trees
• Naive Bayes
• Multilayer perceptron
• One-vs-rest
• Streaming logisticregression
Regression
• Least squares
• L1, L2, or elastic net regularization
• Decision trees
• Random forests
• Gradient-boosted trees
• Isotonicregression
• Streaming least squares
Feature extraction, transformation & selection
• Binarizer
• Bucketizer
• Chi-Squared selection
• CountVectorizer
• Discrete cosine transform
• ElementwiseProduct
• Hashing term frequency
• Inverse document frequency
• MinMaxScaler
• Ngram
• Normalizer
• VectorAssembler
• VectorIndexer
• VectorSlicer
• Word2Vec
Clustering
• Gaussian mixture model
• K-Means
• Streaming K-Means
• Latent Dirichlet Allocatio
• Power Iteration Clusterin
Recommendation
• Alternating Least Squares (ALS)
Frequent itemsets
• FP-growth
• Prefixspan
Statistics
• Pearson correlation
• Spearman correlation
• Online summarization
• Chi-squared test
• Kernel density estimation
Linear Algebra
• Local & distributed dense &sparse matrices
• Matrix decompositions (PCA, SVD, QR)
7
Today
Apache Spark is widely used for ML.
• Many production use cases
• 1000s of commits
• 100s of contributors
• 10,000s of users
(on Databricks alone)
8
This talk
What?
How has MLlib developed, and what has it become?
So what?
What are the implications for the dev & user communities?
Now what?
Where could or should it go?
9
What?
How has MLlib developed, and what has it become?
1
Let’s do some data analytics
50+ algorithms and featurizers
Rapid development
Shift from adding algorithms to improving algorithms & infrastructure
Growing dev community
1
1
1
1
Major projects
0.8 master2.12.01.61.50.9 1.0 1.1 1.2 1.3 1.4
DataFrame-based API
ML persistence
Trees & ensembles
GLMs
SparkR
Pipelines Featurizers
Note: These are unofficial “project” labels basedon JIRA/PR activity.
1
So what?
What are the implications for the dev & usercommunities?
1
Integrating ML with the Big Data world
Seamless integration with SQL, DataFrames, Graphs, Streaming,
both within MLlib…
Data sources
CSV, JSON,Parquet, …
DataFrames
Simple, scalable
pre/post-processing
Graphs
Graph-based ML
implementations
Streaming
ML models deployed
with Streaming
1
Integrating ML with the Big Data world
Seamless integration with SQL, DataFrames, Graphs, Streaming,
both within MLlib…and outside MLlib
E.g., on spark-packages.org
• scikit-learn integration package
(“spark-sklearn”)
• Stanford CoreNLPintegration
(“spark-corenlp”)
Deep Learning libraries
• Deep Learning Pipelines
(“spark-deep-learning”)
• BigDL
• TensorFlowOnSpark
• Distributed training with Keras
(“dist-keras”)
Framework for integratingnon-“big data” or specialized ML libraries
1
Scaling
Workflow scalability
Same code on laptop with small data à cluster with big data
Big data scalability
Scale-out implementations
4x faster than
xgboost
Abuzaid et al. (2016) Chen & Guestrin (2016)
8x slower than
xgboost
1
Building workflows
Simple concepts: Transformers, Estimators, Models, Evaluators
• Unified, standardized APIs, including for algorithm parameters
Pipelines
• Repeatable workflows across Sparkdeployments and languages
• DataFrameAPIs and optimizations
• Automated model tuning
FeaturizationETL Training Evaluation
Tuning
2
Now what?
Where could or should it go?
2
Top items from the dev@ list…
• Continued speed & scalability
improvements
• Enhancements to core algorithms
• ML building blocks: linear algebra,
optimization
• Extensible and customizable APIs
Goals
Scale-out ML
Standard library
Extensible API
Theseare unofficialgoals based
on my experience+ discussions
on the dev@ mailing list.
2
Scale-out ML
General improvements
• DataFrame-based implementations
Targeted improvements
• Gradient-boosted trees: feature subsampling
• Linear models: vector-free L-BFGS for billions of features
• …
E.g., 10x scaling for
Connected Components
in GraphFrames
2
Standard library
General improvements
• NaN/null handling
• Instance weights
• Warm starts / initial models
• Multi-column feature transforms
Algorithm-specific improvements
• Trees: access tree structure from Python
• …
2
Extensible and customizable APIs
MLlib has ~50 algorithms.
CRAN lists 10,750 packages.
àUsers must be able to
modify algorithms or
write their own.
algorithms
usage
2
Spark Packages
340+ packages written for Spark
80+ packages for ML and Graphs
E.g.:
• GraphFrames (“graphframes”): DataFrame-based graphs
• Bisecting K-Means (“bisecting-kmeans”): now part of MLlib
• Stanford CoreNLP wrapper (“spark-corenlp”): UDFs for NLP
spark-packages.org
2
How to write a custom algorithm
You need to:
• Extendan API like Transformer, Estimator, Model, or Evaluator
MLlibprovides some help:
• UnaryTransformer
• DefaultParamsWritable/Readable
• defaultCopy
You get:
• Natural integration with DataFrames, Pipelines, modeltuning
2
Challenges in customization
Modify existing algorithms
with custom:
• Loss functions
• Optimizers
• Stopping criteria
• Callbacks
Implement new
algorithms without
worrying about:
• Boilerplate
• Python API
• ML building blocks
• Feature
attributes/metadata
2
Example of MLlib APIs
Deep Learning Pipelines for Apache Spark
spark-packages.org/package/databricks/spark-deep-learning
APIs are great for users:
• Plug & play algorithms
• Standard API
But challenges remain:
• Development requires expertise
• Community under development
2
My challenge to you
Can we fill out the long tail of ML use cases?
algorithms
usage
3
Resources
Spark Packages: spark-packages.org
• Plugin for building packages: https://github.com/databricks/sbt-spark-
package
• Tool for generating package templates:
https://github.com/databricks/spark-package-cmd-tool
Example packages:
• scikit-learn integration (“spark-sklearn”) – Python-only
• GraphFrames – Scala/Java/Python
Thank you!
Office hours today @ 3:50pm at Databricks booth
Twitter: @jkbatcmu

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment
 
Accelerating Data Science with Better Data Engineering on Databricks
Accelerating Data Science with Better Data Engineering on DatabricksAccelerating Data Science with Better Data Engineering on Databricks
Accelerating Data Science with Better Data Engineering on Databricks
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleMLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
 
Spark Summit San Francisco 2016 - Ali Ghodsi Keynote
Spark Summit San Francisco 2016 - Ali Ghodsi KeynoteSpark Summit San Francisco 2016 - Ali Ghodsi Keynote
Spark Summit San Francisco 2016 - Ali Ghodsi Keynote
 
Insights Without Tradeoffs: Using Structured Streaming
Insights Without Tradeoffs: Using Structured StreamingInsights Without Tradeoffs: Using Structured Streaming
Insights Without Tradeoffs: Using Structured Streaming
 
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesApache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Spark Summit EU talk by Reza Karimi
Spark Summit EU talk by Reza KarimiSpark Summit EU talk by Reza Karimi
Spark Summit EU talk by Reza Karimi
 
Spark ML with High Dimensional Labels Michael Zargham and Stefan Panayotov
Spark ML with High Dimensional Labels Michael Zargham and Stefan PanayotovSpark ML with High Dimensional Labels Michael Zargham and Stefan Panayotov
Spark ML with High Dimensional Labels Michael Zargham and Stefan Panayotov
 
Build, Scale, and Deploy Deep Learning Pipelines with Ease
Build, Scale, and Deploy Deep Learning Pipelines with EaseBuild, Scale, and Deploy Deep Learning Pipelines with Ease
Build, Scale, and Deploy Deep Learning Pipelines with Ease
 
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
 
Spark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos ErotocritouSpark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos Erotocritou
 
How Spark Fits into Baidu's Scale-(James Peng, Baidu)
How Spark Fits into Baidu's Scale-(James Peng, Baidu)How Spark Fits into Baidu's Scale-(James Peng, Baidu)
How Spark Fits into Baidu's Scale-(James Peng, Baidu)
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
 
A Tale of Three Tools: Kubernetes, Jsonnet, and Bazel
A Tale of Three Tools: Kubernetes, Jsonnet, and BazelA Tale of Three Tools: Kubernetes, Jsonnet, and Bazel
A Tale of Three Tools: Kubernetes, Jsonnet, and Bazel
 
Spark Summit EU talk by Mikhail Semeniuk Hollin Wilkins
Spark Summit EU talk by Mikhail Semeniuk Hollin WilkinsSpark Summit EU talk by Mikhail Semeniuk Hollin Wilkins
Spark Summit EU talk by Mikhail Semeniuk Hollin Wilkins
 

Semelhante a Apache Spark's MLlib's Past Trajectory and new Directions

2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3 2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3
Chester Chen
 
Advanced Analytics in Hadoop
Advanced Analytics in HadoopAdvanced Analytics in Hadoop
Advanced Analytics in Hadoop
AnalyticsWeek
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_Veriticals
Peyman Mohajerian
 
Apache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and ProductionApache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and Production
Databricks
 

Semelhante a Apache Spark's MLlib's Past Trajectory and new Directions (20)

Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
 
Integrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache SparkIntegrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache Spark
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache Spark
 
AI and Spark - IBM Community AI Day
AI and Spark - IBM Community AI DayAI and Spark - IBM Community AI Day
AI and Spark - IBM Community AI Day
 
What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3
 
2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3 2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
 
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 
An Insider’s Guide to Maximizing Spark SQL Performance
 An Insider’s Guide to Maximizing Spark SQL Performance An Insider’s Guide to Maximizing Spark SQL Performance
An Insider’s Guide to Maximizing Spark SQL Performance
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
 
Advanced Analytics in Hadoop
Advanced Analytics in HadoopAdvanced Analytics in Hadoop
Advanced Analytics in Hadoop
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_Veriticals
 
Apache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and ProductionApache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and Production
 
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SFTed Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
 
Apache Spark MLlib
Apache Spark MLlib Apache Spark MLlib
Apache Spark MLlib
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 

Mais de Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

Mais de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
VictoriaMetrics
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
masabamasaba
 

Último (20)

%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Harnessing ChatGPT - Elevating Productivity in Today's Agile Environment
Harnessing ChatGPT  - Elevating Productivity in Today's Agile EnvironmentHarnessing ChatGPT  - Elevating Productivity in Today's Agile Environment
Harnessing ChatGPT - Elevating Productivity in Today's Agile Environment
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 

Apache Spark's MLlib's Past Trajectory and new Directions

  • 1. Apache Spark MLlib's past trajectory and new directions Joseph K. Bradley Spark Summit 2017
  • 2. 2 About me Software engineer at Databricks Apache Spark committer & PMC member Ph.D. Carnegie Mellon in Machine Learning
  • 3. 3 TEAM About Databricks Started Spark project (now Apache Spark) at UC Berkeleyin 2009 33 PRODUCT Unified Analytics Platform MISSION Making Big Data Simple
  • 4. 4 UNIFIED ANALYTICS PLATFORM Try Apache Spark in Databricks! • Collaborative cloud environment • Free version (community edition) 44 DATABRICKS RUNTIME 3.0 • Apache Spark - optimized for the cloud • Caching and optimization layer - DBIO • Enterprise security - DBES Try for free today. databricks.com
  • 6. 6 3 ½ years ago until today People added some algorithms… Classification • Logisticregression & linear SVMs • L1, L2, or elastic net regularization • Decision trees • Random forests • Gradient-boosted trees • Naive Bayes • Multilayer perceptron • One-vs-rest • Streaming logisticregression Regression • Least squares • L1, L2, or elastic net regularization • Decision trees • Random forests • Gradient-boosted trees • Isotonicregression • Streaming least squares Feature extraction, transformation & selection • Binarizer • Bucketizer • Chi-Squared selection • CountVectorizer • Discrete cosine transform • ElementwiseProduct • Hashing term frequency • Inverse document frequency • MinMaxScaler • Ngram • Normalizer • VectorAssembler • VectorIndexer • VectorSlicer • Word2Vec Clustering • Gaussian mixture model • K-Means • Streaming K-Means • Latent Dirichlet Allocatio • Power Iteration Clusterin Recommendation • Alternating Least Squares (ALS) Frequent itemsets • FP-growth • Prefixspan Statistics • Pearson correlation • Spearman correlation • Online summarization • Chi-squared test • Kernel density estimation Linear Algebra • Local & distributed dense &sparse matrices • Matrix decompositions (PCA, SVD, QR)
  • 7. 7 Today Apache Spark is widely used for ML. • Many production use cases • 1000s of commits • 100s of contributors • 10,000s of users (on Databricks alone)
  • 8. 8 This talk What? How has MLlib developed, and what has it become? So what? What are the implications for the dev & user communities? Now what? Where could or should it go?
  • 9. 9 What? How has MLlib developed, and what has it become?
  • 10. 1 Let’s do some data analytics 50+ algorithms and featurizers Rapid development Shift from adding algorithms to improving algorithms & infrastructure Growing dev community
  • 11. 1
  • 12. 1
  • 13. 1
  • 14. 1 Major projects 0.8 master2.12.01.61.50.9 1.0 1.1 1.2 1.3 1.4 DataFrame-based API ML persistence Trees & ensembles GLMs SparkR Pipelines Featurizers Note: These are unofficial “project” labels basedon JIRA/PR activity.
  • 15. 1 So what? What are the implications for the dev & usercommunities?
  • 16. 1 Integrating ML with the Big Data world Seamless integration with SQL, DataFrames, Graphs, Streaming, both within MLlib… Data sources CSV, JSON,Parquet, … DataFrames Simple, scalable pre/post-processing Graphs Graph-based ML implementations Streaming ML models deployed with Streaming
  • 17. 1 Integrating ML with the Big Data world Seamless integration with SQL, DataFrames, Graphs, Streaming, both within MLlib…and outside MLlib E.g., on spark-packages.org • scikit-learn integration package (“spark-sklearn”) • Stanford CoreNLPintegration (“spark-corenlp”) Deep Learning libraries • Deep Learning Pipelines (“spark-deep-learning”) • BigDL • TensorFlowOnSpark • Distributed training with Keras (“dist-keras”) Framework for integratingnon-“big data” or specialized ML libraries
  • 18. 1 Scaling Workflow scalability Same code on laptop with small data à cluster with big data Big data scalability Scale-out implementations 4x faster than xgboost Abuzaid et al. (2016) Chen & Guestrin (2016) 8x slower than xgboost
  • 19. 1 Building workflows Simple concepts: Transformers, Estimators, Models, Evaluators • Unified, standardized APIs, including for algorithm parameters Pipelines • Repeatable workflows across Sparkdeployments and languages • DataFrameAPIs and optimizations • Automated model tuning FeaturizationETL Training Evaluation Tuning
  • 20. 2 Now what? Where could or should it go?
  • 21. 2 Top items from the dev@ list… • Continued speed & scalability improvements • Enhancements to core algorithms • ML building blocks: linear algebra, optimization • Extensible and customizable APIs Goals Scale-out ML Standard library Extensible API Theseare unofficialgoals based on my experience+ discussions on the dev@ mailing list.
  • 22. 2 Scale-out ML General improvements • DataFrame-based implementations Targeted improvements • Gradient-boosted trees: feature subsampling • Linear models: vector-free L-BFGS for billions of features • … E.g., 10x scaling for Connected Components in GraphFrames
  • 23. 2 Standard library General improvements • NaN/null handling • Instance weights • Warm starts / initial models • Multi-column feature transforms Algorithm-specific improvements • Trees: access tree structure from Python • …
  • 24. 2 Extensible and customizable APIs MLlib has ~50 algorithms. CRAN lists 10,750 packages. àUsers must be able to modify algorithms or write their own. algorithms usage
  • 25. 2 Spark Packages 340+ packages written for Spark 80+ packages for ML and Graphs E.g.: • GraphFrames (“graphframes”): DataFrame-based graphs • Bisecting K-Means (“bisecting-kmeans”): now part of MLlib • Stanford CoreNLP wrapper (“spark-corenlp”): UDFs for NLP spark-packages.org
  • 26. 2 How to write a custom algorithm You need to: • Extendan API like Transformer, Estimator, Model, or Evaluator MLlibprovides some help: • UnaryTransformer • DefaultParamsWritable/Readable • defaultCopy You get: • Natural integration with DataFrames, Pipelines, modeltuning
  • 27. 2 Challenges in customization Modify existing algorithms with custom: • Loss functions • Optimizers • Stopping criteria • Callbacks Implement new algorithms without worrying about: • Boilerplate • Python API • ML building blocks • Feature attributes/metadata
  • 28. 2 Example of MLlib APIs Deep Learning Pipelines for Apache Spark spark-packages.org/package/databricks/spark-deep-learning APIs are great for users: • Plug & play algorithms • Standard API But challenges remain: • Development requires expertise • Community under development
  • 29. 2 My challenge to you Can we fill out the long tail of ML use cases? algorithms usage
  • 30. 3 Resources Spark Packages: spark-packages.org • Plugin for building packages: https://github.com/databricks/sbt-spark- package • Tool for generating package templates: https://github.com/databricks/spark-package-cmd-tool Example packages: • scikit-learn integration (“spark-sklearn”) – Python-only • GraphFrames – Scala/Java/Python
  • 31. Thank you! Office hours today @ 3:50pm at Databricks booth Twitter: @jkbatcmu