SlideShare uma empresa Scribd logo
1 de 44
Distributed Machine and
Deep Learning at Scale with
Apache Ignite
Yuriy Babak
Yuriy Babak
Head of ML/DL framework
development at GridGain
Apache Ignite committer
2019 © GridGain Systems
Agenda
● Introduction of distributed ML/DL
● Small math background
● Apache Ignite ML overview
● External integrations
2019 © GridGain Systems
Apache Ignite
Distributed ML
with Apache Ignite
2019 © GridGain Systems
Why do we need distributed ML?
● Data preprocessing
○ Big amount of data on different machines
○ High costs of ETL
● Model training
○ We cannot store all data on single server
○ Data irregularity
● Inference
○ Intensive data flow
○ Data co-located inference
○ Network latency
2019 © GridGain Systems
Why do we need distributed ML?
● Data preprocessing
○ Big amount of data on different machines
○ High costs of ETL
● Model training
○ We cannot store all data on single server
○ Data irregularity
● Inference
○ Intensive data flow
○ Data co-located inference
○ Network latency
2019 © GridGain Systems
Why do we need distributed ML?
● Data preprocessing
○ Big amount of data on different machines
○ High costs of ETL
● Model training
○ We cannot store all data on single server
○ Data irregularity
● Inference
○ Intensive data flow
○ Data co-located inference
○ Network latency
2019 © GridGain Systems
Common approach
Local calculations Result aggregationModel
Model update
2019 © GridGain Systems
Ignite partition-based distributed dataset
Partition Data Dataset Context Dataset Data
Upstream Cache Context Cache On-Heap
Learning Env
On-Heap
Durable Stateless Durable Recoverable
double[][] x = …
double[] y = ...
double[][] x = …
double[] y = ...
Partition Based Dataset StructuresSource Data
2019 © GridGain Systems
Let`s do the math!
2019 © GridGain Systems
Example: Gradient descent
• Model: Linear regression
f(x)
x
2019 © GridGain Systems
Example: Gradient Descent
2019 © GridGain Systems
• Model: Linear regression
• Loss function:
2019 © GridGain Systems
Example: Gradient Descent
2019 © GridGain Systems
• Model: Linear regression
• Loss function: MSE
• Gradient:
2019 © GridGain Systems
Example: Gradient Descent
2019 © GridGain Systems
• Model: Linear regression
• Loss function:
• Gradient:
2019 © GridGain Systems
Example: Random Forest
• Model: Random forest
2019 © GridGain Systems
2019 © GridGain Systems
Example: Random Forest
• Model: Random forest
• Loss function: min Gini coefficient
2019 © GridGain Systems
2019 © GridGain Systems
Example: Random Forest
• Model: Random forest
• Objective function: min Gini coefficient for each sub-tree
• Gradient: doesn’t exist (what shall we do?)
2019 © GridGain Systems
2019 © GridGain Systems
Example: Random Forest
2019 © GridGain Systems
2019 © GridGain Systems
Example: Random Forest
• Model: Random forest
• Objective function: min Gini coefficient for each sub-tree
• Gradient: doesn’t exist
• What to do: sum histograms
2019 © GridGain Systems
=+
2019 © GridGain Systems
Example: Random Forest
• Model: Random forest
• Objective function: min Gini coefficient for each sub-tree
• Gradient: doesn’t exist
• What to do:
– One can sum histograms
– Summarizing histograms - commutative and associative operation
– So you can count the histograms using map-reduce!
2019 © GridGain Systems
2019 © GridGain Systems
Example: Random Forest
• Model: Random forest
• Objective function: min Gini coefficient for each sub-tree
• Gradient: doesn’t exist
• What to do: approximation using histogram partitioning
• Code: RandomForestTrainer, RandomForestClassifierTrainer
2019 © GridGain Systems
Distributed ML
with Apache Ignite
2019 © GridGain Systems
What does Apache Ignite support?
• Classical ML
– classification
– regression
– clustering
2019 © GridGain Systems
2019 © GridGain Systems
Ignite ML API
Vectorizer<Integer, BinaryObject, String, Double> vectorizer =
new BinaryObjectVectorizer<Integer>(
"feature1",
“feature2”
).labeled("label");
LogisticRegressionSGDTrainer trainer = new LogisticRegressionSGDTrainer();
LogisticRegressionModel mdl = trainer.fit(ignite, dataCache, vectorizer);
2019 © GridGain Systems
What does Apache Ignite support?
• Classical ML
– classification
– regression
– clustering
• Preprocessing
– binarization, imputing
– normalization, scaling
– string encoding
• Pipelines
• Фреймворк для ансамблей
• Интеграция с Tensorflow
– как источник данных
– как менеджмент кластера
2019 © GridGain Systems
2019 © GridGain Systems
Preprocessing and Pipelines
Vectorizer<Integer, Vector, Integer, Double> vectorizer
= new DummyVectorizer<Integer>(0, 3, 4, 5, 6, 8, 10).labeled(1);
PipelineMdl<Integer, Vector> mdl =
new Pipeline<Integer, Vector, Integer, Double>()
.addVectorizer(vectorizer)
.addPreprocessingTrainer(new EncoderTrainer<Integer, Vector>()
.withEncoderType(EncoderType.STRING_ENCODER)
.withEncodedFeature(1)
.withEncodedFeature(6))
.addPreprocessingTrainer(new ImputerTrainer<Integer, Vector>())
.addPreprocessingTrainer(new MinMaxScalerTrainer<Integer, Vector>())
.addPreprocessingTrainer(new NormalizationTrainer<Integer, Vector>()
.withP(1))
.addTrainer(new DecisionTreeClassificationTrainer(5, 0))
.fit(ignite, dataCache);
2019 © GridGain Systems
What does Apache Ignite support?
• Classical ML
– classification
– regression
– clustering
• Preprocessing
– binarization, imputing
– normalization, scaling
– string encoding
• Pipelines
• Ensemble Framework
2019 © GridGain Systems
2019 © GridGain Systems
Bagging, Boosting and Stacking
DatasetTrainer<LogisticRegressionModel, Double> trainer =
new LogisticRegressionSGDTrainer(...)...;
BaggedTrainer<Double> baggedTrainer = TrainerTransformers.makeBagged(trainer,
// ensemble size, subsample ration, feature vector size, features subspace dim
7, 0.7, 2, 2,
new onMajorityPredictionsAggregator());
2019 © GridGain Systems
What does Apache Ignite support?
• Classical ML
– classification
– regression
– clustering
• Preprocessing
– binarization, imputing
– normalization, scaling
– string encoding
• Pipelines
• Ensemble Framework
• Continuous learning
2019 © GridGain Systems
2019 © GridGain Systems
Online learning
SVMLinearClassificationTrainer trainer = new SVMLinearClassificationTrainer();
SVMLinearClassificationModel mdl1 = trainer.fit(ignite, dataCache1, vectorizer);
SVMLinearClassificationModel mdl2 = trainer.update(mdl1, ignite, dataCache2,
vectorizer);
2019 © GridGain Systems
What does Apache Ignite support?
• Classical ML
– classification
– regression
– clustering
• Preprocessing
– binarization, imputing
– normalization, scaling
– string encoding
• Pipelines
• Ensemble Framework
• Continuous learning
• SQL in-place inference
2019 © GridGain Systems
2019 © GridGain Systems
IMDB with built-in ML
IgniteModelStorageUtil.saveModel(ignite, model, “titanik_model_tree”);
QueryCursor<List<?>> cursor = cache.query(new SqlFieldsQuery("select " +
"survived as truth, " +
"predict('titanik_model_tree', pclass, age, sibsp, parch, fare, case
sex when 'male' then 1 else 0 end) as prediction " +
"from titanik_train"))
2019 © GridGain Systems
External integrations
2019 © GridGain Systems
Model import
2019 © GridGain Systems
Inference in Ignite ML
2019 © GridGain Systems
Request queue
Response queue
Inference
Service
Inference
Service
Inference
Gateway
2019 © GridGain Systems
Import / Inference API
AsyncModelBuilder mdlBuilder = new IgniteDistributedModelBuilder(ignite, 4, 4);
XGModelParser parser = new XGModelParser();
try (Model<NamedVector, Future<Double>> mdl = mdlBuilder.build(reader, parser);
...
double prediction = mdl.predict(testObj).get();
}
2019 © GridGain Systems
TensorFlow on Apache Ignite
• Ignite Dataset
• IGFS Plugin
• Distributed Training
• More info here
#UnifiedAnalytics #SparkAISummit
>>> import tensorflow as tf
>>> from tensorflow.contrib.ignite import IgniteDataset
>>>
>>> dataset = IgniteDataset(cache_name="SQL_PUBLIC_KITTEN_CACHE")
>>> iterator = dataset.make_one_shot_iterator()
>>> next_obj = iterator.get_next()
>>>
>>> with tf.Session() as sess:
>>> for _ in range(3):
>>> print(sess.run(next_obj))
{'key': 1, 'val': {'NAME': b'WARM KITTY'}}
{'key': 2, 'val': {'NAME': b'SOFT KITTY'}}
{'key': 3, 'val': {'NAME': b'LITTLE BALL OF FUR'}}
2019 © GridGain Systems
Apache Ignite with GridGain ML Python API
GridGain ML client library provides user applications the ability to work
with GridGain ML functionality using Py4J as an integration mechanism.
If you want to use ggml in your project, you may install it from PyPI:
$ pip install ggml
2019 © GridGain Systems
Apache Ignite with GridGain ML Python API
GridGain ML client library provides user applications the ability to work
with GridGain ML functionality using Py4J as an integration mechanism.
If you want to use ggml in your project, you may install it from PyPI:
$ pip install ggml
NB: available only for Apache Ignite master and for GG 8.8
2019 © GridGain Systems
Apache Ignite with GridGain ML Python API
from ggml.regression import LinearRegressionTrainer
trainer = LinearRegressionTrainer()
model = trainer.fit(x_train, y_train)
r2_score(y_test, model.predict(x_test))
2019 © GridGain Systems
Apache Ignite ML Tutorial
https://github.com/apache/ignite/
org.apache.ignite.examples.ml.tutorial
2019 © GridGain Systems
2019 © GridGain Systems
Distributed Machine and Deep Learning
at Scale with Apache Ignite
Links:
• http://ignite.apache.org/
• https://medium.com/tensorflow/tensorflow-on-apache-ignite-99f1fc60efeb
• https://github.com/gridgain/ml-python-api
• http://bit.ly/DataNativesOslo
Email:
● user@ignite.apache.org
● dev@ignite.apache.org
● ybabak@gridgain.com
2019 © GridGain Systems
Preprocessing and Pipelines

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Tape Is a Four Letter Word: Back Up to the Cloud in Under an Hour (STG201) - ...
Tape Is a Four Letter Word: Back Up to the Cloud in Under an Hour (STG201) - ...Tape Is a Four Letter Word: Back Up to the Cloud in Under an Hour (STG201) - ...
Tape Is a Four Letter Word: Back Up to the Cloud in Under an Hour (STG201) - ...
 
Google Cloud Platform Tutorial | GCP Fundamentals | Edureka
Google Cloud Platform Tutorial | GCP Fundamentals | EdurekaGoogle Cloud Platform Tutorial | GCP Fundamentals | Edureka
Google Cloud Platform Tutorial | GCP Fundamentals | Edureka
 
Data Science on Google Cloud Platform
Data Science on Google Cloud PlatformData Science on Google Cloud Platform
Data Science on Google Cloud Platform
 
Deep learning acceleration with Amazon Elastic Inference
Deep learning acceleration with Amazon Elastic Inference  Deep learning acceleration with Amazon Elastic Inference
Deep learning acceleration with Amazon Elastic Inference
 
BDW16 London - William Vambenepe, Google - 3rd Generation Data Platform
BDW16 London - William Vambenepe, Google - 3rd Generation Data PlatformBDW16 London - William Vambenepe, Google - 3rd Generation Data Platform
BDW16 London - William Vambenepe, Google - 3rd Generation Data Platform
 
AWS CodeStar 및 Cloud9을 통한 서버리스(Serverless) 앱 개발 길잡이 - 윤석찬 (AWS 테크에반젤리스트)
AWS CodeStar 및 Cloud9을 통한 서버리스(Serverless) 앱 개발 길잡이 - 윤석찬 (AWS 테크에반젤리스트)AWS CodeStar 및 Cloud9을 통한 서버리스(Serverless) 앱 개발 길잡이 - 윤석찬 (AWS 테크에반젤리스트)
AWS CodeStar 및 Cloud9을 통한 서버리스(Serverless) 앱 개발 길잡이 - 윤석찬 (AWS 테크에반젤리스트)
 
Special Purpose Quantum Annealing Quantum Computer v1.0
Special Purpose Quantum Annealing Quantum Computer v1.0Special Purpose Quantum Annealing Quantum Computer v1.0
Special Purpose Quantum Annealing Quantum Computer v1.0
 
Amazon Aurora 深度探討
Amazon Aurora 深度探討Amazon Aurora 深度探討
Amazon Aurora 深度探討
 
Monitor the World: Meaningful Metrics for Containerized Apps and Clusters (CO...
Monitor the World: Meaningful Metrics for Containerized Apps and Clusters (CO...Monitor the World: Meaningful Metrics for Containerized Apps and Clusters (CO...
Monitor the World: Meaningful Metrics for Containerized Apps and Clusters (CO...
 
Scaling up data science applications
Scaling up data science applicationsScaling up data science applications
Scaling up data science applications
 
Easy and Efficient Batch Computing on AWS
Easy and Efficient Batch Computing on AWSEasy and Efficient Batch Computing on AWS
Easy and Efficient Batch Computing on AWS
 
(CMP202) Engineering Simulation and Analysis in the Cloud
(CMP202) Engineering Simulation and Analysis in the Cloud(CMP202) Engineering Simulation and Analysis in the Cloud
(CMP202) Engineering Simulation and Analysis in the Cloud
 
Optimize Your Amazon ECS Environment
Optimize Your Amazon ECS EnvironmentOptimize Your Amazon ECS Environment
Optimize Your Amazon ECS Environment
 
Engineering Genomic Big Data Analytics at A Global Scale
Engineering Genomic Big Data Analytics at A Global ScaleEngineering Genomic Big Data Analytics at A Global Scale
Engineering Genomic Big Data Analytics at A Global Scale
 
Titan and Cassandra at WellAware
Titan and Cassandra at WellAwareTitan and Cassandra at WellAware
Titan and Cassandra at WellAware
 
Top 13 best security practices for Azure
Top 13 best security practices for AzureTop 13 best security practices for Azure
Top 13 best security practices for Azure
 
Working with Scalable Machine Learning Algorithms in Amazon SageMaker - AWS O...
Working with Scalable Machine Learning Algorithms in Amazon SageMaker - AWS O...Working with Scalable Machine Learning Algorithms in Amazon SageMaker - AWS O...
Working with Scalable Machine Learning Algorithms in Amazon SageMaker - AWS O...
 
PAC 2019 virtual Stefano Doni
PAC 2019 virtual Stefano Doni   PAC 2019 virtual Stefano Doni
PAC 2019 virtual Stefano Doni
 
Intro to High Performance Computing in the AWS Cloud
Intro to High Performance Computing in the AWS CloudIntro to High Performance Computing in the AWS Cloud
Intro to High Performance Computing in the AWS Cloud
 
Google Cloud Platform Empowers TensorFlow and Machine Learning
Google Cloud Platform Empowers TensorFlow and Machine LearningGoogle Cloud Platform Empowers TensorFlow and Machine Learning
Google Cloud Platform Empowers TensorFlow and Machine Learning
 

Semelhante a Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with Apache Ignite" - Yuriy Babak

Semelhante a Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with Apache Ignite" - Yuriy Babak (20)

Continuous Machine and Deep Learning with Apache Ignite
Continuous Machine and Deep Learning with Apache IgniteContinuous Machine and Deep Learning with Apache Ignite
Continuous Machine and Deep Learning with Apache Ignite
 
Python tutorial for ML
Python tutorial for MLPython tutorial for ML
Python tutorial for ML
 
In-Memory Computing Essentials for Software Engineers
In-Memory Computing Essentials for Software EngineersIn-Memory Computing Essentials for Software Engineers
In-Memory Computing Essentials for Software Engineers
 
Machine Learning for Capacity Management
 Machine Learning for Capacity Management Machine Learning for Capacity Management
Machine Learning for Capacity Management
 
Deploying MariaDB for HA on Google Cloud Platform
Deploying MariaDB for HA on Google Cloud PlatformDeploying MariaDB for HA on Google Cloud Platform
Deploying MariaDB for HA on Google Cloud Platform
 
QCon 2018 | Gimel | PayPal's Analytic Platform
QCon 2018 | Gimel | PayPal's Analytic PlatformQCon 2018 | Gimel | PayPal's Analytic Platform
QCon 2018 | Gimel | PayPal's Analytic Platform
 
Ippon Technologies: Multi-criteria queries on a Cassandra application
Ippon Technologies: Multi-criteria queries on a Cassandra applicationIppon Technologies: Multi-criteria queries on a Cassandra application
Ippon Technologies: Multi-criteria queries on a Cassandra application
 
Multi criteria queries on a cassandra application
Multi criteria queries on a cassandra applicationMulti criteria queries on a cassandra application
Multi criteria queries on a cassandra application
 
DevOps and Machine Learning (Geekwire Cloud Tech Summit)
DevOps and Machine Learning (Geekwire Cloud Tech Summit)DevOps and Machine Learning (Geekwire Cloud Tech Summit)
DevOps and Machine Learning (Geekwire Cloud Tech Summit)
 
MongoDB World 2019: Unleash the Power of the MongoDB Aggregation Framework
MongoDB World 2019: Unleash the Power of the MongoDB Aggregation FrameworkMongoDB World 2019: Unleash the Power of the MongoDB Aggregation Framework
MongoDB World 2019: Unleash the Power of the MongoDB Aggregation Framework
 
Distributed Model Training using MXNet with Horovod
Distributed Model Training using MXNet with HorovodDistributed Model Training using MXNet with Horovod
Distributed Model Training using MXNet with Horovod
 
Comparing three data ingestion approaches where Apache Kafka integrates with ...
Comparing three data ingestion approaches where Apache Kafka integrates with ...Comparing three data ingestion approaches where Apache Kafka integrates with ...
Comparing three data ingestion approaches where Apache Kafka integrates with ...
 
Using Grid Technologies in the Cloud for High Scalability
Using Grid Technologies in the Cloud for High ScalabilityUsing Grid Technologies in the Cloud for High Scalability
Using Grid Technologies in the Cloud for High Scalability
 
Spark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos ErotocritouSpark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos Erotocritou
 
IRJET- A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET-  	  A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...IRJET-  	  A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
IRJET- A Review on K-Means++ Clustering Algorithm and Cloud Computing wit...
 
TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform
 
KNIME Data Science Learnathon: From Raw Data To Deployment - Paris - November...
KNIME Data Science Learnathon: From Raw Data To Deployment - Paris - November...KNIME Data Science Learnathon: From Raw Data To Deployment - Paris - November...
KNIME Data Science Learnathon: From Raw Data To Deployment - Paris - November...
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
 
Very large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDLVery large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDL
 

Mais de Dataconomy Media

Mais de Dataconomy Media (20)

Data Natives Paris v 10.0 | "Blockchain in Healthcare" - Lea Dias & David An...
Data Natives Paris v 10.0 | "Blockchain in Healthcare" - Lea Dias & 	David An...Data Natives Paris v 10.0 | "Blockchain in Healthcare" - Lea Dias & 	David An...
Data Natives Paris v 10.0 | "Blockchain in Healthcare" - Lea Dias & David An...
 
Data Natives Frankfurt v 11.0 | "Competitive advantages with knowledge graphs...
Data Natives Frankfurt v 11.0 | "Competitive advantages with knowledge graphs...Data Natives Frankfurt v 11.0 | "Competitive advantages with knowledge graphs...
Data Natives Frankfurt v 11.0 | "Competitive advantages with knowledge graphs...
 
Data Natives Frankfurt v 11.0 | "Can we be responsible for misuse of data & a...
Data Natives Frankfurt v 11.0 | "Can we be responsible for misuse of data & a...Data Natives Frankfurt v 11.0 | "Can we be responsible for misuse of data & a...
Data Natives Frankfurt v 11.0 | "Can we be responsible for misuse of data & a...
 
Data Natives Munich v 12.0 | "How to be more productive with Autonomous Data ...
Data Natives Munich v 12.0 | "How to be more productive with Autonomous Data ...Data Natives Munich v 12.0 | "How to be more productive with Autonomous Data ...
Data Natives Munich v 12.0 | "How to be more productive with Autonomous Data ...
 
Data Natives meets DataRobot | "Build and deploy an anti-money laundering mo...
Data Natives meets DataRobot |  "Build and deploy an anti-money laundering mo...Data Natives meets DataRobot |  "Build and deploy an anti-money laundering mo...
Data Natives meets DataRobot | "Build and deploy an anti-money laundering mo...
 
Data Natives Munich v 12.0 | "Political Data Science: A tale of Fake News, So...
Data Natives Munich v 12.0 | "Political Data Science: A tale of Fake News, So...Data Natives Munich v 12.0 | "Political Data Science: A tale of Fake News, So...
Data Natives Munich v 12.0 | "Political Data Science: A tale of Fake News, So...
 
Data Natives Vienna v 7.0 | "Building Kubernetes Operators with KUDO for Dat...
Data Natives Vienna v 7.0  | "Building Kubernetes Operators with KUDO for Dat...Data Natives Vienna v 7.0  | "Building Kubernetes Operators with KUDO for Dat...
Data Natives Vienna v 7.0 | "Building Kubernetes Operators with KUDO for Dat...
 
Data Natives Vienna v 7.0 | "The Ingredients of Data Innovation" - Robbert de...
Data Natives Vienna v 7.0 | "The Ingredients of Data Innovation" - Robbert de...Data Natives Vienna v 7.0 | "The Ingredients of Data Innovation" - Robbert de...
Data Natives Vienna v 7.0 | "The Ingredients of Data Innovation" - Robbert de...
 
Data Natives Cologne v 4.0 | "The Data Lorax: Planting the Seeds of Fairness...
Data Natives Cologne v 4.0  | "The Data Lorax: Planting the Seeds of Fairness...Data Natives Cologne v 4.0  | "The Data Lorax: Planting the Seeds of Fairness...
Data Natives Cologne v 4.0 | "The Data Lorax: Planting the Seeds of Fairness...
 
Data Natives Cologne v 4.0 | "How People Analytics Can Reveal the Hidden Aspe...
Data Natives Cologne v 4.0 | "How People Analytics Can Reveal the Hidden Aspe...Data Natives Cologne v 4.0 | "How People Analytics Can Reveal the Hidden Aspe...
Data Natives Cologne v 4.0 | "How People Analytics Can Reveal the Hidden Aspe...
 
Data Natives Amsterdam v 9.0 | "Ten Little Servers: A Story of no Downtime" -...
Data Natives Amsterdam v 9.0 | "Ten Little Servers: A Story of no Downtime" -...Data Natives Amsterdam v 9.0 | "Ten Little Servers: A Story of no Downtime" -...
Data Natives Amsterdam v 9.0 | "Ten Little Servers: A Story of no Downtime" -...
 
Data Natives Amsterdam v 9.0 | "Point in Time Labeling at Scale" - Timothy Th...
Data Natives Amsterdam v 9.0 | "Point in Time Labeling at Scale" - Timothy Th...Data Natives Amsterdam v 9.0 | "Point in Time Labeling at Scale" - Timothy Th...
Data Natives Amsterdam v 9.0 | "Point in Time Labeling at Scale" - Timothy Th...
 
Data Natives Hamburg v 6.0 | "Interpersonal behavior: observing Alex to under...
Data Natives Hamburg v 6.0 | "Interpersonal behavior: observing Alex to under...Data Natives Hamburg v 6.0 | "Interpersonal behavior: observing Alex to under...
Data Natives Hamburg v 6.0 | "Interpersonal behavior: observing Alex to under...
 
Data Natives Hamburg v 6.0 | "About Surfing, Failing & Scaling" - Florian Sch...
Data Natives Hamburg v 6.0 | "About Surfing, Failing & Scaling" - Florian Sch...Data Natives Hamburg v 6.0 | "About Surfing, Failing & Scaling" - Florian Sch...
Data Natives Hamburg v 6.0 | "About Surfing, Failing & Scaling" - Florian Sch...
 
Data NativesBerlin v 20.0 | "Serving A/B experimentation platform end-to-end"...
Data NativesBerlin v 20.0 | "Serving A/B experimentation platform end-to-end"...Data NativesBerlin v 20.0 | "Serving A/B experimentation platform end-to-end"...
Data NativesBerlin v 20.0 | "Serving A/B experimentation platform end-to-end"...
 
Data Natives Berlin v 20.0 | "Ten Little Servers: A Story of no Downtime" - A...
Data Natives Berlin v 20.0 | "Ten Little Servers: A Story of no Downtime" - A...Data Natives Berlin v 20.0 | "Ten Little Servers: A Story of no Downtime" - A...
Data Natives Berlin v 20.0 | "Ten Little Servers: A Story of no Downtime" - A...
 
Big Data Frankfurt meets Thinkport | "The Cloud as a Driver of Innovation" - ...
Big Data Frankfurt meets Thinkport | "The Cloud as a Driver of Innovation" - ...Big Data Frankfurt meets Thinkport | "The Cloud as a Driver of Innovation" - ...
Big Data Frankfurt meets Thinkport | "The Cloud as a Driver of Innovation" - ...
 
Thinkport meets Frankfurt | "Financial Time Series Analysis using Wavelets" -...
Thinkport meets Frankfurt | "Financial Time Series Analysis using Wavelets" -...Thinkport meets Frankfurt | "Financial Time Series Analysis using Wavelets" -...
Thinkport meets Frankfurt | "Financial Time Series Analysis using Wavelets" -...
 
Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...
Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...
Big Data Helsinki v 3 | "Federated Learning and Privacy-preserving AI" - Oguz...
 
Big Data Helsinki v 3 | "What you should know about PSD2 APIs?" - Joonas Tomperi
Big Data Helsinki v 3 | "What you should know about PSD2 APIs?" - Joonas TomperiBig Data Helsinki v 3 | "What you should know about PSD2 APIs?" - Joonas Tomperi
Big Data Helsinki v 3 | "What you should know about PSD2 APIs?" - Joonas Tomperi
 

Último

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Último (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with Apache Ignite" - Yuriy Babak

  • 1. Distributed Machine and Deep Learning at Scale with Apache Ignite Yuriy Babak
  • 2. Yuriy Babak Head of ML/DL framework development at GridGain Apache Ignite committer
  • 3. 2019 © GridGain Systems Agenda ● Introduction of distributed ML/DL ● Small math background ● Apache Ignite ML overview ● External integrations
  • 4. 2019 © GridGain Systems Apache Ignite
  • 6. 2019 © GridGain Systems Why do we need distributed ML? ● Data preprocessing ○ Big amount of data on different machines ○ High costs of ETL ● Model training ○ We cannot store all data on single server ○ Data irregularity ● Inference ○ Intensive data flow ○ Data co-located inference ○ Network latency
  • 7. 2019 © GridGain Systems Why do we need distributed ML? ● Data preprocessing ○ Big amount of data on different machines ○ High costs of ETL ● Model training ○ We cannot store all data on single server ○ Data irregularity ● Inference ○ Intensive data flow ○ Data co-located inference ○ Network latency
  • 8. 2019 © GridGain Systems Why do we need distributed ML? ● Data preprocessing ○ Big amount of data on different machines ○ High costs of ETL ● Model training ○ We cannot store all data on single server ○ Data irregularity ● Inference ○ Intensive data flow ○ Data co-located inference ○ Network latency
  • 9. 2019 © GridGain Systems Common approach Local calculations Result aggregationModel Model update
  • 10. 2019 © GridGain Systems Ignite partition-based distributed dataset Partition Data Dataset Context Dataset Data Upstream Cache Context Cache On-Heap Learning Env On-Heap Durable Stateless Durable Recoverable double[][] x = … double[] y = ... double[][] x = … double[] y = ... Partition Based Dataset StructuresSource Data
  • 11. 2019 © GridGain Systems Let`s do the math!
  • 12. 2019 © GridGain Systems Example: Gradient descent • Model: Linear regression f(x) x
  • 13. 2019 © GridGain Systems Example: Gradient Descent 2019 © GridGain Systems • Model: Linear regression • Loss function:
  • 14. 2019 © GridGain Systems Example: Gradient Descent 2019 © GridGain Systems • Model: Linear regression • Loss function: MSE • Gradient:
  • 15. 2019 © GridGain Systems Example: Gradient Descent 2019 © GridGain Systems • Model: Linear regression • Loss function: • Gradient:
  • 16. 2019 © GridGain Systems Example: Random Forest • Model: Random forest 2019 © GridGain Systems
  • 17. 2019 © GridGain Systems Example: Random Forest • Model: Random forest • Loss function: min Gini coefficient 2019 © GridGain Systems
  • 18. 2019 © GridGain Systems Example: Random Forest • Model: Random forest • Objective function: min Gini coefficient for each sub-tree • Gradient: doesn’t exist (what shall we do?) 2019 © GridGain Systems
  • 19. 2019 © GridGain Systems Example: Random Forest 2019 © GridGain Systems
  • 20. 2019 © GridGain Systems Example: Random Forest • Model: Random forest • Objective function: min Gini coefficient for each sub-tree • Gradient: doesn’t exist • What to do: sum histograms 2019 © GridGain Systems =+
  • 21. 2019 © GridGain Systems Example: Random Forest • Model: Random forest • Objective function: min Gini coefficient for each sub-tree • Gradient: doesn’t exist • What to do: – One can sum histograms – Summarizing histograms - commutative and associative operation – So you can count the histograms using map-reduce! 2019 © GridGain Systems
  • 22. 2019 © GridGain Systems Example: Random Forest • Model: Random forest • Objective function: min Gini coefficient for each sub-tree • Gradient: doesn’t exist • What to do: approximation using histogram partitioning • Code: RandomForestTrainer, RandomForestClassifierTrainer 2019 © GridGain Systems
  • 24. 2019 © GridGain Systems What does Apache Ignite support? • Classical ML – classification – regression – clustering 2019 © GridGain Systems
  • 25. 2019 © GridGain Systems Ignite ML API Vectorizer<Integer, BinaryObject, String, Double> vectorizer = new BinaryObjectVectorizer<Integer>( "feature1", “feature2” ).labeled("label"); LogisticRegressionSGDTrainer trainer = new LogisticRegressionSGDTrainer(); LogisticRegressionModel mdl = trainer.fit(ignite, dataCache, vectorizer);
  • 26. 2019 © GridGain Systems What does Apache Ignite support? • Classical ML – classification – regression – clustering • Preprocessing – binarization, imputing – normalization, scaling – string encoding • Pipelines • Фреймворк для ансамблей • Интеграция с Tensorflow – как источник данных – как менеджмент кластера 2019 © GridGain Systems
  • 27. 2019 © GridGain Systems Preprocessing and Pipelines Vectorizer<Integer, Vector, Integer, Double> vectorizer = new DummyVectorizer<Integer>(0, 3, 4, 5, 6, 8, 10).labeled(1); PipelineMdl<Integer, Vector> mdl = new Pipeline<Integer, Vector, Integer, Double>() .addVectorizer(vectorizer) .addPreprocessingTrainer(new EncoderTrainer<Integer, Vector>() .withEncoderType(EncoderType.STRING_ENCODER) .withEncodedFeature(1) .withEncodedFeature(6)) .addPreprocessingTrainer(new ImputerTrainer<Integer, Vector>()) .addPreprocessingTrainer(new MinMaxScalerTrainer<Integer, Vector>()) .addPreprocessingTrainer(new NormalizationTrainer<Integer, Vector>() .withP(1)) .addTrainer(new DecisionTreeClassificationTrainer(5, 0)) .fit(ignite, dataCache);
  • 28. 2019 © GridGain Systems What does Apache Ignite support? • Classical ML – classification – regression – clustering • Preprocessing – binarization, imputing – normalization, scaling – string encoding • Pipelines • Ensemble Framework 2019 © GridGain Systems
  • 29. 2019 © GridGain Systems Bagging, Boosting and Stacking DatasetTrainer<LogisticRegressionModel, Double> trainer = new LogisticRegressionSGDTrainer(...)...; BaggedTrainer<Double> baggedTrainer = TrainerTransformers.makeBagged(trainer, // ensemble size, subsample ration, feature vector size, features subspace dim 7, 0.7, 2, 2, new onMajorityPredictionsAggregator());
  • 30. 2019 © GridGain Systems What does Apache Ignite support? • Classical ML – classification – regression – clustering • Preprocessing – binarization, imputing – normalization, scaling – string encoding • Pipelines • Ensemble Framework • Continuous learning 2019 © GridGain Systems
  • 31. 2019 © GridGain Systems Online learning SVMLinearClassificationTrainer trainer = new SVMLinearClassificationTrainer(); SVMLinearClassificationModel mdl1 = trainer.fit(ignite, dataCache1, vectorizer); SVMLinearClassificationModel mdl2 = trainer.update(mdl1, ignite, dataCache2, vectorizer);
  • 32. 2019 © GridGain Systems What does Apache Ignite support? • Classical ML – classification – regression – clustering • Preprocessing – binarization, imputing – normalization, scaling – string encoding • Pipelines • Ensemble Framework • Continuous learning • SQL in-place inference 2019 © GridGain Systems
  • 33. 2019 © GridGain Systems IMDB with built-in ML IgniteModelStorageUtil.saveModel(ignite, model, “titanik_model_tree”); QueryCursor<List<?>> cursor = cache.query(new SqlFieldsQuery("select " + "survived as truth, " + "predict('titanik_model_tree', pclass, age, sibsp, parch, fare, case sex when 'male' then 1 else 0 end) as prediction " + "from titanik_train")) 2019 © GridGain Systems
  • 35. 2019 © GridGain Systems Model import
  • 36. 2019 © GridGain Systems Inference in Ignite ML 2019 © GridGain Systems Request queue Response queue Inference Service Inference Service Inference Gateway
  • 37. 2019 © GridGain Systems Import / Inference API AsyncModelBuilder mdlBuilder = new IgniteDistributedModelBuilder(ignite, 4, 4); XGModelParser parser = new XGModelParser(); try (Model<NamedVector, Future<Double>> mdl = mdlBuilder.build(reader, parser); ... double prediction = mdl.predict(testObj).get(); }
  • 38. 2019 © GridGain Systems TensorFlow on Apache Ignite • Ignite Dataset • IGFS Plugin • Distributed Training • More info here #UnifiedAnalytics #SparkAISummit >>> import tensorflow as tf >>> from tensorflow.contrib.ignite import IgniteDataset >>> >>> dataset = IgniteDataset(cache_name="SQL_PUBLIC_KITTEN_CACHE") >>> iterator = dataset.make_one_shot_iterator() >>> next_obj = iterator.get_next() >>> >>> with tf.Session() as sess: >>> for _ in range(3): >>> print(sess.run(next_obj)) {'key': 1, 'val': {'NAME': b'WARM KITTY'}} {'key': 2, 'val': {'NAME': b'SOFT KITTY'}} {'key': 3, 'val': {'NAME': b'LITTLE BALL OF FUR'}}
  • 39. 2019 © GridGain Systems Apache Ignite with GridGain ML Python API GridGain ML client library provides user applications the ability to work with GridGain ML functionality using Py4J as an integration mechanism. If you want to use ggml in your project, you may install it from PyPI: $ pip install ggml
  • 40. 2019 © GridGain Systems Apache Ignite with GridGain ML Python API GridGain ML client library provides user applications the ability to work with GridGain ML functionality using Py4J as an integration mechanism. If you want to use ggml in your project, you may install it from PyPI: $ pip install ggml NB: available only for Apache Ignite master and for GG 8.8
  • 41. 2019 © GridGain Systems Apache Ignite with GridGain ML Python API from ggml.regression import LinearRegressionTrainer trainer = LinearRegressionTrainer() model = trainer.fit(x_train, y_train) r2_score(y_test, model.predict(x_test))
  • 42. 2019 © GridGain Systems Apache Ignite ML Tutorial https://github.com/apache/ignite/ org.apache.ignite.examples.ml.tutorial 2019 © GridGain Systems
  • 43. 2019 © GridGain Systems Distributed Machine and Deep Learning at Scale with Apache Ignite Links: • http://ignite.apache.org/ • https://medium.com/tensorflow/tensorflow-on-apache-ignite-99f1fc60efeb • https://github.com/gridgain/ml-python-api • http://bit.ly/DataNativesOslo Email: ● user@ignite.apache.org ● dev@ignite.apache.org ● ybabak@gridgain.com
  • 44. 2019 © GridGain Systems Preprocessing and Pipelines