Mais conteúdo relacionado Semelhante a Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with Apache Ignite" - Yuriy Babak (20) Mais de Dataconomy Media (20) Big Data Helsinki v 3 | "Distributed Machine and Deep Learning at Scale with Apache Ignite" - Yuriy Babak3. 2019 © GridGain Systems
Agenda
● Introduction of distributed ML/DL
● Small math background
● Apache Ignite ML overview
● External integrations
6. 2019 © GridGain Systems
Why do we need distributed ML?
● Data preprocessing
○ Big amount of data on different machines
○ High costs of ETL
● Model training
○ We cannot store all data on single server
○ Data irregularity
● Inference
○ Intensive data flow
○ Data co-located inference
○ Network latency
7. 2019 © GridGain Systems
Why do we need distributed ML?
● Data preprocessing
○ Big amount of data on different machines
○ High costs of ETL
● Model training
○ We cannot store all data on single server
○ Data irregularity
● Inference
○ Intensive data flow
○ Data co-located inference
○ Network latency
8. 2019 © GridGain Systems
Why do we need distributed ML?
● Data preprocessing
○ Big amount of data on different machines
○ High costs of ETL
● Model training
○ We cannot store all data on single server
○ Data irregularity
● Inference
○ Intensive data flow
○ Data co-located inference
○ Network latency
9. 2019 © GridGain Systems
Common approach
Local calculations Result aggregationModel
Model update
10. 2019 © GridGain Systems
Ignite partition-based distributed dataset
Partition Data Dataset Context Dataset Data
Upstream Cache Context Cache On-Heap
Learning Env
On-Heap
Durable Stateless Durable Recoverable
double[][] x = …
double[] y = ...
double[][] x = …
double[] y = ...
Partition Based Dataset StructuresSource Data
12. 2019 © GridGain Systems
Example: Gradient descent
• Model: Linear regression
f(x)
x
13. 2019 © GridGain Systems
Example: Gradient Descent
2019 © GridGain Systems
• Model: Linear regression
• Loss function:
14. 2019 © GridGain Systems
Example: Gradient Descent
2019 © GridGain Systems
• Model: Linear regression
• Loss function: MSE
• Gradient:
15. 2019 © GridGain Systems
Example: Gradient Descent
2019 © GridGain Systems
• Model: Linear regression
• Loss function:
• Gradient:
16. 2019 © GridGain Systems
Example: Random Forest
• Model: Random forest
2019 © GridGain Systems
17. 2019 © GridGain Systems
Example: Random Forest
• Model: Random forest
• Loss function: min Gini coefficient
2019 © GridGain Systems
18. 2019 © GridGain Systems
Example: Random Forest
• Model: Random forest
• Objective function: min Gini coefficient for each sub-tree
• Gradient: doesn’t exist (what shall we do?)
2019 © GridGain Systems
20. 2019 © GridGain Systems
Example: Random Forest
• Model: Random forest
• Objective function: min Gini coefficient for each sub-tree
• Gradient: doesn’t exist
• What to do: sum histograms
2019 © GridGain Systems
=+
21. 2019 © GridGain Systems
Example: Random Forest
• Model: Random forest
• Objective function: min Gini coefficient for each sub-tree
• Gradient: doesn’t exist
• What to do:
– One can sum histograms
– Summarizing histograms - commutative and associative operation
– So you can count the histograms using map-reduce!
2019 © GridGain Systems
22. 2019 © GridGain Systems
Example: Random Forest
• Model: Random forest
• Objective function: min Gini coefficient for each sub-tree
• Gradient: doesn’t exist
• What to do: approximation using histogram partitioning
• Code: RandomForestTrainer, RandomForestClassifierTrainer
2019 © GridGain Systems
24. 2019 © GridGain Systems
What does Apache Ignite support?
• Classical ML
– classification
– regression
– clustering
2019 © GridGain Systems
25. 2019 © GridGain Systems
Ignite ML API
Vectorizer<Integer, BinaryObject, String, Double> vectorizer =
new BinaryObjectVectorizer<Integer>(
"feature1",
“feature2”
).labeled("label");
LogisticRegressionSGDTrainer trainer = new LogisticRegressionSGDTrainer();
LogisticRegressionModel mdl = trainer.fit(ignite, dataCache, vectorizer);
26. 2019 © GridGain Systems
What does Apache Ignite support?
• Classical ML
– classification
– regression
– clustering
• Preprocessing
– binarization, imputing
– normalization, scaling
– string encoding
• Pipelines
• Фреймворк для ансамблей
• Интеграция с Tensorflow
– как источник данных
– как менеджмент кластера
2019 © GridGain Systems
27. 2019 © GridGain Systems
Preprocessing and Pipelines
Vectorizer<Integer, Vector, Integer, Double> vectorizer
= new DummyVectorizer<Integer>(0, 3, 4, 5, 6, 8, 10).labeled(1);
PipelineMdl<Integer, Vector> mdl =
new Pipeline<Integer, Vector, Integer, Double>()
.addVectorizer(vectorizer)
.addPreprocessingTrainer(new EncoderTrainer<Integer, Vector>()
.withEncoderType(EncoderType.STRING_ENCODER)
.withEncodedFeature(1)
.withEncodedFeature(6))
.addPreprocessingTrainer(new ImputerTrainer<Integer, Vector>())
.addPreprocessingTrainer(new MinMaxScalerTrainer<Integer, Vector>())
.addPreprocessingTrainer(new NormalizationTrainer<Integer, Vector>()
.withP(1))
.addTrainer(new DecisionTreeClassificationTrainer(5, 0))
.fit(ignite, dataCache);
28. 2019 © GridGain Systems
What does Apache Ignite support?
• Classical ML
– classification
– regression
– clustering
• Preprocessing
– binarization, imputing
– normalization, scaling
– string encoding
• Pipelines
• Ensemble Framework
2019 © GridGain Systems
29. 2019 © GridGain Systems
Bagging, Boosting and Stacking
DatasetTrainer<LogisticRegressionModel, Double> trainer =
new LogisticRegressionSGDTrainer(...)...;
BaggedTrainer<Double> baggedTrainer = TrainerTransformers.makeBagged(trainer,
// ensemble size, subsample ration, feature vector size, features subspace dim
7, 0.7, 2, 2,
new onMajorityPredictionsAggregator());
30. 2019 © GridGain Systems
What does Apache Ignite support?
• Classical ML
– classification
– regression
– clustering
• Preprocessing
– binarization, imputing
– normalization, scaling
– string encoding
• Pipelines
• Ensemble Framework
• Continuous learning
2019 © GridGain Systems
31. 2019 © GridGain Systems
Online learning
SVMLinearClassificationTrainer trainer = new SVMLinearClassificationTrainer();
SVMLinearClassificationModel mdl1 = trainer.fit(ignite, dataCache1, vectorizer);
SVMLinearClassificationModel mdl2 = trainer.update(mdl1, ignite, dataCache2,
vectorizer);
32. 2019 © GridGain Systems
What does Apache Ignite support?
• Classical ML
– classification
– regression
– clustering
• Preprocessing
– binarization, imputing
– normalization, scaling
– string encoding
• Pipelines
• Ensemble Framework
• Continuous learning
• SQL in-place inference
2019 © GridGain Systems
33. 2019 © GridGain Systems
IMDB with built-in ML
IgniteModelStorageUtil.saveModel(ignite, model, “titanik_model_tree”);
QueryCursor<List<?>> cursor = cache.query(new SqlFieldsQuery("select " +
"survived as truth, " +
"predict('titanik_model_tree', pclass, age, sibsp, parch, fare, case
sex when 'male' then 1 else 0 end) as prediction " +
"from titanik_train"))
2019 © GridGain Systems
36. 2019 © GridGain Systems
Inference in Ignite ML
2019 © GridGain Systems
Request queue
Response queue
Inference
Service
Inference
Service
Inference
Gateway
37. 2019 © GridGain Systems
Import / Inference API
AsyncModelBuilder mdlBuilder = new IgniteDistributedModelBuilder(ignite, 4, 4);
XGModelParser parser = new XGModelParser();
try (Model<NamedVector, Future<Double>> mdl = mdlBuilder.build(reader, parser);
...
double prediction = mdl.predict(testObj).get();
}
38. 2019 © GridGain Systems
TensorFlow on Apache Ignite
• Ignite Dataset
• IGFS Plugin
• Distributed Training
• More info here
#UnifiedAnalytics #SparkAISummit
>>> import tensorflow as tf
>>> from tensorflow.contrib.ignite import IgniteDataset
>>>
>>> dataset = IgniteDataset(cache_name="SQL_PUBLIC_KITTEN_CACHE")
>>> iterator = dataset.make_one_shot_iterator()
>>> next_obj = iterator.get_next()
>>>
>>> with tf.Session() as sess:
>>> for _ in range(3):
>>> print(sess.run(next_obj))
{'key': 1, 'val': {'NAME': b'WARM KITTY'}}
{'key': 2, 'val': {'NAME': b'SOFT KITTY'}}
{'key': 3, 'val': {'NAME': b'LITTLE BALL OF FUR'}}
39. 2019 © GridGain Systems
Apache Ignite with GridGain ML Python API
GridGain ML client library provides user applications the ability to work
with GridGain ML functionality using Py4J as an integration mechanism.
If you want to use ggml in your project, you may install it from PyPI:
$ pip install ggml
40. 2019 © GridGain Systems
Apache Ignite with GridGain ML Python API
GridGain ML client library provides user applications the ability to work
with GridGain ML functionality using Py4J as an integration mechanism.
If you want to use ggml in your project, you may install it from PyPI:
$ pip install ggml
NB: available only for Apache Ignite master and for GG 8.8
41. 2019 © GridGain Systems
Apache Ignite with GridGain ML Python API
from ggml.regression import LinearRegressionTrainer
trainer = LinearRegressionTrainer()
model = trainer.fit(x_train, y_train)
r2_score(y_test, model.predict(x_test))
42. 2019 © GridGain Systems
Apache Ignite ML Tutorial
https://github.com/apache/ignite/
org.apache.ignite.examples.ml.tutorial
2019 © GridGain Systems
43. 2019 © GridGain Systems
Distributed Machine and Deep Learning
at Scale with Apache Ignite
Links:
• http://ignite.apache.org/
• https://medium.com/tensorflow/tensorflow-on-apache-ignite-99f1fc60efeb
• https://github.com/gridgain/ml-python-api
• http://bit.ly/DataNativesOslo
Email:
● user@ignite.apache.org
● dev@ignite.apache.org
● ybabak@gridgain.com