Deep Learning - Continuous Operations


  1. FullStack Developers Israel CONTINUOUS OPERATIONS DEEP LEARNING | HAGGAI PHILIP ZAGURY
  2. Tikal Knowledge TIKAL INTRO: WHO WE ARE?
     ▸ Tikal helps ISVs in Israel & abroad with their technological challenges.
     ▸ Our engineers are Fullstack Developers with expertise in Android, DevOps, Java, JS, Python, ML
     ▸ We are passionate about technology and specialise in OpenSource technologies.
     ▸ Our Tech and Group leaders help establish & enhance existing software teams with innovative & creative thinking.
     https://www.meetup.com/full-stack-developer-il/
  3. SELF INTRODUCTION: HAGGAI PHILIP ZAGURY - DEVOPS ARCHITECT AND GROUP TECH LEAD
     ▸ My open thinking and open techniques ideology is driven by Open Source technologies and the collaborative manner defining my M.O.
     ▸ My solution-driven approach is strongly based on hands-on work and a deep understanding of Operating Systems, Application stacks and Software languages, Networking, Cloud in general and, today, more and more Cloud Native solutions.
     ▸ Technologies:
       ▸ Linux { just pick a flavour … }
       ▸ *Scripting
       ▸ Git
       ▸ Python/Go
       ▸ Cloud { public/private/hybrid }
       ▸ Docker
       ▸ Kubernetes
  4. MACHINE LEARNING | CONTINUOUS OPERATIONS: THE STORY …
  5. WE NEED “CI/CD” FOR OUR MODEL TRAINING …
     ▸ What he didn’t say is …
       ▸ In-browser training
       ▸ Backend training
       ▸ TensorFlow training
       ▸ TensorFlow serving
       ▸ Storage [ for raw data & model ] …
  6. THE LEARNING CURVE
  7. A RELATIVELY SIMPLE USE CASE …
     [Diagram, steps numbered 1-6. Components: server, client, TensorFlow training server, object store. Labels: serve frontend app; collect images; train; infer; upload images; serve model; get trained model; enrich model with new data; serve Protobuf.]
  8. A CLASSIC APP
     [Diagram, steps numbered 1, 2, 5, 6. Components: server, client, object store. Labels: serve frontend app; collect images; train; infer; upload images; serve model; get trained model.]
  9. MODEL TRAINING …
     ‣ If you're using a pre-trained model, it’s no different from using a backend / an API endpoint!
     ‣ Training processes are complex and require Infrastructure as a Service, on demand
     ‣ Scalability
     ‣ Faster time to market vs. faster results
     ‣ Scaling costs …
  10. STAGE #1
     ‣ python train_model.py
     TENSOR-FLOW TRAINING output:
       Total data size: 332
       Train X: (298, 7, 7, 256)
       Train Y: (298, 2)
       Test X: (34, 7, 7, 256)
       Test Y: (34, 2)
       Train on 298 samples, validate on 34 samples
       Epoch 1/10
       298/298 [==============================] - 1s 3ms/step - loss: 0.5061 - acc: 0.7651 - val_loss: 0.2331 - val_acc: 0.9118
       Epoch 2/10
       298/298 [==============================] - 0s 1ms/step - loss: 0.1361 - acc: 0.9765 - val_loss: 0.0763 - val_acc: 1.0000
       Epoch 3/10
       298/298 [==============================] - 0s 1ms/step - loss: 0.0471 - acc: 0.9966 - val_loss: 0.0365 - val_acc: 1.0000
       Epoch 4/10
       298/298 [==============================] - 0s 1ms/step - loss: 0.0172 - acc: 1.0000 - val_loss: 0.0196 - val_acc: 1.0000
       Epoch 5/10
       298/298 [==============================] - 0s 1ms/step - loss: 0.0123 - acc: 1.0000 - val_loss: 0.0113 - val_acc: 1.0000
       Epoch 6/10
       298/298 [==============================] - 0s 1ms/step - loss: 0.0065 - acc: 1.0000 - val_loss: 0.0080 - val_acc: …
  11. STAGE #2 - DOCKERIZE & PARAMETERIZE …
     ‣ docker run -v "${PWD}/data:/opt/data" tikal/webcam-controller-model:latest
     [TENSOR-FLOW TRAINING output, identical to the training log shown for stage #1]
  12. CONTINUOUS INTEGRATION
     ‣ A Jenkins pipeline
     ‣ Build - get sample data / updated data
     ‣ Deploy model to CPU/GPU
     ‣ Train and record results
     ‣ Promote - upload the new model for the “space invaders” microservice backend
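The "train and record results" and "promote" stages imply some rule for deciding whether a freshly trained model should replace the one being served. The slides do not say what that rule is, so the following is only a minimal sketch under stated assumptions: `TrainingResult` and `should_promote` are hypothetical names, and comparing validation accuracy (with validation loss as a tie-breaker) is an illustrative choice, not the speaker's pipeline logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrainingResult:
    """Recorded outcome of one training run (hypothetical structure)."""
    model_path: str
    val_acc: float
    val_loss: float


def should_promote(candidate: TrainingResult,
                   current: Optional[TrainingResult]) -> bool:
    """Assumed promotion rule: higher validation accuracy wins;
    on equal accuracy, lower validation loss wins."""
    if current is None:
        # Nothing is being served yet, so any trained model is promoted.
        return True
    if candidate.val_acc != current.val_acc:
        return candidate.val_acc > current.val_acc
    return candidate.val_loss < current.val_loss
```

A CI job could call `should_promote` after the training stage and run the upload step only when it returns `True`.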
  13. THE GAME IS JUST A MEANS TO AN END …
     [Diagram: TENSOR-FLOW TRAINING box with inputs "# epochs", "lr", "more flags"]
     Training flags (lines are cut off at the right edge of the slide, marked with …):
       flags = tf.app.flags
       flags.DEFINE_float("lr", 0.0001, "Learning Rate")
       flags.DEFINE_string("units", "((50, 0.2), (40, 0.1))", "Configuration of hidden un…
                           "Expected: tuple of tuple pairs. Each pair represent one hidde…
                           "For instance: "((100, 0.2), (50, 0.3))" will create dense h…
                           "dropout layer with rate of 0.2. Afterwards, it will create de…
                           "dropout layer with rate of 0.3. If you wish to have hidden la…
                           "second value. Example: "((100,), (50, 0.3))"")
       flags.DEFINE_integer("epochs", 10, "Number of epochs")
       flags.DEFINE_float("batch_frac", 0.3, "The fraction of training examples to consid…
                          "For instance, 0.1 will divide the training to 10 batches")
       flags.DEFINE_boolean("draw_plot", False, "Whether to draw a plot at the end")
       flags.DEFINE_boolean("export_js", False, "Whether to export to a tenorflow.js mode…
       FLAGS = flags.FLAGS
     ‣ We need to train our model with different parameters to reach the optimal model parameters …
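The `units` flag above packs the hidden-layer layout into a string such as `"((50, 0.2), (40, 0.1))"`. How the original script decodes it is not visible on the slide, so this is a minimal sketch assuming the string is a Python tuple literal; `parse_units` is a hypothetical helper, and treating a one-element pair like `(100,)` as "no dropout" is inferred from the truncated help text rather than confirmed by it.

```python
import ast
from typing import List, Optional, Tuple


def parse_units(spec: str) -> List[Tuple[int, Optional[float]]]:
    """Turn a units string into [(hidden_units, dropout_rate_or_None), ...]."""
    # literal_eval only accepts Python literals, so arbitrary code is rejected.
    parsed = ast.literal_eval(spec)
    layers = []
    for pair in parsed:
        if not isinstance(pair, tuple) or not 1 <= len(pair) <= 2:
            raise ValueError(f"expected (units,) or (units, dropout), got {pair!r}")
        units = int(pair[0])
        # Assumption: a missing second value means "no dropout layer".
        dropout = float(pair[1]) if len(pair) == 2 else None
        layers.append((units, dropout))
    return layers


if __name__ == "__main__":
    print(parse_units("((50, 0.2), (40, 0.1))"))
```

Validating the string up front means a malformed flag fails before any training time is spent.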
  14. SCALING / MULTIPLEXING … TENSORFLOW SUPPORTS MULTI-PART / DISTRIBUTED FLOWS
     ‣ Run the same model with different parameters in order to choose between the most efficient, the most accurate and the most cost-effective pipeline!
     ‣ Most efficient # of epochs / params
     https://www.tensorflow.org/performance/datasets_performance
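Running the same model with different parameters amounts to a parameter sweep. The sketch below only builds the command lines for such a sweep and does not execute anything. The `--lr` and `--epochs` flag names come from the flag definitions shown earlier; that the container image forwards extra arguments to the training script is an assumption, and `build_sweep` is a hypothetical helper.

```python
import itertools
from typing import List, Sequence


def build_sweep(image: str,
                data_dir: str,
                learning_rates: Sequence[float],
                epochs: Sequence[int]) -> List[List[str]]:
    """Return one `docker run` argv per (learning rate, epochs) combination."""
    commands = []
    for lr, n_epochs in itertools.product(learning_rates, epochs):
        commands.append([
            "docker", "run",
            "-v", f"{data_dir}:/opt/data",
            image,
            # Assumption: the image entrypoint passes these on to the script.
            f"--lr={lr}",
            f"--epochs={n_epochs}",
        ])
    return commands


if __name__ == "__main__":
    sweep = build_sweep("tikal/webcam-controller-model:latest",
                        "/tmp/data", [0.0001, 0.001], [10, 20])
    for argv in sweep:
        print(" ".join(argv))
```

Each argv list could be handed to `subprocess.run` locally, or translated into one job per combination on a cluster.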
  15. A/B TESTING / CANARY RELEASES ?!
     [Diagram: MODEL VER 1.0, MODEL VER 1.7 and MODEL VER 2.0 served from a storage provider, with traffic split 60% / 30% / 10%; results are collected from in-browser training.]
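A 60% / 30% / 10% split across three model versions can be implemented as weighted routing. The sketch below is one way to do it, not the mechanism used in the talk: which version receives which share is an assumption (the slide lists the versions and percentages without an explicit mapping), and `pick_model_version` is a hypothetical helper. Hashing the client ID keeps each client on the same version across requests.

```python
import hashlib
from typing import Sequence, Tuple

# Assumed mapping of versions to traffic weights (percent).
SPLIT = (("1.0", 60), ("1.7", 30), ("2.0", 10))


def pick_model_version(client_id: str,
                       split: Sequence[Tuple[str, int]] = SPLIT) -> str:
    """Deterministically map a client to a model version by weight."""
    total = sum(weight for _, weight in split)
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as a stable pseudo-random number.
    bucket = int.from_bytes(digest[:8], "big") % total
    upper = 0
    for version, weight in split:
        upper += weight
        if bucket < upper:
            return version
    raise AssertionError("unreachable: weights cover every bucket")


if __name__ == "__main__":
    print(pick_model_version("browser-session-42"))
```

Shifting weight towards the newest version step by step turns the same function into a canary release.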
  16. TRANSLATION …
     ▸ A flexible training model
     ▸ Parameterized flow
     ▸ Model testing
     ▸ Promotion mechanism
     ▸ Data import and preprocessing
     ▸ Post-processing
  17. REQUIREMENTS-DRIVEN SOLUTION(S)
  18. OPTIONS - AWS ML
     ▸ Use custom DL AMIs [ we used them to get started … ]
  19. OPTIONS - AWS ML
     ▸ Use custom DL AMIs [ we used them to get started … ]
  20. OPTIONS - AWS ML
  21. OPTIONS - GCP ML/DL
     ▸ Assume you develop in the cloud / on the cloud
     ▸ Consume CPUs/GPUs/TPUs constantly
     ▸ Adjust your workflow to Google patterns (which isn’t a bad thing …)
  22. OPTIONS - GCP ML/DL
     ▸ TPU lock-in?
     ▸ Wouldn’t it be nice to benchmark TPU & GPU on another provider?!
  23. OPTIONS - AZURE ML/DL
  24. IT’S ALL ABOUT THE PIPELINE / WORKFLOW
  25. IT’S ALL ABOUT THE PIPELINE / WORKFLOW
     ‣ You might be able to make this work …
     ‣ But!
  26. THERE’S A PATTERN HERE …
     [Diagram, steps numbered 1-6: IDE; model serving; model storage; parameter injection; parameterized training; training orchestrator.]
  27. STAGE #3 - ADJUST OUR DOCKERIZED APP TO MY VENDOR …
     ‣ docker run -v "${PWD}/data:/opt/data" tikal/webcam-controller-model:latest
     [TENSOR-FLOW TRAINING output, identical to the training log shown for stage #1]
  28. DO I CARE ABOUT VENDOR LOCK-IN ?! - LET’S TALK MULTI-CLOUD
     [Diagram: my laptop and the cloud ("I need CPU / GPU / TPU"); three TENSOR-FLOW TRAINING targets; caption "Adjust / wrap our code to suit the vendor".]
  29. IT’S NOT ONLY A MATTER OF VENDOR LOCK-IN! - IT’S MULTI-CLOUD
     [Diagram: my laptop and the cloud ("I need CPU / GPU / TPU"); options CPU, GPU, TPU; note "Only in Google ATM".]
  30. OPERATORS
  31. TF [TENSORFLOW] OPERATOR
  32. STAGE #4 - WRAP CODE TO SUPPORT THE WORKER | ADMIN | PS OPERATOR PATTERN
     ‣ docker run -v "${PWD}/data:/opt/data" tikal/webcam-controller-model:latest
     [TENSOR-FLOW TRAINING output, identical to the training log shown for stage #1]
  33. ML/DL AS A SERVICE - ON YOUR INFRASTRUCTURE
     ‣ Package model
     ‣ Package configuration
  34. PRE-PACKAGED MODELS FOR TRAINING / SERVING
     ‣ Apply to Kubernetes via ksonnet
  35. MODEL TRAINING
     [Diagram: DevEnv; push TensorFlow container to registry; create tfjob; store results.]
     https://www.slideshare.net/barbarafusinska/hassle-free-scalable-machine-learning-learning-with-kubeflow
     https://codelabs.developers.google.com/codelabs/kubeflow-introduction/index.html?index=..%2F..%2Fio2018#2
  36. MODEL SERVING
     [Diagram: DevEnv; consume / use model in local development or in the cloud; push application container to registry; deploy app to K8s; use results; use & improve model.]
  37. MODEL TRAINING & SERVING
     [Diagram, steps numbered 1-6: DevEnv; push TensorFlow container to registry; train model in Kubeflow; store results; consume / use model in local development or in the cloud; push application container to registry; deploy app to K8s; use results; use & improve model.]
  38. A/B TESTING
     [Diagram, steps numbered 1-7: same flow as the model training & serving slide, plus step 7: use Ambassador for A/B testing.]
  39. A ONE-STOP SHOP FOR EVERYTHING …
     On-prem / cloud “PaaS” on K8s
     ▸ Job
     ▸ Cron Job
     ▸ Pod
     ▸ Replica sets (multi-step / distributed)
  40. TFJOB CRD - CUSTOM RESOURCE DEFINITION
       hagzag@model-tarining 👉 kubectl get tfjob
       NAME   AGE
       wcm    1d
  41. OUR IMAGE IN KUBEFLOW …
       …
         clusterName: "minikube"
         creationTimestamp: 2018-06-23T07:31:54Z
         generation: 1
         labels:
           app.kubernetes.io/deploy-manager: ksonnet
         name: wcm
         namespace: wcm
         resourceVersion: "94971"
         selfLink: /apis/kubeflow.org/v1alpha1/namespaces/wcm/tfjobs/wcm
         uid: 80ab9472-76b7-11e8-be6d-0800279cc216
       spec:
         RuntimeId: werb
         replicaSpecs:
         - replicas: 3
           template:
             metadata:
               creationTimestamp: null
             spec:
               containers:
               - image: tikal/webcam-controller-model:latest
                 name: tensorflow
                 resources: {}
               restartPolicy: OnFailure
           tfPort: 2222
           tfReplicaType: WORKER
         - replicas: 2
           template:
           …
     ‣ Next step is to wrap our model with some Operator / TF data so Kubeflow can display it …
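A manifest like the one above can also be generated programmatically, which helps when the replica count or image tag changes per run. The sketch below mirrors only the fields visible on the slide; `build_tfjob` is a hypothetical helper, the `apiVersion` is inferred from the `selfLink` path, and the second (truncated) replica spec is deliberately left out because its type is not shown. The `v1alpha1` schema dates from 2018, so newer Kubeflow releases may expect a different structure.

```python
import json
from typing import Any, Dict


def build_tfjob(name: str, namespace: str, image: str,
                workers: int) -> Dict[str, Any]:
    """Build a TFJob manifest with a single WORKER replica spec."""
    return {
        # Inferred from selfLink: /apis/kubeflow.org/v1alpha1/...
        "apiVersion": "kubeflow.org/v1alpha1",
        "kind": "TFJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "replicaSpecs": [
                {
                    "replicas": workers,
                    "tfReplicaType": "WORKER",
                    "tfPort": 2222,
                    "template": {
                        "spec": {
                            "containers": [
                                {"name": "tensorflow", "image": image}
                            ],
                            "restartPolicy": "OnFailure",
                        }
                    },
                }
            ]
        },
    }


if __name__ == "__main__":
    manifest = build_tfjob("wcm", "wcm",
                           "tikal/webcam-controller-model:latest", 3)
    # kubectl accepts JSON as well as YAML manifests.
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and passed to `kubectl apply -f`.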
  42. USE S3 AND TENSORBOARD …
     ‣ Reuse training results and display them in your usual TensorFlow tooling.
  43. WANT MORE
     ‣ Demo model -> https://github.com/tikalk/webcam-controller-model
     ‣ Kubeflow - the main “engine” kubeflow.io
     ‣ It also supports other tools … https://github.com/dwhitena/kubeflow_pachyderm
     ‣ https://github.com/SeldonIO/seldon-core
  44. EVEN MORE
     [Diagram: preprocess | ingest data; train; store; serve.]
  45. FullStack Developers Israel
