Machine Learning on
Streaming Data
with Apache Kafka, Apache Beam, & TensorFlow
About Us
Mikhail Chrestkha
Machine Learning Specialist
Google Cloud
linkedin.com/in/mchrestkha
Stéphane Maarek
CEO & Kafka Instructor
DataCumulus
linkedin.com/in/stephanemaarek
Big Thanks to:
Julianne Cuneo
Big Data Specialist, Google Cloud
Kai Waehner
Technology Evangelist, Confluent
Agenda
1. Motivation
2. Architecture
3. Use Case Walk-Through w/ Demo
4. Summary
1 Motivation
Technology Landscape
Smart Analytics
Streaming
InfoWorld’s 2019 Technology of the Year Award Winners:
● Apache Beam
● Apache Kafka
● Elastic Stack
● DataStax Enterprise
● Firebase
● Horovod
● H2O Driverless AI
● Keras
● Kubernetes
● LLVM
● .NET Core
● PyTorch
● Redis
● TensorFlow
● Visual Studio Code
● XGBoost
Cloud
?
https://www.globenewswire.com/news-release/2019/01/30/1707685/0/en/InfoWorld-Announces-2019-Technology-of-the-Year-Award-Winners.html
Data Ingestion
Data Analysis & Transformation
Trainer
Model Evaluation & Validation
Serving
Notebook
Orchestration
ML Framework
ML Platform
OSS → Managed Service
● Apache Kafka: Event streaming platform
  ○ Confluent Cloud: Monitoring, Replication, Data Balancing
● Apache Beam: Data processing pipelines, unified batch & streaming
  ○ Dataflow: Automated resource management of workers
● TensorFlow: Robust foundation for machine and deep learning
  ○ Cloud Machine Learning Engine
    Training: Distributed training infrastructure that supports CPUs, GPUs, and TPUs
    Serving: Host models for batch & online prediction
2 Architecture
Reference Kafka ML Architecture
● Data pipelines are simplified
● Building analytic models is decoupled from serving them
● Usage of real time or batch as needed
● Analytic models can be deployed in a performant, scalable and mission-critical environment
Kai Waehner
Technology Evangelist, Confluent
https://www.confluent.io/blog/build-deploy-scalable-machine-learning-production-apache-kafka/
[Architecture diagram, components as labeled]
● Confluent Cloud (managed by Confluent): Kafka cluster with Topic 1 (raw transactions) and Topic 2 (predictions)
● Producer, Consumer, Consumer
● Processing: Cloud Dataflow (Dataflow template)
● Data warehouse: BigQuery
● ML notebook (KSQL, SQL): development / experimentation, submits ML training jobs
● ML training: Cloud ML Engine
● Deploy ML model (automate w/ Airflow)
● ML serving: Cloud ML Engine
● Two paths are marked: analytics, ML training & deployment path; ML serving path
Leverage managed services to simplify & focus on code not infrastructure
3 Use Case Walk-Through
Kaggle Case Study
Fraud Detection of Credit Card Transactions
● Collect transaction data
● Analyze historical data
● Train model on historic sample
● Evaluate model based on precision & recall
● Predict fraud on new streaming data
492 fraud cases (0.172%) out of 284,807 transactions
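Why precision & recall rather than accuracy: a minimal sketch using only the two counts from this slide. The `ConfusionCounts` helper and the "never flags fraud" baseline are illustrative assumptions, not part of the talk's demo code.

```python
from dataclasses import dataclass

TOTAL_TRANSACTIONS = 284_807  # from the slide
FRAUD_TRANSACTIONS = 492      # from the slide


@dataclass
class ConfusionCounts:
    """Counts of a binary classifier's outcomes (hypothetical helper)."""
    true_pos: int
    false_pos: int
    false_neg: int
    true_neg: int


def precision(c: ConfusionCounts) -> float:
    # Share of flagged transactions that really are fraud.
    flagged = c.true_pos + c.false_pos
    return c.true_pos / flagged if flagged else 0.0


def recall(c: ConfusionCounts) -> float:
    # Share of real fraud cases that were flagged.
    actual = c.true_pos + c.false_neg
    return c.true_pos / actual if actual else 0.0


def accuracy(c: ConfusionCounts) -> float:
    total = c.true_pos + c.false_pos + c.false_neg + c.true_neg
    return (c.true_pos + c.true_neg) / total


fraud_rate = FRAUD_TRANSACTIONS / TOTAL_TRANSACTIONS

# Baseline "model" that labels every transaction as legitimate.
never_flags = ConfusionCounts(
    true_pos=0,
    false_pos=0,
    false_neg=FRAUD_TRANSACTIONS,
    true_neg=TOTAL_TRANSACTIONS - FRAUD_TRANSACTIONS,
)

print(f"fraud rate: {fraud_rate:.4%}")
print(f"baseline accuracy:  {accuracy(never_flags):.4%}")
print(f"baseline precision: {precision(never_flags):.1%}")
print(f"baseline recall:    {recall(never_flags):.1%}")
```

The baseline scores very high on accuracy while catching no fraud at all, which is why the walk-through evaluates on precision & recall.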
● Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In
Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015
● Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a
practitioner perspective, Expert systems with applications,41,10,4915-4928,2014, Pergamon
● Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning
strategy, IEEE transactions on neural networks and learning systems,29,8,3784-3797,2018,IEEE
○ Dal Pozzolo, Andrea Adaptive Machine learning for credit card fraud detection ULB MLG PhD thesis (supervised by G. Bontempi)
● Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. Scarff: a scalable framework for streaming
credit card fraud detection with Spark, Information fusion,41, 182-194,2018,Elsevier
● Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection:
assessment and visualization, International Journal of Data Science and Analytics, 5,4,285-300,2018,Springer International Publishing
https://opendatacommons.org/licenses/dbcl/1.0/
DEMO 1 - 5 min
Sending our credit card data
Confluent Cloud, Creating a Topic, Python Script, Security
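A rough sketch of what such a Python script might do, not the demo's actual code. The topic name, the sample rows (made-up values in the dataset's Time / V1 / Amount / Class column layout, most V columns omitted), and the `InMemoryProducer` stand-in are assumptions. The config keys follow the librdkafka naming used by clients such as confluent-kafka; all values are placeholders.

```python
import csv
import io
import json

# Security settings for a SASL_SSL-protected cluster (placeholder values).
producer_config = {
    "bootstrap.servers": "<BROKER_HOST>:9092",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<API_KEY>",
    "sasl.password": "<API_SECRET>",
}

TOPIC = "transactions"  # hypothetical topic name

# Made-up rows for illustration only.
SAMPLE_CSV = """Time,V1,Amount,Class
0,0.1,10.00,0
1,-0.2,250.50,0
"""


def rows_to_messages(csv_text):
    """Yield (key, value) byte pairs, one JSON-encoded message per CSV row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for index, row in enumerate(reader):
        key = str(index).encode("utf-8")
        value = json.dumps(row).encode("utf-8")
        yield key, value


class InMemoryProducer:
    """Stand-in for a real Kafka producer so the sketch runs offline."""

    def __init__(self, config):
        self.config = config
        self.sent = []

    def produce(self, topic, key, value):
        self.sent.append((topic, key, value))


producer = InMemoryProducer(producer_config)
for key, value in rows_to_messages(SAMPLE_CSV):
    producer.produce(TOPIC, key=key, value=value)

print(f"queued {len(producer.sent)} messages for topic {TOPIC!r}")
```

With a real client, the stand-in class would be replaced by the library's producer, and credentials would come from the environment rather than the script.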
Kafka to BigQuery
Dataflow Template
Java Code
KafkaIO.<String, String>read() BigQueryIO.writeTableRows()
Create a template for easy re-usability by an analyst
Images from https://beam.apache.org/documentation/pipelines/design-your-pipeline/
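The step between the Kafka read and the BigQuery write is a per-record conversion. The talk does this in Java; below is a language-neutral sketch in Python of that conversion only, assuming JSON-encoded message values and a hypothetical four-column schema.

```python
import json

# Hypothetical target columns and their Python-side casts.
SCHEMA = {"Time": float, "V1": float, "Amount": float, "Class": int}


def to_table_row(kafka_value: bytes) -> dict:
    """Convert one Kafka message value into a dict shaped like a table row."""
    record = json.loads(kafka_value.decode("utf-8"))
    row = {}
    for column, cast in SCHEMA.items():
        raw = record.get(column)
        # Missing or empty fields become NULLs instead of failing the pipeline.
        row[column] = cast(raw) if raw not in (None, "") else None
    return row


row = to_table_row(b'{"Time": "0", "V1": "0.1", "Amount": "10.00", "Class": "0"}')
print(row)
```

Keeping the conversion in a small pure function makes it easy to unit-test separately from the pipeline that calls it.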
Explore data & train ML model
● from ksql import KSQLAPI: Query directly from topic
● %%bigquery: Query petabytes of data
● gcloud ml-engine jobs submit training: Submit ML training job
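For the training-job step, a small sketch that only assembles the command line from a notebook and prints it. The job name, bucket, package path, module name, and runtime version are hypothetical placeholders; the flags are the common ones for `gcloud ml-engine jobs submit training`.

```python
import shlex


def build_training_command(job_name, job_dir, package_path, module_name,
                           region, runtime_version):
    """Return the gcloud argument list for submitting a training job."""
    return [
        "gcloud", "ml-engine", "jobs", "submit", "training", job_name,
        "--job-dir", job_dir,
        "--package-path", package_path,
        "--module-name", module_name,
        "--region", region,
        "--runtime-version", runtime_version,
    ]


# All values below are placeholders for illustration.
command = build_training_command(
    job_name="fraud_training_001",
    job_dir="gs://<YOUR_BUCKET>/fraud/jobs/001",
    package_path="trainer/",
    module_name="trainer.task",
    region="us-central1",
    runtime_version="1.13",
)

# Print a copy-pasteable command; nothing is executed here.
print(" ".join(shlex.quote(part) for part in command))
```

Passing the list to `subprocess.run` would submit the job; the sketch stops at printing so it can be inspected first.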
DEMO 2 - 5 min
Dataflow template & job
Jupyter: KSQL, BQML, TensorFlow CMLE job
Send Predictions back to Kafka
Java Code
KafkaIO.<String, String>read() → Request / Response to hosted ML model (Cloud Machine Learning Engine) → KafkaIO.<String, String>write()
Train Model → Publish models
Image from https://beam.apache.org/documentation/pipelines/design-your-pipeline/
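The request / response exchange with the hosted model can be sketched as follows. Online prediction takes a JSON body with an `instances` list and returns a `predictions` list; the `fraud_score` field, the threshold rule, and `fake_prediction_service` are stand-ins invented here so the sketch runs without a deployed model.

```python
import json


def build_request(transactions):
    """Wrap transactions in the {"instances": [...]} request body."""
    return json.dumps({"instances": transactions})


def parse_response(body):
    """Extract the list under "predictions" from the response body."""
    return json.loads(body)["predictions"]


def fake_prediction_service(request_body):
    # Stand-in for the hosted model: a made-up rule, not a trained model.
    instances = json.loads(request_body)["instances"]
    predictions = [
        {"fraud_score": 0.9 if inst["Amount"] > 1000 else 0.1}
        for inst in instances
    ]
    return json.dumps({"predictions": predictions})


def score(transactions, call_service):
    """Score a batch and build one message per transaction for the predictions topic."""
    predictions = parse_response(call_service(build_request(transactions)))
    return [
        json.dumps({"transaction": t, "prediction": p}).encode("utf-8")
        for t, p in zip(transactions, predictions)
    ]


messages = score([{"Amount": 5000.0}, {"Amount": 12.5}], fake_prediction_service)
for message in messages:
    print(message.decode("utf-8"))
```

Batching several transactions into one request, as done here, trades a little latency for fewer round trips to the model endpoint.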
DEMO 3 - 5 min
(1) Deploy model as an end point
(2) Prediction sent to Kafka topic to be consumed
(3) Track models & monitor predictions in CMLE UI
Futuristic Architecture: Pure Kafka-based ML
Resilient, highly available, sync & async
[Architecture diagram, components as labeled]
● Confluent Cloud (managed by Confluent): Topic 1 (raw transactions), Topic 2 (predictions), internal topic for the ML model (compacted?)
● Producer, Consumer, Consumer
● Kafka Streams ML: training, serving, model state
● Synchronous application: Interactive Query API, gRPC or REST API
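The "(compacted?)" idea can be illustrated with a toy model of log compaction: for each key only the latest value is retained, and a null value (tombstone) removes the key. The key names and version strings below are hypothetical, and this simulates the outcome only, not Kafka's actual cleaner.

```python
def compact(log):
    """Return the latest value per key, dropping keys whose last record is a tombstone."""
    latest = {}
    for key, value in log:
        if value is None:
            # Tombstone: the key is deleted after compaction.
            latest.pop(key, None)
        else:
            latest[key] = value
    return latest


# Hypothetical model-update records appended to an internal topic over time.
model_log = [
    ("fraud-model", "weights-v1"),
    ("fraud-model", "weights-v2"),
    ("experimental-model", "weights-v1"),
    ("experimental-model", None),
    ("fraud-model", "weights-v3"),
]

model_state = compact(model_log)
print(model_state)
```

An application restoring its model state from such a topic would only need to read the latest record per model, not the full update history.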
4 Summary
Summary
● Kafka + Beam + TensorFlow = Great foundation for the future
○ Batch today → streaming tomorrow
○ Small data today → big data tomorrow
○ Shallow learning today → deep learning tomorrow
● Make data & ML easier for yourself by using managed services
● Build for many other use cases:
○ Predictive maintenance
○ Logistics routing
○ Image search & recommendations in e-commerce
Smart Analytics
Streaming
Cloud
Talk to Google Cloud at Booth K1
Learn More
● Blog: Enabling connected transformation with Apache Kafka and TensorFlow on Google Cloud Platform: bit.ly/2CHERol
● KafkaIO on Beam: bit.ly/2YwL3Jc
● KafkaToBigQuery Dataflow Template Example: bit.ly/2HQqVN0
Contact us
linkedin.com/in/mchrestkha
linkedin.com/in/stephanemaarek
Questions

Machine Learning on Streaming Data using Kafka, Beam, and TensorFlow (Mikhail Chrestkha, Google Cloud; Stephane Maarek, DataCumulus) Kafka Summit NYC 2019
