The Next Gen AI Infrastructure
for the Public AI Cloud
By Adam Gibson
Our community software gets
160,000 downloads per month,
used by teams in half of the
Fortune 500.
About Skymind
● Builds AI infrastructure for operating models in production.
● Allows model access from cloud, server, desktop, and mobile, providing tooling for models
such as revision history and accuracy monitoring over time.
● Created the widely used open-source AI framework Deeplearning4j, powering AI for large
enterprises globally, from banking to e-commerce.
SKIL:
ML and DL Model Server
SKIL Discover:
ML and DL Validation
& Training Tool
Products
The Hierarchy of AI
Some of the companies that own Core AI
technologies
Fewer than 5% of businesses
globally derive value from AI
Pyramid of AI
● AI is at most a buzzword.
● Lacks the basic infrastructure needed to derive value from AI,
such as basic IT infrastructure.
Pre-Digital Transformation
● Executives still question the value of AI for their
business and are often skeptical of its benefits.
● Wants to see benefits almost immediately, before
any real investment.
LEVEL 4: Heard of AI
What is A.I.?
● Has static rules in place.
● Deployed dashboards and BI, calls it AI.
● Very little if any modern use of machine learning.
● If it uses machine learning at all, it is probably
a checkbox item rather than a source of
value.
● May have a data scientist or two who lack the
infrastructure to do the job well.
Level 3: Everything’s AI
● Capturing value from machine learning.
● Produces models meaningful to business.
● Has centralized infrastructure for analyzing
data within line of business.
● Invested in AI but may not know total return
on investment.
● Often building models and running
experiments without oversight from
business.
● Uses but does not build own infrastructure.
Credit: McKinsey Global Institute
Level 2: Adopted AI
● Has own AI tools written from scratch
● Often has products powered by AI
● Software is a core competency
● Often has AI R&D lab
● Probably sells cloud infrastructure or dev tools
● Often employs vast majority of AI talent
Level 1: Mastered AI
Components to Build AI Infrastructure
The Infrastructure
Platform-agnostic
● Public Cloud
● On-Prem
● Hybrid
● Embeddable
● Edge
ML algorithms and infrastructure should go
wherever the data and compute are.
● Configurable
● Auto-scaling
● Legacy Integration
● Multi-Cloud Flexibility
Typical Development
DEFINE PROBLEM
ACQUIRE DATA
TRANSFORM DATA
TRAIN MODEL
VALIDATE MODEL
REPEAT
Data Storage
● As organizations prepare enterprise AI strategies and build the necessary
infrastructure, storage must be a top priority. That includes ensuring the
proper storage capacity, IOPS and reliability to deal with the massive
amounts of data required for effective AI.
● AI applications depend on source data, so an organization needs to know
where the source data resides and how AI applications will use it.
● As databases grow over time, companies need to
monitor capacity and plan for expansion as
needed.
Networking Infrastructure
● In order to provide the high efficiency at scale required to support AI,
organizations will likely need to upgrade their networks.
● Scalability must be a high priority, and that will require high-bandwidth,
low-latency and creative architectures.
● Intent-based networks can anticipate network demands
or security threats and react in real time.
Data Processing
● A CPU-based environment can handle basic AI workloads, but deep
learning involves multiple large data sets and deploying scalable neural
network algorithms. For that, CPU-based computing might not be
sufficient.
● Deploying GPUs enables organizations to optimize their data center
infrastructure and gain power efficiency.
Data Management and Governance
● Does the organization have the proper mechanisms in place to deliver data
in a secure and efficient manner to the users who need it?
● Should be accessible from a variety of endpoints, including mobile devices
via wireless networks.
● Data access controls: privacy and security issues
Model Training
Main Steps
● Read Data from Source.
● Analyze with statistics and normalize for
Neural Network Input.
● Train by sending input into the neural
network and calculating how to update
the network weights using the
backpropagation algorithm.
● Repeat until model makes no more
improvements.
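The main steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the "network" is a single sigmoid neuron, the dataset is a made-up in-memory list, and the names `normalize`, `train`, `predict`, `patience` and `tol` are hypothetical, not part of any product mentioned in these slides.

```python
import math


def normalize(rows):
    """Scale each feature column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Forward pass of a single sigmoid neuron."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(features, labels, lr=0.5, patience=5, max_epochs=500, tol=1e-4):
    """Gradient descent; stops once the loss has not improved for `patience` epochs."""
    weights, bias = [0.0] * len(features[0]), 0.0
    best_loss, stale = float("inf"), 0
    n = len(features)
    for _ in range(max_epochs):
        grad_w, grad_b, loss = [0.0] * len(weights), 0.0, 0.0
        for x, y in zip(features, labels):
            p = min(max(predict(weights, bias, x), 1e-12), 1 - 1e-12)
            loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
            err = p - y  # cross-entropy gradient w.r.t. the pre-activation
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        # Weight update (backpropagation reduces to this for one neuron).
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
        loss /= n
        if best_loss - loss > tol:
            best_loss, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return weights, bias, best_loss


# Step 1: "read" data (an in-memory stand-in for a real source).
raw = [[1.0, 200.0], [2.0, 180.0], [3.0, 240.0],
       [8.0, 900.0], [9.0, 950.0], [10.0, 1000.0]]
labels = [0, 0, 0, 1, 1, 1]
# Step 2: normalize. Steps 3 and 4: train until no more improvement.
features = normalize(raw)
weights, bias, final_loss = train(features, labels)
```

The stopping rule here (no improvement larger than `tol` for `patience` epochs) is one common reading of "repeat until the model makes no more improvements"; a validation set would normally be used instead of the training loss.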
Problems
● A model learns better with a large dataset.
In the enterprise, this data sometimes
doesn’t fit on a single machine.
Model Training: Multi-Node Training Cluster
Scaled Out Training Cluster
Architecture
● Any midrange VM or dedicated machine for
Zookeeper
● 1 or more Multi-GPU systems (DGX class or
similar) for SKIL
● Gluster/HDFS provides global file system for
data
Model Training: Hybrid Cloud
GPU Training Cluster
Architecture
● GPU Cluster (e.g. DGX-1 servers)
● Existing Hadoop cluster is used for
○ ETL (Preparing data for training on GPU) or
○ Batch Inference for distributed scoring with
trained models.
Model Training: Multi Cluster
GPU Training Cluster
CPU Inference Cluster
Architecture
● Powerful GPU Servers or Spark Cluster for training
models.
● Separate (multiple) deployments-only clusters for
production deployments of ML models as REST
APIs.
Model Training: Batch Training with Spark
The flow is largely divided into two
stages:
● Scheduling: Launch executors
through the cluster manager
● Execution: Manage executors to
perform tasks
Model Training: Work Distribution Across Executors
Model Training: Single Machine vs Spark Cluster
• Total runtime on cluster (including
evaluation) was about 1.1 hours
• Linear scaling over dozens of nodes in
Spark cluster
Model Deployment (Inferences)
Model Deployment: The Applications
REST API
RPA
Application
Model Deployment: Deployment | Application
Model Deployment: Deployments
● Manage Model Deployment through API: Inspecting, updating, removing
models and deployments
GET, POST, or DELETE /deployments
● Each deployment can be assigned an ID (“deploymentID”); you
can GET, POST, or DELETE by referencing this ID.
GET, POST, or DELETE /models
● Each model can be assigned an ID
(“modelID”); you can GET, POST,
or DELETE by referencing this ID.
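To make the GET / POST / DELETE pattern concrete, here is a minimal in-memory sketch of how a `/deployments` resource keyed by `deploymentID` could behave. It is not the model server's actual API: the class `DeploymentRegistry`, the `handle` method, the status codes and the payload fields are all assumptions chosen for illustration.

```python
import itertools


class DeploymentRegistry:
    """Hypothetical in-memory stand-in for a deployment store."""

    def __init__(self):
        self._deployments = {}
        self._ids = itertools.count(1)

    def handle(self, method, path, body=None):
        """Route a (method, path) pair and return (status_code, payload)."""
        parts = [p for p in path.split("/") if p]
        if not parts or parts[0] != "deployments" or len(parts) > 2:
            return 404, {"error": "unknown path"}

        if len(parts) == 1:  # collection: /deployments
            if method == "GET":
                return 200, list(self._deployments.values())
            if method == "POST":
                deployment_id = next(self._ids)
                record = {**(body or {}), "deploymentID": deployment_id}
                self._deployments[deployment_id] = record
                return 201, record
            return 405, {"error": "method not allowed"}

        # single item: /deployments/<deploymentID>
        try:
            deployment_id = int(parts[1])
        except ValueError:
            return 400, {"error": "deploymentID must be an integer"}
        if deployment_id not in self._deployments:
            return 404, {"error": "no such deployment"}
        if method == "GET":
            return 200, self._deployments[deployment_id]
        if method == "POST":  # update the existing record in place
            record = {**self._deployments[deployment_id], **(body or {}),
                      "deploymentID": deployment_id}
            self._deployments[deployment_id] = record
            return 200, record
        if method == "DELETE":
            del self._deployments[deployment_id]
            return 204, None
        return 405, {"error": "method not allowed"}


registry = DeploymentRegistry()
status, created = registry.handle("POST", "/deployments", {"name": "fraud-v1"})
status, fetched = registry.handle("GET", f"/deployments/{created['deploymentID']}")
```

A `/models` resource keyed by `modelID` would follow the same shape.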
Model Deployment: Inference
Real-Time (REST Endpoint)
● Standard RESTful API. All requests and responses use the ubiquitous JSON format. Our model
server also supports binary multi-part uploads of images in their compressed representation
minimizing network overhead.
Transform Endpoints
● Allow previously defined transforms to be deployed, enabling distribution in a microservice architecture
(CSV or Image only). Each transform is exposed as its own independent endpoint.
KNN Endpoints
● Support uploading a series of vectors and looking up their nearest neighbors for recommendation
or clustering use-cases. This is implemented in an efficient manner with the VPTree data structure.
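For readers unfamiliar with the VPTree (vantage-point tree) data structure, the sketch below shows the general idea: each node picks a vantage point, splits the remaining points at the median distance to it, and uses the triangle inequality to skip subtrees during lookup. This is a textbook-style illustration, not the model server's implementation; the class layout and the `nearest` signature are assumptions.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class VPTree:
    """Minimal vantage-point tree for exact k-nearest-neighbour search.

    Requires at least one point.
    """

    def __init__(self, points, distance=euclidean, rng=None):
        self.distance = distance
        rng = rng or random.Random(0)
        points = list(points)
        # Pick a random vantage point, then split the rest at the median distance.
        self.vantage = points.pop(rng.randrange(len(points)))
        self.radius = 0.0
        self.inside = self.outside = None
        if not points:
            return
        points.sort(key=lambda p: distance(self.vantage, p))
        mid = len(points) // 2
        self.radius = distance(self.vantage, points[mid])
        inner, outer = points[:mid], points[mid:]  # inner <= radius <= outer
        if inner:
            self.inside = VPTree(inner, distance, rng)
        if outer:
            self.outside = VPTree(outer, distance, rng)

    def nearest(self, query, k=1):
        """Return up to k (distance, point) pairs, closest first."""
        best = []  # kept sorted by distance, at most k entries

        def tau():
            # Current search radius: distance of the k-th best match so far.
            return best[-1][0] if len(best) == k else math.inf

        def visit(node):
            if node is None:
                return
            d = self.distance(query, node.vantage)
            best.append((d, node.vantage))
            best.sort(key=lambda item: item[0])
            del best[k:]
            # Search the likelier side first; visit the other side only if
            # the triangle inequality cannot rule it out.
            if d < node.radius:
                visit(node.inside)
                if d + tau() >= node.radius:
                    visit(node.outside)
            else:
                visit(node.outside)
                if d - tau() <= node.radius:
                    visit(node.inside)

        visit(self)
        return best


tree = VPTree([(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (6.0, 5.0)])
closest = tree.nearest((0.9, 1.2), k=1)[0][1]  # the stored vector nearest the query
```

For a recommendation use case, the stored points would be item or user embedding vectors and `query` the vector being looked up.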
Model Deployment: Batch Inference with Spark
Provides a batch inference feature
through its “Context” for running local
inferences on data stored in your
Hadoop/Spark clusters, minimizing
data movement.
Model Deployment: Asynchronous (Message Queue/Webhook)
● State of the art model server can receive requests from Message Queues like Kafka or RabbitMQ
to provide high-throughput near-realtime predictions.
● A message queue is a form of asynchronous service-to-service communication.
● Messages are stored in the queue until they are processed and deleted.
● A message queue can be configured for use by
○ A new model server, for reading data and storing inference results
○ A notebook, to gather data from an incoming feedback queue and a new data queue
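The asynchronous pattern can be illustrated without a broker by using Python's standard `queue` and `threading` modules. In this sketch a worker thread plays the role of the model server: it reads requests from one queue and writes predictions to another. `fake_model`, `inference_worker` and the message layout are hypothetical; a real deployment would replace the in-process queues with Kafka or RabbitMQ clients.

```python
import queue
import threading


def fake_model(features):
    """Placeholder for a real model call: returns the mean of the inputs."""
    return sum(features) / len(features)


def inference_worker(request_queue, result_queue):
    """Consume requests until a None sentinel arrives, publishing one result each."""
    while True:
        message = request_queue.get()  # blocks until a message is available
        if message is None:
            request_queue.task_done()
            break
        request_id, features = message
        result_queue.put((request_id, fake_model(features)))
        request_queue.task_done()  # the message has now been processed


request_queue, result_queue = queue.Queue(), queue.Queue()
worker = threading.Thread(target=inference_worker,
                          args=(request_queue, result_queue))
worker.start()

# The producer does not wait for each prediction; it just enqueues requests.
for request_id, features in enumerate([[1.0, 3.0], [2.0, 2.0, 5.0], [10.0]]):
    request_queue.put((request_id, features))
request_queue.put(None)  # sentinel: no more work

request_queue.join()  # wait until every message has been processed
worker.join()
predictions = dict(result_queue.get() for _ in range(3))
```

Because producer and consumer only share the queues, either side can be scaled or restarted independently, which is the property the bullets above rely on.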
● Apache Kafka is a data streaming platform with a publish-subscribe messaging pattern.
● A topic is a queue of messages that is broken into partitions for speed, scalability and size.
[Figure: an Apache Kafka cluster holding one topic split into Partition 1 and Partition 2.]
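The relationship between a topic, its partitions and consumer offsets can be sketched with a small in-memory model. This is a conceptual illustration only and does not use or mirror the Kafka client API: the classes `Topic` and `Consumer` are hypothetical, and the CRC32-based key hashing is an assumption that merely stands in for Kafka's own partitioner.

```python
import zlib


class Topic:
    """A named log split into append-only partitions."""

    def __init__(self, name, num_partitions=2):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stable hash so the same key always lands in the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def publish(self, key, value):
        """Append a record and return its (partition, offset)."""
        index = self.partition_for(key)
        self.partitions[index].append((key, value))
        return index, len(self.partitions[index]) - 1


class Consumer:
    """Reads assigned partitions, remembering how far it has read in each."""

    def __init__(self, topic, assigned):
        self.topic = topic
        self.offsets = {index: 0 for index in assigned}

    def poll(self):
        """Return every record published since the previous poll."""
        records = []
        for index, offset in self.offsets.items():
            log = self.topic.partitions[index]
            records.extend(log[offset:])
            self.offsets[index] = len(log)
        return records


topic = Topic("inference-requests", num_partitions=2)
for i in range(6):
    topic.publish(f"website-{i % 3}", {"request": i})

# One consumer per partition, so the two can read in parallel.
consumers = [Consumer(topic, [0]), Consumer(topic, [1])]
received = [record for consumer in consumers for record in consumer.poll()]
```

Records with the same key keep their relative order because they share a partition; ordering across partitions is not guaranteed in this model.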
Model Deployment: Publish-Subscribe Model in Kafka
[Figure: two data streams through a topic in a Kafka cluster. Websites -> Model Servers: Website 1 and Website 2 publish as producers, and Model Server 1 and Model Server 2 subscribe as consumers. Model Servers -> Websites: Model Server 1 and Model Server 2 publish as producers, and Website 1 and Website 2 subscribe as consumers.]
Model Management
Inferences - Traditional Way
Manually invoking jobs and handling model deployment - making model management difficult.
Inferences - Key Component of Model Management
● Jobs: SKIL servers inside a tenant are triggered to run jobs/scripts with specific parameters on tenant resources.
● Model History Server: Keeps lists of models with performance results. APIs allow them to be compared to report the best models for deployment.
● Deployment Server: Handles deployment, scaling and versioning of models.
● Real-time feedback requests are stored back in the DB to monitor model performance on the real data stream.
● The best model on real data can be used with transfer learning to fine-tune the model with the latest data.
Inferences - Model Management
Inferences - Model Portfolio
Each portfolio comprises
● Deployed model
● Model versioning information
● Performance over time
● Log files
Benefits
● Compliant with GDPR
● Control the granularity of each portfolio
● Track concept drift
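A portfolio of this kind can be pictured as a small record per deployed model. The sketch below is an assumption-laden illustration: `ModelPortfolio`, its fields and the `drifting` rule are hypothetical, and a falling accuracy average is used only as a simple proxy signal for concept drift, not as a full drift detector.

```python
from dataclasses import dataclass, field


@dataclass
class ModelPortfolio:
    """Hypothetical per-model record: identity, version, metrics and logs."""
    model_name: str
    version: str
    accuracy_history: list = field(default_factory=list)
    log_lines: list = field(default_factory=list)

    def record(self, accuracy, note=""):
        """Store one accuracy measurement and a matching log line."""
        self.accuracy_history.append(accuracy)
        line = f"{self.model_name}:{self.version} accuracy={accuracy:.3f} {note}"
        self.log_lines.append(line.strip())

    def drifting(self, window=3, tolerance=0.05):
        """Flag when the recent average drops below the baseline by more than tolerance."""
        if len(self.accuracy_history) < 2 * window:
            return False  # not enough history to compare
        baseline = sum(self.accuracy_history[:window]) / window
        recent = sum(self.accuracy_history[-window:]) / window
        return baseline - recent > tolerance


portfolio = ModelPortfolio("churn-classifier", "v3")
for accuracy in [0.92, 0.91, 0.93, 0.90, 0.84, 0.80]:
    portfolio.record(accuracy)
needs_retraining = portfolio.drifting()
```

Keeping the history and logs next to the version information is what allows the granularity of each portfolio to be controlled and audited per model.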
Performance Optimization
Performance: Goal
To be the most flexible and the highest performance
model server available, while also being memory efficient,
allowing for higher model-to-server ratios.
Performance: Key for Big Data Clusters
● JavaCPP for memory management
● We have our own garbage collection for
CUDA and cuDNN as well (JIT collection on
GPU by tracking references via the JVM)
● If on a cluster, run everything as a Spark job
● Works with imported Keras models
● Runs a parameter server for gradient
sharing with near linear scaling
performance
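The idea of a parameter server that shares gradients between workers can be shown with a toy, synchronous example. This is not how Deeplearning4j implements it; the one-weight linear model, the `ParameterServer` class and the learning rate are assumptions picked so the arithmetic stays easy to follow.

```python
def local_gradient(weight, shard):
    """Gradient of mean squared error for y = weight * x on one worker's shard."""
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)


class ParameterServer:
    """Holds the shared weight and applies averaged worker gradients."""

    def __init__(self, weight=0.0, lr=0.01):
        self.weight = weight
        self.lr = lr

    def apply(self, gradients):
        """Average the workers' gradients and take one descent step."""
        self.weight -= self.lr * sum(gradients) / len(gradients)
        return self.weight


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # generated with weight 3.0
shards = [data[i::4] for i in range(4)]            # 4 workers, 2 samples each

server = ParameterServer()
for _ in range(200):
    # Each worker computes a gradient on its own shard of the data...
    gradients = [local_gradient(server.weight, shard) for shard in shards]
    # ...and the server combines them into a single update.
    server.apply(gradients)
```

In a real cluster the list comprehension would run on separate machines, so each step costs roughly one shard's worth of compute plus the communication needed to exchange gradients.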
Inferences: Server Performance
Python servers are bottlenecked by Python’s GIL and are essentially single-threaded. Many
implementations process requests one input at a time.
If you run multiple Python servers to overcome the GIL, you get uncoordinated and delayed response
times because the processes compete for CPU/GPU.
Inferences Topology
Assessing the performance of your production
cluster requires analyzing the entire topology.
Trade-offs and design decisions can impact
your latency and hardware requirements.
For example, deploying a simple neural network
can significantly impact your cost efficiency on
GPU hardware. Input data size can also add
significant network latency.
Components of Latency
● Input data source gathering
● Transform data into a suitable representation (all numerical representations) for scoring
● NDArray creation on GPUs or in memory
● Run the NDArray through the neural network (feedforward)
● Interpret output
Externalities not covered above include SSD vs. HDD, network overhead, network hardware,
virtualization vs. bare metal, Docker’s network host, additional load balancers...
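One way to see where latency goes is to time each of the components listed above separately. The sketch below does this with the standard library; `LatencyProfile`, the stage names and the toy `score` pipeline (which averages three numbers instead of running a neural network) are all hypothetical stand-ins.

```python
import time
from contextlib import contextmanager


class LatencyProfile:
    """Accumulates wall-clock seconds per named stage."""

    def __init__(self):
        self.stages = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.stages[name] = self.stages.get(name, 0.0) + elapsed

    def total(self):
        return sum(self.stages.values())


def score(raw_record, profile):
    """Toy scoring path with one timed block per latency component."""
    with profile.stage("gather"):
        fields = raw_record.split(",")            # input data gathering
    with profile.stage("transform"):
        values = [float(f) for f in fields]       # numerical representation
    with profile.stage("ndarray"):
        batch = [values]                          # stand-in for NDArray creation
    with profile.stage("feedforward"):
        outputs = [sum(row) / len(row) for row in batch]  # stand-in for the network
    with profile.stage("interpret"):
        label = "high" if outputs[0] > 0.5 else "low"
    return label


profile = LatencyProfile()
label = score("0.9,0.8,0.7", profile)
slowest_stage = max(profile.stages, key=profile.stages.get)
```

Timing the stages individually makes it easier to tell whether the model itself or the surrounding data handling dominates a request.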
Objective Oriented Infrastructure
The State of The Art Model
● Configure Models: TensorFlow, Deeplearning4j, Keras
● Train Models: GPU, CPU, Local, Distributed
● Deploy: Single Machine/Cluster, HTTP API
● Import Model: TensorFlow, Deeplearning4j, PyTorch, Caffe, Keras
● Record Feedback: Model History Server
Credit: McKinsey Global Institute
● Use Cases/Sources of value
● Data Ecosystem
● Techniques and Tools
● Workflow Integration
● Open culture and organization
Going up in Level: Components of AI
Credit: CBS Insights
● Use cases are what maps value of AI to
line of business
● Often well understood per vertical, but not
clear how to map to a specific company
● Companies often lack the data collection
needed to implement standard use cases
● Hard to map a use case onto an
implementation
Expectations on AI Use Cases
● Executives are unsure of the value of AI
● Often pre-digital transformation (scattered
IT infrastructure)
● Often expect ROI while allocating minimal
cost towards innovation
● Need education on even the most basic
applications of AI
Problems in the Industry Today for Laggards
Credit: CBS Insights
● Big focus on educating the market
(if vendor).
● Scaling requirements are just now being
understood.
● Often only developers making decisions
rather than line of business; leads to R&D
focus rather than business value.
● Still not enough developers for serving all
AI needs.
Problems in the Industry Today for Innovators
Credit: CBS Insights
Towards a more Integrated Approach
Through Gradual Adoption
● Minimize time to value through direct
integration in business processes (RPA).
● Manage models deployed from day 1 to
track ROI on experiments to minimize risk of
AI adoption and bound spending.
● Provide standardized tooling across an
organization to break down silos.
● Focus on continuous education of end users
and AI stakeholders for ever changing
market needs.
Goals
Credit: CBS Insights
USE CASE:
Building Complex Machine Learning Solution with RPA
USE CASE: Model Management
Building Complex Machine Learning Solutions with
Robotic Process Automation (RPA)
RPA Application
How AI Service System works with RPA
AI Service System
Thank You

Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 

World Artificial Intelligence Conference Shanghai 2018

  • 1. The Next Gen AI Infrastructure for the Public AI Cloud By Adam Gibson
  • 2. Our community software gets 160,000 downloads per month and is used by teams in half of the Fortune 500. About Skymind ● Builds AI infrastructure for operating models in production. ● Allows model access from cloud, server, desktop, and mobile, providing model tooling such as revision history and accuracy monitoring over time. ● Created the widely used open-source AI framework Deeplearning4j, powering AI for large enterprises globally, from banking to e-commerce. Products ● SKIL: ML and DL Model Server ● SKIL Discover: ML and DL Validation & Training Tool
  • 4. The Hierarchy of AI: Some of the companies that own core AI technologies. Fewer than 5% of businesses globally drive value from AI.
  • 6. Pyramid of AI: Some of the companies that own core AI technologies. Fewer than 5% of businesses globally drive value from AI.
  • 7. Level 4: Heard of AI ("What is A.I.?") ● AI is at most a buzzword. ● Lacks the basic infrastructure, such as basic IT infrastructure, needed to derive value from AI (pre-digital transformation). ● Executives still question the value of AI for their business and are often skeptical of its benefits. ● Wants to see benefits almost immediately, before any real investment.
  • 9. Level 3: Everything’s AI ● Has static rules in place. ● Has deployed dashboards and BI and calls it AI. ● Makes very little, if any, modern use of machine learning. ● If it has any machine learning at all, it is probably a checkbox rather than a source of value. ● May have a data scientist or two who lack the infrastructure to do the job well.
  • 11. Level 2: Adopted AI ● Captures value from machine learning. ● Produces models meaningful to the business. ● Has centralized infrastructure for analyzing data within a line of business. ● Has invested in AI but may not know the total return on investment. ● Often builds models and runs experiments without oversight from the business. ● Uses but does not build its own infrastructure. Credit: McKinsey Global Institute
  • 13. Level 1: Mastered AI ● Has its own AI tools written from scratch. ● Often has products powered by AI. ● Software is a core competency. ● Often has an AI R&D lab. ● Probably sells cloud infrastructure or dev tools. ● Often employs the vast majority of AI talent.
  • 14. Components to Build AI Infrastructure
  • 15. The Infrastructure: ML algorithms and infrastructure should go wherever the data and compute are. Platform-agnostic ● Public Cloud ● On-Prem ● Hybrid ● Embeddable ● Edge ● Configurable ● Auto-scaling ● Legacy Integration ● Multi-Cloud Flexibility
  • 16. Typical Development: Define Problem -> Acquire Data -> Transform Data -> Train Model -> Validate Model -> Repeat
  • 17. Data Storage ● As organizations prepare enterprise AI strategies and build the necessary infrastructure, storage must be a top priority. That includes ensuring the proper storage capacity, IOPS, and reliability to deal with the massive amounts of data required for effective AI. ● AI applications depend on source data, so an organization needs to know where the source data resides and how AI applications will use it. ● As databases grow over time, companies need to monitor capacity and plan for expansion as needed.
  • 18. Networking Infrastructure ● To provide the high efficiency at scale required to support AI, organizations will likely need to upgrade their networks. ● Scalability must be a high priority, and that will require high-bandwidth, low-latency, and creative architectures. ● Intent-based networks can anticipate network demands or security threats and react in real time.
  • 19. Data Processing ● A CPU-based environment can handle basic AI workloads, but deep learning involves multiple large data sets and deploying scalable neural network algorithms. For that, CPU-based computing might not be sufficient. ● Deploying GPUs enables organizations to optimize their data center infrastructure and gain power efficiency.
  • 20. Data Management and Governance ● Does the organization have the proper mechanisms in place to deliver data securely and efficiently to the users who need it? ● Data should be accessible from a variety of endpoints, including mobile devices via wireless networks. ● Data access controls must address privacy and security issues.
  • 22. Model Training. Main Steps ● Read data from the source. ● Analyze it with statistics and normalize it for neural network input. ● Train by sending input into the neural network and calculating how to update the network weights using the backpropagation algorithm. ● Repeat until the model makes no more improvements. Problems ● A model learns better with a large dataset. In the enterprise, this data sometimes doesn’t fit on a single machine.
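The main steps on slide 22 can be sketched in a few lines of plain Python. This is a minimal illustration, not Skymind's or Deeplearning4j's implementation: it assumes a single sigmoid unit (the one-layer special case of backpropagation), a tiny in-memory dataset, and hypothetical names such as `normalize`, `train`, and `patience`.

```python
import math
import random

def normalize(rows):
    """Standardize each feature column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)

def train(features, labels, lr=0.5, patience=5, max_epochs=1000, tol=1e-6):
    """Gradient descent that stops once the loss no longer improves."""
    rng = random.Random(0)
    weights = [rng.uniform(-0.1, 0.1) for _ in features[0]]
    bias = 0.0
    best_loss, stale = float("inf"), 0
    n = len(features)
    for _ in range(max_epochs):
        grad_w, grad_b, loss = [0.0] * len(weights), 0.0, 0.0
        for x, y in zip(features, labels):
            p = predict(weights, bias, x)          # forward pass
            loss -= y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12)
            err = p - y                            # gradient of loss w.r.t. pre-activation
            grad_w = [g + err * v for g, v in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
        loss /= n
        if best_loss - loss > tol:                 # still improving
            best_loss, stale = loss, 0
        else:                                      # "repeat until no more improvements"
            stale += 1
            if stale >= patience:
                break
    return weights, bias, best_loss

# Step 1: read data (hard-coded here as a stand-in for a real source).
raw = [[1.0, 200.0], [2.0, 180.0], [8.0, 40.0], [9.0, 20.0]]
labels = [0, 0, 1, 1]
# Step 2: normalize. Steps 3 and 4: train and repeat.
features = normalize(raw)
weights, bias, final_loss = train(features, labels)
```

Normalizing first matters here because the second raw feature is two orders of magnitude larger than the first; without it, a single learning rate would suit neither weight.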
  • 23. Model Training: Multi-Node Training Cluster (Scaled-Out Training Cluster). Architecture ● Any midrange VM or dedicated machine for ZooKeeper ● One or more multi-GPU systems (DGX class or similar) for SKIL ● Gluster/HDFS provides a global file system for data
  • 24. Model Training: Hybrid Cloud (GPU Training Cluster). Architecture ● GPU cluster (e.g., DGX-1 servers) ● The existing Hadoop cluster is used for ○ ETL (preparing data for training on GPU) or ○ batch inference for distributed scoring with trained models.
  • 25. Model Training: Multi Cluster (GPU Training Cluster, CPU Inference Cluster). Architecture ● Powerful GPU servers or a Spark cluster for training models. ● Separate (multiple) deployment-only clusters for production deployments of ML models as REST APIs.
  • 26. Model Training: Batch Training with Spark. The flow is largely divided into two stages: ● Scheduling: launch executors through the cluster manager ● Execution: manage executors to perform tasks
  • 27. Model Training: Batch Training with Spark AI
  • 28. Model Training: Work Distribution Across Executors
  • 29. Model Training: Single Machine vs Spark Cluster ● Total runtime on the cluster (including evaluation) was about 1.1 hours ● Linear scaling over dozens of nodes in the Spark cluster
  • 31. Model Deployment: The Applications REST API RPA Application
  • 33. Model Deployment: Deployments ● Manage model deployment through the API: inspecting, updating, and removing models and deployments. ● GET, POST, or DELETE /deployments: each deployment can be assigned an ID (e.g., “deploymentID”), and you can GET, POST, or DELETE by referencing this ID. ● GET, POST, or DELETE /models: each model can be assigned an ID (e.g., “modelID”), and you can GET, POST, or DELETE by referencing this ID.
  • 34. Model Deployment: Inference. Real-Time (REST Endpoint) ● Standard RESTful API. All requests and responses use the ubiquitous JSON format. Our model server also supports binary multi-part uploads of images in their compressed representation, minimizing network overhead. Transform Endpoints ● Allow previously defined transforms to be deployed for distribution in a microservice architecture (CSV or image only). The transform is exposed as its own independent endpoint. KNN Endpoints ● Support uploading a series of vectors and looking up their nearest neighbors for recommendation or clustering use cases. This is implemented efficiently with the VPTree data structure.
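The resource layout on slide 33 (GET, POST, or DELETE on /deployments and /models, optionally by ID) can be illustrated with an in-memory router. This is only a sketch of the pattern: the class name `ModelServerRegistry`, the status codes, and the payload fields are assumptions, not the actual SKIL API.

```python
import itertools

class ModelServerRegistry:
    """In-memory stand-in for a model server's deployment and model store."""

    def __init__(self):
        self._stores = {"deployments": {}, "models": {}}
        self._next_id = itertools.count(1)

    def handle(self, method, path, body=None):
        """Route a request such as ("GET", "/deployments/1") to a store."""
        parts = [p for p in path.split("/") if p]
        if not 1 <= len(parts) <= 2 or parts[0] not in self._stores:
            return 404, {"error": "unknown resource"}
        store = self._stores[parts[0]]
        item_id = parts[1] if len(parts) == 2 else None

        if method == "GET":
            if item_id is None:
                return 200, list(store.values())      # inspect all
            if item_id in store:
                return 200, store[item_id]            # inspect one by ID
            return 404, {"error": "not found"}

        if method == "POST":
            if item_id is None:                       # create with a new ID
                item_id, status = str(next(self._next_id)), 201
            elif item_id in store:                    # update an existing record
                status = 200
            else:
                return 404, {"error": "not found"}
            store[item_id] = {**(body or {}), "id": item_id}
            return status, store[item_id]

        if method == "DELETE":
            if item_id is None:
                return 405, {"error": "DELETE requires an ID"}
            if item_id in store:
                return 200, store.pop(item_id)        # remove by ID
            return 404, {"error": "not found"}

        return 405, {"error": "method not allowed"}

registry = ModelServerRegistry()
created = registry.handle("POST", "/deployments", {"name": "fraud-detection"})
```

A real server would put an HTTP layer in front of `handle` and persist the stores; the point here is only that both resources share the same ID-addressed verbs.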
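The KNN endpoint on slide 34 is described as using a VPTree (vantage-point tree). The following is a small, self-contained sketch of that data structure for illustration; it assumes Euclidean distance and in-memory tuples, and it is not the Deeplearning4j implementation.

```python
import heapq
import itertools
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class VPTree:
    """Vantage-point tree: each node splits points by distance to a vantage."""

    def __init__(self, points, dist=euclidean, seed=0):
        self.dist = dist
        self.root = self._build(list(points), random.Random(seed))

    def _build(self, points, rng):
        if not points:
            return None
        vantage = points.pop(rng.randrange(len(points)))
        if not points:
            return (vantage, 0.0, None, None)
        dists = [self.dist(vantage, p) for p in points]
        radius = sorted(dists)[len(dists) // 2]       # median distance
        inside = [p for p, d in zip(points, dists) if d < radius]
        outside = [p for p, d in zip(points, dists) if d >= radius]
        return (vantage, radius, self._build(inside, rng), self._build(outside, rng))

    def nearest(self, query, k=1):
        """Return the k nearest points as (distance, point), closest first."""
        best = []                    # max-heap on distance via negation
        tie = itertools.count()      # tie-breaker so points are never compared

        def tau():
            return -best[0][0] if len(best) == k else float("inf")

        def visit(node):
            if node is None:
                return
            vantage, radius, inside, outside = node
            d = self.dist(query, vantage)
            if len(best) < k:
                heapq.heappush(best, (-d, next(tie), vantage))
            elif d < tau():
                heapq.heapreplace(best, (-d, next(tie), vantage))
            # Search the more promising side first, then the other side only
            # if the current search ball could still cross the boundary.
            if d < radius:
                visit(inside)
                if d + tau() >= radius:
                    visit(outside)
            else:
                visit(outside)
                if d - tau() <= radius:
                    visit(inside)

        visit(self.root)
        return sorted(((-nd, p) for nd, _, p in best), key=lambda t: t[0])

rng = random.Random(1)
vectors = [(rng.random(), rng.random()) for _ in range(200)]
tree = VPTree(vectors)
neighbors = tree.nearest((0.5, 0.5), k=5)
```

The pruning relies only on the triangle inequality, so any true metric can be passed as `dist`.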
  • 35. Model Deployment: Batch Inference with Spark Provides a batch inference feature through its “Context” for running local inferences on data stored in your Hadoop/Spark clusters, minimizing data movement.
  • 36. Model Deployment: Batch Inference with Spark AI Layer
  • 37. Model Deployment: Asynchronous (Message Queue/Webhook) ● A state-of-the-art model server can receive requests from message queues such as Kafka or RabbitMQ to provide high-throughput, near-real-time predictions. ● A message queue provides asynchronous service-to-service communication. ● Messages are stored in the queue until they are processed and deleted. ● A message queue can be configured for use by ○ a new model server, for reading data and storing inference results ○ a notebook, to gather data from an incoming feedback queue and a new data queue
  • 38. Model Deployment: Asynchronous (Message Queue/Webhook) ● Apache Kafka is a data streaming platform with a publish-subscribe messaging pattern. ● A topic is a queue of messages that is broken into partitions for speed, scalability, and size. [Figure: Apache Kafka cluster with a topic split into Partition 1 and Partition 2]
  • 39. Model Deployment: Publish-Subscribe Model in Kafka [Figure, two panels. Data Streams (Websites -> Model Servers): Website 1, Website 2, ... publish to a topic in the Kafka cluster and Model Server 1, Model Server 2, ... subscribe. Data Streams (Model Servers -> Websites): Model Server 1, Model Server 2, ... publish to a topic in the Kafka cluster and Website 1, Website 2, ... subscribe.]
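The topic, partition, and publish-subscribe ideas on slides 37 to 39 can be mimicked in memory to show how the pieces relate. This sketch does not use a Kafka client: `Topic`, `Consumer`, the CRC32 key hashing, and the doubling "model" are all assumptions made for illustration, and records stay in an append-only log with per-consumer offsets rather than being deleted.

```python
import zlib

class Topic:
    """A named queue of messages split into partitions."""

    def __init__(self, name, num_partitions=2):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Same key -> same partition, so per-key ordering is preserved.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((key, value))
        return index

class Consumer:
    """Reads its assigned partitions and remembers how far it has read."""

    def __init__(self, topic, assigned):
        self.topic = topic
        self.offsets = {partition: 0 for partition in assigned}

    def poll(self):
        records = []
        for partition in list(self.offsets):
            log = self.topic.partitions[partition]
            records.extend(log[self.offsets[partition]:])
            self.offsets[partition] = len(log)
        return records

# Websites -> model servers: websites are the producers.
requests = Topic("requests", num_partitions=2)
for site, payload in [("website-1", 1.0), ("website-2", 2.0), ("website-1", 3.0)]:
    requests.publish(site, payload)

# Each model server consumes one partition of the request topic.
model_servers = [Consumer(requests, assigned=[0]), Consumer(requests, assigned=[1])]

# Model servers -> websites: model servers are now the producers.
predictions = Topic("predictions", num_partitions=2)
for server in model_servers:
    for key, value in server.poll():
        predictions.publish(key, value * 2)   # stand-in for real inference
```

Splitting partitions across consumers is what lets several model servers share one request stream without processing any message twice.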
  • 41. Inferences - Traditional Way Manually invoking jobs and handling model deployment - making model management difficult.
  • 42. Inferences - Key Component of Model Management ● Jobs: SKIL servers inside a tenant are triggered to run jobs/scripts with specific parameters on tenant resources. ● Model History Server: keeps lists of models with performance results. APIs allow them to be compared to report the best models for deployment. ● Deployment Server: handles deployment, scaling, and versioning of models. ● Real-time feedback requests are stored back in the DB to monitor model performance on the real data stream. ● The best model on real data can be used with transfer learning to fine-tune the model with the latest data.
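The model history idea on slide 42 (keep a list of models with performance results, then compare them to report the best one for deployment) reduces to a small comparison over stored records. The record fields and the `best_model` helper below are hypothetical and only illustrate the comparison step.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelRecord:
    model_id: str
    version: int
    accuracy: float   # e.g., measured on held-out data or live feedback

def best_model(history, min_accuracy=0.0):
    """Return the most accurate record, preferring newer versions on ties."""
    candidates = [r for r in history if r.accuracy >= min_accuracy]
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r.accuracy, r.version))

history = [
    ModelRecord("churn", version=1, accuracy=0.81),
    ModelRecord("churn", version=2, accuracy=0.88),
    ModelRecord("churn", version=3, accuracy=0.85),
]
to_deploy = best_model(history, min_accuracy=0.8)
```

The `min_accuracy` floor is one way to make sure nothing is promoted when every candidate is below an acceptable level.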
  • 43. Inferences - Model Management
  • 44. Inferences - Model Portfolio. Each portfolio comprises ● Deployed model ● Model versioning information ● Performance over time ● Log files. Benefits ● Compliant with GDPR ● Control the granularity of each portfolio ● Track concept drift
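Tracking performance over time, as listed on slide 44, is one simple way to notice concept drift. The sketch below assumes that ground-truth labels eventually arrive as feedback and uses a drop in rolling accuracy as a proxy for drift; the class name, window size, and tolerance are illustrative choices, not part of any product API.

```python
from collections import deque

class DriftMonitor:
    """Flags possible concept drift when rolling accuracy falls below baseline."""

    def __init__(self, baseline_accuracy, window=50, tolerance=0.1):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        self.outcomes = deque(maxlen=window)   # True where prediction was right

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def rolling_accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def drifted(self):
        # Only judge once the window is full, to avoid noisy early alarms.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.rolling_accuracy() < self.baseline - self.tolerance

monitor = DriftMonitor(baseline_accuracy=0.9, window=50, tolerance=0.1)
for _ in range(50):
    monitor.record(predicted=1, actual=1)      # model still matches reality
healthy = monitor.drifted()
for i in range(50):
    monitor.record(predicted=1, actual=i % 2)  # half of the predictions now miss
degraded = monitor.drifted()
```

Accuracy is only one possible signal; monitoring input distributions would catch drift before labels arrive, at the cost of more machinery.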
  • 46. Performance: Goal To be the most flexible and the highest performance model server available, while also being memory efficient, allowing for higher model-to-server ratios.
  • 47. Performance: Key for Big Data Clusters ● JavaCPP for memory management ● We have our own garbage collection for CUDA and cuDNN as well (JIT collection on GPU by tracking references via the JVM) ● If on a cluster, run everything as a Spark job ● Works with imported Keras models ● Runs a parameter server for gradient sharing with near-linear scaling performance
  • 48. Inferences: Server Performance. Python servers are bottlenecked by Python’s GIL and are essentially single-threaded. Many implementations process requests one input at a time.
  • 49. Inferences: Server Performance. If you run multiple Python servers to overcome the GIL, you get uncoordinated and delayed response times because the processes compete for CPU/GPU.
  • 50. Inferences Topology. Assessing the performance of your production cluster requires analyzing the entire topology. Trade-offs and design decisions can impact your latency and hardware requirements. For example, deploying a simple neural network can significantly impact your cost efficiency on GPU hardware. Also, input data size can add significant network latency.
  • 51. Components of Latency ● Input data source gathering ● Transform data into a suitable representation (all numerical representations) for scoring ● NDArray creation on GPUs or in memory ● Run the NDArray through the neural network (feedforward) ● Interpret output. Externalities not covered above include SSD vs. HDD, network overhead, network hardware, virtualization vs. bare metal, Docker’s network host, additional load balancers...
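The latency components on slide 51 can be measured separately by timing each stage of a request. The following sketch uses plain Python stand-ins for every stage; the stage functions, the `WEIGHTS` vector, and the CSV payload are hypothetical, and a real deployment would time the actual data source, transform, array creation, and network forward pass.

```python
import time

WEIGHTS = [0.4, -0.2, 0.1]   # hypothetical weights of a one-layer model

def gather(raw):             # input data source gathering
    return raw.strip().splitlines()

def transform(lines):        # convert text into numerical representation
    return [[float(v) for v in line.split(",")] for line in lines]

def to_array(rows):          # stand-in for NDArray creation
    return [tuple(row) for row in rows]

def forward(batch):          # stand-in for the feedforward pass
    return [sum(w * x for w, x in zip(WEIGHTS, row)) for row in batch]

def interpret(scores):       # turn raw scores into labels
    return ["positive" if s > 0 else "negative" for s in scores]

STAGES = [("gather", gather), ("transform", transform), ("ndarray", to_array),
          ("feedforward", forward), ("interpret", interpret)]

def timed_pipeline(stages, payload):
    """Run each stage in order and record its wall-clock time in seconds."""
    timings = {}
    for name, stage in stages:
        start = time.perf_counter()
        payload = stage(payload)
        timings[name] = time.perf_counter() - start
    return payload, timings

result, timings = timed_pipeline(STAGES, "1.0,2.0,3.0\n0.0,5.0,1.0\n")
slowest = max(timings, key=timings.get)
```

Breaking the total down this way shows whether the model itself or the surrounding data handling dominates a request, which is the kind of topology question slide 50 raises.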
  • 53. The State of the Art Model ● Configure Models: TensorFlow, Deeplearning4j, Keras ● Train Models: GPU, CPU, Local, Distributed ● Deploy: Single Machine/Cluster, HTTP API ● Import Model: TensorFlow, Deeplearning4j, PyTorch, Caffe, Keras ● Record Feedback: Model History Server
  • 54. Going up in Level: Components of AI ● Use Cases/Sources of value ● Data Ecosystem ● Techniques and Tools ● Workflow Integration ● Open culture and organization. Credit: McKinsey Global Institute
  • 55. Expectations on AI Use Cases ● Use cases are what map the value of AI to a line of business. ● Often well understood per vertical, but not clear how to map to a specific company. ● Companies often lack the data collection needed to implement standard use cases. ● Hard to map a use case onto an implementation. Credit: CBS Insights
  • 56. Problems in the Industry Today for Laggards ● Executives are not sure of the value of AI. ● Often pre-digital transformation (scattered IT infrastructure). ● Often expect ROI while allocating minimal spending to innovation. ● Need education on even the most basic applications of AI. Credit: CBS Insights
  • 57. Problems in the Industry Today for Innovators ● Big focus on educating the market (if a vendor). ● Scaling requirements are only now being understood. ● Often only developers make decisions rather than the line of business, which leads to an R&D focus rather than business value. ● Still not enough developers to serve all AI needs. Credit: CBS Insights
  • 58. Towards a more Integrated Approach Through Gradual Adoption
  • 59. Goals ● Minimize time to value through direct integration into business processes (RPA). ● Manage deployed models from day 1 to track ROI on experiments, minimizing the risk of AI adoption and bounding spending. ● Provide standardized tooling across an organization to break down silos. ● Focus on continuous education of end users and AI stakeholders for ever-changing market needs. Credit: CBS Insights
  • 60. USE CASE: Building Complex Machine Learning Solutions with RPA
  • 61. USE CASE: Model Management
  • 62. Building Complex Machine Learning Solutions with Robotic Process Automation (RPA) RPA Application RPA Application
  • 63. How AI Service System works with RPA AI Service System