This session is a continuation of “Automated Production Ready ML at Scale” from the last Spark + AI Summit Europe. In this session you will learn how H&M evolves its reference architecture covering the entire MLOps stack, addressing common challenges in AI and machine learning products such as development efficiency, end-to-end traceability, and speed to production.
1. Apply MLOps At Scale
Keven (Qi) Wang
Linkedin: https://www.linkedin.com/in/kevenqiwang/
Medium: https://medium.com/@kevenwang_33862
Lead AI Architect @ H&M
2. Agenda
AI journey @H&M
Quick facts and use cases
Reference Architecture gen1
ML process and ML training
Reference Architecture gen2
MLOps and Operationalize AI
5. Our Journey
2016 – Exploration: run initial PoCs; test AA appetite and applicability
2017 – Initiation: industrialize early use cases; define organization and capability needs; establish the IT / data environment
2018 – Establish the AA & AI function: roll out and hand over successful pilots; establish AA-WoW, team, and governance
2019 – AA Leader: an increasingly data- and algo-driven retail business; analytical support across the entire value chain; strong internal AA teams; engage in partnerships with strong AI players
2022 – AI Leader of the Fashion Industry: lead the frontier of AI at scale in delivering customer value; a global leader in developing talent pools and supporting AI hubs and networks; AI-powered tools and capabilities supporting core processes and business decisions in all functions; a world-leading ecosystem of cutting-edge AI partners
Today: algo library, IT platform, business impact
6. H&M use cases
Analytics and Data Platform supporting the whole value chain: Design / Buying, Production, Logistics, Sales, and Marketing.
Use cases: assortment quantification, fashion forecast, allocation, markdown (online and store), personalized promotions, recommendations & journeys, Movebox.
Enablers: knowledge & best practice, AI exploration and research, rapid dev enablement, AI platform.
7. AI @ H&M quick facts
▪ 100+ co-located FTEs and a growing number of colleagues and consultants
▪ 30+ different nationalities
▪ Combined teams spanning algo and cloud competences
▪ New ways of working: sprints, standups, product mgmt., epics
▪ Tooling: HAAL, Azure Databricks
10. ML Process and Tooling
Model training: data acquisition → data preparation → feature engineering → model training → model repository, driven by a training orchestrator.
Model deployment: unseen data acquisition → data preparation → transform data into features → model prediction → results, driven by a deployment orchestrator.
Cross-cutting: data storage (Data Lake Store), model and data versioning, an automated end-to-end feedback loop, and end-to-end monitoring.
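The "model and data versioning" concern above can be illustrated with a minimal sketch in plain Python (an illustration of the idea, not H&M's actual implementation): every registered model version is linked to a content hash of the data snapshot it was trained on, which is what makes end-to-end traceability from a prediction back to its training data possible.

```python
import hashlib

def fingerprint(payload: bytes) -> str:
    """Content hash used as an immutable version id."""
    return hashlib.sha256(payload).hexdigest()[:12]

def register_model(registry: dict, model_name: str,
                   model_bytes: bytes, training_data: bytes) -> str:
    """Store a model version linked to the exact data it was trained on."""
    version = fingerprint(model_bytes)
    registry[f"{model_name}:{version}"] = {
        "model_version": version,
        "data_version": fingerprint(training_data),
    }
    return version

# Usage: given any served model version, the data version is recoverable
registry = {}
v = register_model(registry, "markdown", b"weights-v1", b"sales-snapshot-1")
print(registry[f"markdown:{v}"]["data_version"])
```

In a real stack the registry would be a model repository service rather than a dict, but the traceability contract is the same.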
11. Interactive model development
Flow across PyCharm, the CI orchestrator, Azure Databricks, the model repository, the container registry, and Kubernetes:
1. Code commit from PyCharm
2. CI orchestrator runs static code checks and unit tests, then packages the code
3.1 Push the package to DBFS; 3.2 trigger the training pipeline
4.1 Job execution on Azure Databricks; 4.2 log model info; 4.3 commit the model to the model repository
5.1 Fetch the model; 5.2 build the container image
6. Push the image to the container registry
7. Auto deploy to Kubernetes
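Step 3.2 (trigger pipeline) can be sketched against the Databricks Jobs REST API run-now endpoint; the host, token, job id, and notebook parameters below are placeholders, and error handling is omitted:

```python
import json
import urllib.request

def build_run_now_request(host: str, token: str, job_id: int,
                          params: dict) -> urllib.request.Request:
    """Build the run-now call the CI orchestrator would issue."""
    body = json.dumps({"job_id": job_id, "notebook_params": params}).encode()
    return urllib.request.Request(
        url=f"{host}/api/2.0/jobs/run-now",
        data=body,
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical usage (no request is actually sent here):
req = build_run_now_request("https://adb-example.azuredatabricks.net",
                            "my-token", 42, {"git_sha": "abc123"})
# urllib.request.urlopen(req) would start the training job
```

Passing the commit SHA as a job parameter is one way to keep the Databricks run traceable back to the exact code version packaged in step 2.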
12. Automated model training pipeline 1
The scenario set enumerates scenarios i, each defined by a geo location l_i, a product type p_i, and a time t_i. Every scenario runs its own pipeline (source data → prep data → feature engineering → train → optimize) on a dedicated Databricks cluster, orchestrated from a VM or container.
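The scenario fan-out above can be sketched in plain Python: the scenario set is the cross product of geo locations, product types, and time periods, and each scenario runs the same step sequence independently (names here are illustrative, not the production code):

```python
from itertools import product

PIPELINE_STEPS = ["source_data", "prep_data", "feature_engineering",
                  "train", "optimize"]

def build_scenario_set(geo_locations, product_types, times):
    """One scenario per (geo, product, time) combination."""
    return [{"geo": g, "product": p, "time": t}
            for g, p, t in product(geo_locations, product_types, times)]

def run_scenario(scenario, run_step):
    """Run every pipeline step for one scenario, e.g. on its own cluster."""
    return [run_step(scenario, step) for step in PIPELINE_STEPS]

scenarios = build_scenario_set(["l1", "l2"], ["p1"], ["t1"])
print(len(scenarios))  # 2 independent scenario pipelines
```

Because the scenarios share no state, each one can be dispatched to its own Databricks cluster and run in parallel.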
13. Automated model training pipeline 2
The scenario set fans out into one scenario task per scenario; each task runs source data → prep data → feature engineering → train → optimize on its own Databricks cluster. The tasks are wired into an Airflow DAG running on Azure Kubernetes Service:
▪ The Airflow webserver and scheduler run as Kubernetes pods, with images pulled from the container registry
▪ Airflow DAGs and logs live on a persistent volume backed by an Azure File share
▪ Pipeline state is stored in the Airflow MetaDB
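The DAG shape above can be sketched as data. In the real setup each task would be an Airflow operator submitting work to a Databricks cluster, but the fan-out/chain wiring is just this (illustrative, not the production DAG):

```python
STEPS = ["source_data", "prep_data", "feature_engineering", "train", "optimize"]

def build_dag_edges(scenario_ids):
    """Return (upstream, downstream) edges: a scenario_set root task
    fans out into one linear step chain per scenario."""
    edges = []
    for sid in scenario_ids:
        chain = [f"{sid}.{step}" for step in STEPS]
        edges.append(("scenario_set", chain[0]))  # fan-out from the root
        edges.extend(zip(chain, chain[1:]))       # linear chain per scenario
    return edges

edges = build_dag_edges(["scenario_1", "scenario_2"])
print(len(edges))  # 2 scenarios x (1 root edge + 4 chain edges) = 10
```

In an Airflow DAG file each edge would be expressed with the `>>` dependency operator between the corresponding tasks.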
14. Trick for the Airflow dependency challenge
Little trick: pass python_callable a wrapper that calls the actual Python method without importing its module when the DAG file is parsed.
For more detail, check this blog post:
https://medium.com/@kevenwang_33862/machine-learning-in-production-2-large-scale-ml-training-889cde94f26d
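The trick can be sketched with importlib: the DAG file gives python_callable a thin wrapper, so the heavy module is imported only when the task actually runs, not every time the scheduler parses the DAG (a minimal sketch; the blog post has the full version):

```python
import importlib

def lazy_callable(dotted_path: str):
    """Return a callable that defers the import to call time, e.g.
    PythonOperator(task_id="train", python_callable=lazy_callable("pkg.train.run")).
    """
    module_path, func_name = dotted_path.rsplit(".", 1)

    def _call(*args, **kwargs):
        module = importlib.import_module(module_path)  # import happens here
        return getattr(module, func_name)(*args, **kwargs)

    return _call

# Usage: nothing is imported until the wrapper is invoked
sqrt = lazy_callable("math.sqrt")
print(sqrt(9))  # 3.0
```

This keeps heavy ML dependencies out of the scheduler's DAG-parsing loop; only the worker that executes the task needs them installed.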
15. Evolve to scale and industrialize across H&M
▪ Make AI available for product teams across H&M Group
▪ Facilitate scalability and specialization
▪ Continue to build world-class AI products, engines, and core components
▪ The value has been proven use case by use case; to reach the next level, we need to industrialize and scale AI across H&M
19. Model development – Interactive vs. Automated
▪ AI product lifecycle
▪ Notebook and Python modules
▪ Container as first class citizen
▪ Airflow vs. Kubeflow
20. Model serving – deployment strategies
Release strategies:
▪ Single model: router → Model 1.1
▪ Canary: the router splits live traffic between Model 1.1 and Model 1.2
▪ Shadow: the router serves Model 1.1 and mirrors traffic to Model 1.2 without exposing its output
Experiment strategies:
▪ A/B test: the router splits traffic across Model A1, Model A2, and Model A3
▪ Multi-armed bandit: the router allocates traffic across Model A1, Model A2, and Model A3, driven by a reward system
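The canary and multi-armed-bandit routers can be sketched as follows (illustrative; the 10% canary fraction and the epsilon-greedy policy are assumptions, not the production routing logic):

```python
import random

def canary_route(models, canary_fraction=0.1, rng=random.random):
    """Send a small fraction of traffic to the new (canary) model."""
    stable, canary = models
    return canary if rng() < canary_fraction else stable

def bandit_route(rewards, epsilon=0.1, rng=random.random):
    """Epsilon-greedy bandit: mostly exploit the best-rewarded model,
    occasionally explore. `rewards` maps model name -> mean reward."""
    if rng() < epsilon:
        return random.choice(list(rewards))
    return max(rewards, key=rewards.get)

# Usage: the reward system keeps `rewards` updated from live feedback
rewards = {"A1": 0.62, "A2": 0.71, "A3": 0.55}
print(bandit_route(rewards, epsilon=0.0))  # A2
```

The key difference: canary and A/B splits are fixed by configuration, while the bandit's split shifts automatically as the reward system observes outcomes.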
21. Model serving – Inference Graph
Components composed into a single serving graph: an input transformer, Router 1 (multi-armed bandit), Router 2 (A/B test), models A1–A3 and B1–B2, and an output transformer.
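Such a graph is essentially function composition: transformers and routers chained into one serving call. A minimal sketch with made-up transformer and model functions:

```python
def make_inference_graph(input_tf, router, output_tf):
    """Compose graph nodes into a single callable endpoint."""
    def serve(raw_request):
        features = input_tf(raw_request)   # input transformer
        model = router(features)           # e.g. bandit or A/B choice
        prediction = model(features)
        return output_tf(prediction)       # output transformer
    return serve

# Hypothetical nodes: parse strings to floats, route to one dummy model
graph = make_inference_graph(
    input_tf=lambda req: [float(x) for x in req["values"]],
    router=lambda feats: (lambda f: sum(f)),
    output_tf=lambda pred: {"prediction": pred},
)
print(graph({"values": ["1", "2"]}))  # {'prediction': 3.0}
```

Nesting a second router inside the first node is how a bandit over {A models, A/B-tested B models} would be expressed in this composition style.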
22. Model management and lifecycle
Stages: model development → back test → model approval → staging → production
Pipelines per stage: PR pipeline, training CI pipeline, back-test pipeline, CD staging pipeline, CD production pipeline
Flow: develop a feature → open a pull request → the CI/CD pipeline takes over
Infrastructure as code in every environment: #dev, #stage, #prod
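The model approval step between back test and staging can be sketched as a simple gate: a candidate is promoted only if its back-test metric beats the current production model by some margin (the metric direction and the 0.01 uplift threshold are illustrative assumptions):

```python
def approve_for_staging(candidate_score: float, production_score: float,
                        min_uplift: float = 0.01) -> bool:
    """Gate evaluated by the back-test pipeline before the CD
    staging pipeline is allowed to deploy the candidate."""
    return candidate_score >= production_score + min_uplift

# Usage: scores come from back-testing both models on the same held-out data
print(approve_for_staging(0.83, 0.80))  # True
print(approve_for_staging(0.80, 0.80))  # False
```

Encoding the gate in the pipeline, rather than leaving it as a manual review, is what keeps promotion to staging and production reproducible.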
23. Takeaways
▪ Problem, process, and architecture
▪ Platform approach
▪ Leverage cloud-native services