6. Data Pipeline For Machine Learning
ETL
Data Exploration
Join / Sampling / Feature Extraction
Split into train and test datasets, etc.
Model Training
Model Saving, Versioning, etc.
Model Deployment (Online Serving)
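The stages above can be sketched end to end in a few lines of plain Python. This is only an illustrative toy, not Submarine code: the synthetic data, the least-squares model, and every function name (`extract`, `split`, `train`, `evaluate`, `save`, `run_pipeline`) are assumptions made for the example.

```python
import json
import random
import tempfile
from pathlib import Path


def extract(n=200, seed=0):
    """ETL stand-in: synthesize (x, y) rows where y is roughly 2*x + 1."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        rows.append((x, 2.0 * x + 1.0 + rng.gauss(0, 0.1)))
    return rows


def split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into train and test sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


def train(rows):
    """Model training: fit y = w*x + b by ordinary least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    var = sum((x - mean_x) ** 2 for x, _ in rows)
    w = cov / var
    return {"w": w, "b": mean_y - w * mean_x}


def evaluate(model, rows):
    """Mean squared error on held-out rows."""
    return sum((model["w"] * x + model["b"] - y) ** 2 for x, y in rows) / len(rows)


def save(model, directory, version):
    """Model saving and versioning: write the model under a versioned file name."""
    path = Path(directory) / f"model-v{version}.json"
    path.write_text(json.dumps(model))
    return path


def run_pipeline(directory, version=1):
    """Run every stage in order and return the saved model path and test error."""
    train_rows, test_rows = split(extract())
    model = train(train_rows)
    mse = evaluate(model, test_rows)
    return save(model, directory, version), mse


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path, mse = run_pipeline(tmp)
        print(path.name, mse)
```

In a real deployment each function would be a separate job (Spark/Hive for ETL, a distributed trainer for `train`, a model registry for `save`); the point here is only the ordering of the stages.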
7. Data Scientist
• Expert in ML algorithms, models, libraries, and feature engineering.
• Needs tools and platforms to gain insight into data, build models productively, and create ML pipelines (data labeling, transformation, etc.).
• Mostly familiar with Python, Spark, Hive, etc.
• Not familiar with platform internals.
8. What Do Data Scientists Expect? (Cont)
Model Exploration
• Pre-process data using Spark/Hive (or small-scale alternatives).
• Experiment on a sampled dataset with notebooks (single node).
• Experiment on the full dataset to get the best results (distributed).
Reproducible Experiment
• Record the parameters, code, and metrics of each experiment.
• Dependency management: code once, run everywhere.
• Easy parameter fine-tuning, AutoML.
Model Management
• Easy to manage models and push them to production.
• Model assurance and monitoring.
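The "record parameters, code, metrics" requirement for reproducible experiments can be illustrated with a minimal sketch. This is not Submarine's API: `record_run`, `best_run`, and the JSON-lines log format are hypothetical names and choices made for this example, using only the Python standard library.

```python
import hashlib
import json
import time
from pathlib import Path


def record_run(log_path, params, metrics, code_path=None):
    """Append one experiment record (parameters, code hash, metrics) as a JSON line."""
    record = {
        "timestamp": time.time(),
        "params": params,
        "metrics": metrics,
        "code_sha256": None,
    }
    if code_path is not None:
        # Hash the training script so the exact code version can be identified later.
        record["code_sha256"] = hashlib.sha256(Path(code_path).read_bytes()).hexdigest()
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def best_run(log_path, metric, higher_is_better=True):
    """Return the logged record with the best value for the given metric."""
    with open(log_path, encoding="utf-8") as f:
        runs = [json.loads(line) for line in f if line.strip()]
    pick = max if higher_is_better else min
    return pick(runs, key=lambda r: r["metrics"][metric])
```

A training script would call `record_run("runs.jsonl", {"lr": 0.01}, {"accuracy": 0.93}, code_path=__file__)` after each experiment, then `best_run("runs.jsonl", "accuracy")` to find the parameters worth promoting. An append-only log keeps earlier runs intact, which is what makes comparison across experiments possible.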
9. What Do Data Scientists NOT Expect to Know?
• Deep understanding of resource management concepts in YARN/Kubernetes (how the capacity scheduler works, k8s operations, etc.).
• Compute engine tuning (memory configuration, shuffle performance).
• Nitty-gritty details of the underlying infrastructure; it should just work.
21. Current State
Submarine v0.1 released after Apache Hadoop v3.2.0
Submarine v0.2 released
PyTorch support
Thin and uber jar
Added LinkedIn TonY execution runtime (Hadoop 2.7.3+ compatibility)
Zeppelin Submarine interpreter
Mini-submarine (All-In-One image)
22. Future
• We are working with the Hadoop/Apache community to spin off Submarine into a new Apache project.
• Some more features we are working on:
Task | JIRA | Target Version
Submarine Workbench (Web/Server) | SUBMARINE-98/SUBMARINE-131 | 0.3.0
Submarine Kubernetes Support | SUBMARINE-154 | 0.3.0
Metrics Support (Like MLflow) | TBD | TBD
24. Development Team
Community | Members
Hadoop Community | PMC & Committer
Zeppelin Community | PMC & Committer
Cloudera | Wangda Tan, Zhankun Tang, Sunil Govind …
NetEase | Xun Liu, Quan Zhou
LinkedIn | Keqiu Hu
Alibaba | Jeff Zhang
Ke.com | Guoxian Zhao, Feng Liu, Huiyang Jian
JD | Wanqiang Ji
Dahua | Linhao Zhu
25. Community Use Cases
NetEase
• One of the largest online game/news/music providers in China.
• A 245-GPU cluster runs Submarine.
• One of the models built is a music recommendation model, which is invoked 1B+ times per day.
LinkedIn
• 250+ GPU machines.
• 500+ TensorFlow training jobs per day.
• Serves applications in recommendation systems and NLP.
• Collaboration on runtime and SDK development.
Ke.com
• Largest online real-estate brokerage website in China.
• 50+ GPU machines (including 19 multi-V100 GPU machines), based on Hadoop trunk (3.3.0).
• Serves applications such as image and voice recognition.
26. Thank you!
Please join the community!
Website:
https://hadoop.apache.org/submarine/
Weekly Community Meeting:
https://docs.google.com/document/d/1XkrcyVil_ORV1UP-JhosGzK8qWGXXX3wuplo4RtC7u0/edit
Code:
https://github.com/apache/hadoop