Tianqi holds a bachelor’s degree in Computer Science from Shanghai Jiao Tong University, where he was a member of the ACM Class, now part of Zhiyuan College at SJTU. He earned his master’s degree at Shanghai Jiao Tong University in the Apex Data and Knowledge Management Lab before joining the University of Washington as a PhD student. He has held several prestigious internships and visiting positions: at Google on the Brain Team, at GraphLab authoring the boosted tree and neural net toolkit, at Microsoft Research Asia in the Machine Learning Group, and at the Digital Enterprise Research Institute in Galway, Ireland. What really excites Tianqi is what processes and goals can be enabled when we bring advanced learning techniques and systems together. He pushes the envelope on deep learning, knowledge transfer, and lifelong learning. His PhD is supported by a Google PhD Fellowship.
Abstract
Building Scalable and Modular Learning Systems:
Machine learning and data-driven approaches are becoming very important in many areas. One key factor drives these successful applications: scalable learning systems that learn the model of interest from large datasets. More importantly, these systems need to be designed in a modular way, so that they work with the existing ecosystem and improve users’ productivity. In this talk, I will present XGBoost and MXNet, two scalable and portable learning systems that I built. I will discuss how we can apply distributed computing, asynchronous scheduling, and hardware acceleration to improve these systems, as well as how they fit into the bigger open-source ecosystem of machine learning.
4. A Method to Solve Half of the Problems
[Figure: an ensemble of two regression trees. Tree 1 splits on "age < 15" (Y/N), then "is male?" (Y/N), giving leaf scores +2, +0.1, and -1. Tree 2 splits on "use computer daily?" (Y/N), giving leaf scores +0.9 and -0.9. The prediction sums leaf scores across trees: f(boy) = 2 + 0.9 = 2.9, f(grandpa) = -1 - 0.9 = -1.9.]
Tree Boosting (Friedman, 1999)
Used by 17 out of 29 Kaggle winners last year, and more; the winning solutions for all the problems on the last slide use tree boosting.
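To make the additive-tree picture above concrete, here is a minimal XGBoost sketch; the synthetic data and parameter values are illustrative assumptions, not numbers from the talk.

# Minimal gradient tree boosting with XGBoost (illustrative settings).
import numpy as np
import xgboost as xgb

X = np.random.rand(1000, 10)               # synthetic features
y = (X[:, 0] + X[:, 1] > 1).astype(int)    # synthetic binary labels

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "max_depth": 4, "eta": 0.3}
booster = xgb.train(params, dtrain, num_boost_round=50)  # 50 additive trees
preds = booster.predict(dtrain)            # each prediction sums leaf scores across trees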
7. Fast Histogram-based Trees
• Bring techniques of recent improvements in histogram based
tree construction to XGBoost
• FastBDT (Thomas Keck), LightGBM (Ke et.al)
• Optimized for both categorical and continuous features.
Contributed by Hyunsu Cho,
University of Washington
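The histogram algorithm is selected through XGBoost's tree_method parameter; a hedged sketch, reusing the dtrain from the earlier example (the max_bin value is illustrative):

# Enable fast histogram-based tree construction.
params = {"objective": "binary:logistic",
          "tree_method": "hist",   # histogram-based split finding
          "max_bin": 256}          # number of histogram bins (illustrative)
booster = xgb.train(params, dtrain, num_boost_round=50)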
8. GPU-based Optimization
• Run each boosting iteration on the GPU
• Uses fast parallel prefix sum / radix sort operations
• Available now in XGBoost

Dataset     i7-6700K (s)   Titan X (s)   Speedup
Yahoo LTR   3738           507           7.37
Higgs       31352          4173          7.51
Bosch       9460           1009          9.38

Contributed by Rory Mitchell, University of Waikato
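The GPU implementation is likewise exposed via tree_method in the XGBoost versions of this era; a hedged sketch, again reusing dtrain (a CUDA-capable device is assumed):

# Run each boosting iteration on the GPU.
params = {"objective": "binary:logistic",
          "tree_method": "gpu_hist"}  # GPU histogram algorithm
booster = xgb.train(params, dtrain, num_boost_round=50)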
9. Modularity: Platform-Agnostic Engine
In any language, on any platform
• YARN, MPI, Flink, Spark, ...
• Easily extensible to other cloud dataflow engines
12. Declarative vs. Imperative Programs
• Declarative graphs are easy to store, port, and optimize
• Theano, TensorFlow
• Imperative programs are flexible but hard to optimize
• PyTorch, Chainer, NumPy
13. MXNet’s Approach: Mixed Programming

Imperative NDArray API:
>>> import mxnet as mx
>>> a = mx.nd.zeros((100, 50))
>>> a.shape
(100L, 50L)
>>> b = mx.nd.ones((100, 50))
>>> c = a + b
>>> b += c

Declarative API:
>>> import mxnet as mx
>>> net = mx.symbol.Variable('data')
>>> net = mx.symbol.FullyConnected(data=net, num_hidden=128)
>>> net = mx.symbol.SoftmaxOutput(data=net)
>>> type(net)
<class 'mxnet.symbol.Symbol'>
>>> texec = net.simple_bind(data=data_shape)
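A hedged sketch of how the two styles mix in practice: bind the declarative graph to concrete shapes, then drive it from an imperative loop (the context, batch shape, and loop below are illustrative assumptions):

# Mixed programming: a declarative graph driven by imperative code.
texec = net.simple_bind(ctx=mx.cpu(), data=(64, 100))  # compile for a fixed shape
for _ in range(10):
    texec.forward()               # run the optimized symbolic graph
    out = texec.outputs[0]        # outputs are ordinary NDArrays,
    print(out.asnumpy().shape)    #   usable by imperative code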
15. Need for Parallelism
• Speed is critical to deep learning
• Parallelism leads to higher performance:
• Parallelization across multiple GPUs
• Parallel execution of small kernels
• Overlapping memory/network transfers with computation
• …
17. Solution: Auto-Parallelization with a Dependency Engine
• Single-thread abstraction of a parallel environment
• Works for both symbolic and imperative programs (a toy sketch follows below)
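To make the abstraction concrete, here is a toy Python sketch of such an engine (my own illustration, not MXNet's actual C++ engine): each pushed operation declares the variables it reads and writes, and the engine orders only operations that touch the same variable, so independent operations run in parallel.

# Toy dependency engine (illustration only).
from concurrent.futures import ThreadPoolExecutor

class ToyDependencyEngine:
    def __init__(self, workers=4):
        self.pool = ThreadPoolExecutor(workers)
        self.last_write = {}  # var -> future of the last op that wrote it
        self.readers = {}     # var -> futures of reads since that write

    def push(self, fn, reads=(), writes=()):
        deps = []
        for v in reads:                        # read-after-write ordering
            if v in self.last_write:
                deps.append(self.last_write[v])
        for v in writes:                       # write-after-write and write-after-read
            if v in self.last_write:
                deps.append(self.last_write[v])
            deps.extend(self.readers.get(v, ()))
        def task():
            for d in deps:                     # block until all dependencies finish
                d.result()
            fn()
        fut = self.pool.submit(task)
        for v in reads:
            self.readers.setdefault(v, []).append(fut)
        for v in writes:
            self.last_write[v] = fut
            self.readers[v] = []
        return fut

eng = ToyDependencyEngine()
eng.push(lambda: print("load a"), writes=["a"])
eng.push(lambda: print("b = a + 1"), reads=["a"], writes=["b"])
eng.push(lambda: print("c = a * 2"), reads=["a"], writes=["c"])  # can run in parallel with b
eng.pool.shutdown(wait=True)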
18. Scaling Up to 256 AWS GPUs
• Weak scaling (fixed batch size per GPU)
• Different optimal hyperparameters must be tuned as GPUs are added:
• Larger learning rate
• Stronger noise augmentation
• Bias-variance trade-off
https://github.com/dmlc/mxnet/tree/master/example/image-classification#scalability-results
Adopted as AWS’s deep learning framework of choice
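A hedged sketch of the weak-scaling bookkeeping described above (the per-GPU batch size, base rate, and linear-scaling heuristic are my illustrative assumptions, not the exact recipe from the experiments):

# Weak scaling: per-GPU batch is fixed, so the global batch grows with
# the GPU count, and the learning rate is scaled up to compensate.
def scaled_hyperparams(num_gpus, per_gpu_batch=32, base_lr=0.1):
    global_batch = per_gpu_batch * num_gpus  # grows linearly with GPUs
    lr = base_lr * num_gpus                  # linear LR scaling heuristic
    return global_batch, lr

print(scaled_hyperparams(256))  # -> (8192, 25.6)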
19. Scaling Up Is Good; How About Big Models?
Many models are bounded by memory
21. Trade Computation for Memory
• Recompute activations instead of saving them
• Training Deep Nets with Sublinear Memory Cost, Chen et al., arXiv:1604.06174
• Memory-Efficient Backpropagation Through Time, Gruslys et al., arXiv:1606.03401
22. O(sqrt(N)) Memory Cost with 25% Overhead
ImageNet ResNet configurations
Train bigger models on a single GPU
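A minimal sketch of the recomputation idea under simplifying assumptions (my illustration, not MXNet's memory planner): split an N-layer network into about sqrt(N) segments, store only segment-boundary activations in the forward pass, and rebuild a segment's activations from its checkpoint when the backward pass reaches it.

# Sublinear-memory forward pass: keep only ~sqrt(N) checkpoint activations.
import math

def forward_with_checkpoints(layers, x):
    seg = max(1, int(math.sqrt(len(layers))))  # segment length ~ sqrt(N)
    checkpoints = [(0, x)]                     # (layer index, input activation)
    for i, layer in enumerate(layers):
        if i > 0 and i % seg == 0:
            checkpoints.append((i, x))         # boundary activations only
        x = layer(x)
    return x, checkpoints

def recompute_segment(layers, checkpoints, target):
    # Backward pass: rebuild activations up to layer `target` from the
    # nearest earlier checkpoint instead of having stored them all.
    start, x = max((c for c in checkpoints if c[0] <= target),
                   key=lambda c: c[0])
    acts = []
    for i in range(start, target + 1):
        x = layers[i](x)
        acts.append(x)
    return acts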
24. Deep Learning Systems Will Become More Heterogeneous
[Diagram: a common front-end and computation graph definition (with gradient and execution) shared across back-ends: a mobile System A with its own operators and a graph without gradients, a System B built from code generators with its own operators, and a System C with its own operators.]
• Systems are becoming more heterogeneous
• Different systems are needed for specific cases (with common modules)
25. Unix Philosophy vs. Monolithic System
• Monolithic: build one system that solves everything
• Unix philosophy: build modules that each solve one thing well and work with the other pieces
26. NNVM: High-Level Graph Optimization for Deep Learning
• Allows different front-ends and back-ends
• Allows extensive optimizations:
• Memory reuse
• Runtime kernel fusion
• Automatic tensor partitioning and placement
• …
Lightweight
27. The Challenge for IR of Deep Learning Systems
[Diagram: operators (Conv, ReLU, BatchNorm) expose a set of common attributes (FGradient, FInferShape, FCodeGen); optimization passes (symbolic differentiation, shape inference, code generation) consume only those attributes. The IR must support both the need for adding new operators and the need for adding new optimizations.]
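To illustrate the design that resolves this, here is a toy Python sketch of an operator-attribute registry (my illustration; NNVM's real implementation is C++): operators register named attributes such as FInferShape and FGradient, and each optimization pass consumes only the attributes it needs, so new operators and new passes can be added independently.

# Toy operator-attribute registry.
OP_ATTRS = {}  # operator name -> {attribute name -> implementation}

def register_op(name, **attrs):
    OP_ATTRS.setdefault(name, {}).update(attrs)

# An operator registers only the attributes it supports.
register_op("relu",
            FInferShape=lambda in_shapes: in_shapes,  # output shape = input shape
            FGradient=lambda out_grad: [out_grad])    # simplified gradient rule

# A pass depends on a single attribute, not on the operator set.
def infer_shapes(op_sequence, input_shape):
    shape = input_shape
    for op in op_sequence:
        shape = OP_ATTRS[op]["FInferShape"]([shape])[0]
    return shape

print(infer_shapes(["relu", "relu"], (64, 128)))  # -> (64, 128)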
28. The Challenge for IR of Deep Learning Systems

Comparison:
System                              Add New Operator             Add New Optimization Pass
Most DL systems (e.g. old MXNet)    Easy                         Fixed set of optimization passes
LLVM                                Fixed set of primitive ops   Easy
NNVM                                Easy                         Easy

• Ease of adding a new operator or optimization pass without changing the core interface
• A fixed interface is useful for decentralization:
• New optimizations are directly usable by other projects, without pushing back to a centralized repo
• Irrelevant passes are easy to remove
32. MXNet and XGBoost are developed by over 100 collaborators
Special thanks to
Tianqi Chen (UW), Mu Li (CMU/Amazon), Bing Xu (Turi), Chiyuan Zhang (MIT), Junyuan Xie (UW), Yizhi Liu (MediaV), Tianjun Xiao (Microsoft), Yutian Li (Stanford), Yuan Tang (Uptake), Qian Kou (Indiana University), Hu Shiwen (Shanghai), Chuntao Hong (Microsoft), Min Lin (Qihoo 360), Naiyan Wang (TuSimple), Tong He (Simon Fraser University), Minjie Wang (NYU), Valentin Churavy (OIST), Ali Farhadi (UW/AI2), Carlos Guestrin (UW/Turi), Alexander Smola (CMU/Amazon), Zheng Zhang (NYU Shanghai)