1. Scalable Machine/Deep Learning with
Apache SystemML on Power
Berthold Reinwald
reinwald@us.ibm.com
IBM Research – Almaden
San Jose, CA
Nov. 17th, 2017
2. Agenda
Use cases
What is Apache SystemML
Demos on Power
– Handwritten Digits Image Classification
– Medical Image Segmentation
Inside SystemML
– Compiler, optimizer, and runtime
– Advanced Features
5. Enterprise Use cases for Scalable Machine Learning
Insurance
Problem Description
– Find the optimal subset of features that leads to the best regression model
Problem Size
– 1.1M observations, 95 features, Subsets of 15 variables
Algorithm
– Parallelization of independent model building
Automotive
Problem Description
– Customer Satisfaction
Problem Size
– 2 million cars with 8,000 reacquired cars, 10 million repair cases, 25 million parts exchanges
Algorithms
– Logistic regression using ~22k feature variables
– Increasing the number of features from ~250 to ~21,800 improved precision/recall by an order of magnitude
– Sequence mining using a very low support value
– Very large number of intermediate result sequences
Air Transportation
Problem Description
– Predict passenger volumes at locations in an airport
Problem Size
– WiFi data with ~66 M rows for ~1.3 M MAC addresses
Algorithms
– Multiple models per location, per passenger type
– Time-series analysis using seasonal and non-seasonal autoregressive and moving-average components along with differencing operations (ARIMA and Holt-Winters triple exponential smoothing)
Financial Services
Problem Description
– Compute correlations between Financial Analysts’
performance metrics and sentiments extracted from surveys
submitted by them
Algorithms
– Descriptive (Bivariate) Statistics: Chi-squared test, Spearman’s
Rho, Gamma, Kendall’s Tau-B, Odds-Ratio test, F-test (stratified
and unstratified)
Retail Banking
Problem Description
– Use statistical analysis on social media data linked to the bank’s
data to identify customer segments of interest, find predictors
of purchase intent, and gauge sentiment towards bank’s
products.
Algorithms
– Bivariate odds ratios and binomial proportions with confidence
intervals
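The bivariate statistics named for Retail Banking can be illustrated with a minimal sketch. It assumes a 2x2 contingency table and the common Wald (normal-approximation) intervals; the function names and example counts are hypothetical and are not taken from SystemML's own scripts.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for the 2x2 table [[a, b], [c, d]].

    Assumes all four cell counts are positive. z=1.96 gives a ~95% interval.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(odds ratio) under the normal approximation.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper

def proportion_ci(successes, n, z=1.96):
    """Binomial proportion with a Wald confidence interval, clipped to [0, 1]."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical counts: purchase intent (yes/no) in two customer segments.
or_value, or_low, or_high = odds_ratio_ci(20, 80, 10, 90)
p_value, p_low, p_high = proportion_ci(30, 100)
```

An interval for the odds ratio that excludes 1 suggests an association between segment and purchase intent under these assumptions.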
Services Company
Problem
– Compute a benchmark index by mapping producers’ financial
reports into a normalized schema, using analytics to extrapolate
missing reports and/or impute missing values.
Algorithms
– Regularized least-squares loss minimization and Gibbs sampling
(MCMC) jointly over the parameter space and over the missing
(estimated) values
6. Why Apache SystemML
Today’s Roles of Data Scientists
– Algorithm researcher: Invent new optimization schemes
– Systems programmer: provide distributed
implementations
– Deployment engineer: Run for varying datasets
– Systems researcher: Optimize clusters
SystemML simplifies the life of data scientists
– Implementing custom machine learning algorithms
– Running algorithms distributed if needed
– Running the same algorithms on small data and large data
[Figure labels: NIPS, ICML, KDD, JMLR]
7. Apache SystemML – Declarative Machine Learning
Productivity of data scientists
– Machine learning language for data scientists
(“The SQL for analytics”)
– Strong foundation in linear algebra and statistical functions
– Comes with 20+ pre-implemented algorithms
– Enables development of solutions and tools
Scalability & Performance
– Built on data parallel platforms, e.g. Spark
Cost-based optimizer to compile execution plans
– Depending on data characteristics (tall/skinny, short/wide) and cluster
characteristics
– Ranging from in-memory single node to clusters (MapReduce, Spark),
and hybrid plans
APIs & Tools
– Command line: standalone Java app, spark-submit, hadoop jar
– Use in Spark through Scala, Python, R, and Java APIs
– Embeddable scoring library
– Tools: REPL (Scala Spark and pyspark), SparkR, SparkML,
Jupyter, Zeppelin Notebooks
[Figure: SystemML stack with Language, Compiler, and Runtime layers; backends: In-Memory Single Node (scale-up), Hadoop or Spark Cluster (scale-out), GPU backend (in progress)]
8. SystemML integrated in Spark Ecosystem
[Figure: Spark Core Engine with Spark SQL, Spark Streaming, MLlib (Machine Learning), GraphX, and SystemML (Analytics Library, Custom Analytics)]
DataFrame
Spark API to SystemML
SystemML runs against Spark core for distributed computations
9. Apache SystemML Open Source
Apache Open source Project (http://systemml.apache.org/)
– Nov. 2015, Start SystemML Apache Incubator Project
– …
– Feb. 2017, Release 0.12.0 on Spark 1.6.x …, Python API.
– May 2017, Release 0.14.0 on Spark 2.0.2+.
– May 2017, Apache Top Level Project
– Sep 2017, Release 0.15
Release downloads (http://systemml.apache.org/download)
– Binaries
– Coordinates to Maven repository
Github source code (https://github.com/apache/systemml)
Documentation (https://apache.github.io/systemml/)
3-hour KDD hands-on tutorial (http://systemml.apache.org/tutorial-kdd2017.html), Aug. 2017
10. SystemML’s Scalable Algorithms
Category: Description
Descriptive Statistics: Univariate; Bivariate; Stratified Bivariate
Classification: Logistic Regression (multinomial); Multi-Class SVM, non-linear SVM; Naïve Bayes (multinomial); Decision Trees; Random Forest; kNN
Clustering: k-Means
Regression:
– Linear Regression: system of equations; CG (conjugate gradient descent)
– Generalized Linear Models (GLM)
  Distributions: Gaussian, Poisson, Gamma, Inverse Gaussian, Binomial, Bernoulli
  Links for all distributions: identity, log, sq. root, inverse, 1/μ²
  Links for Binomial / Bernoulli: logit, probit, cloglog, cauchit
– Stepwise: Linear; GLM
– Lasso
Dimension Reduction: PCA, Probabilistic PCA
Matrix Factorization: ALS (direct solve; CG (conjugate gradient descent))
Survival Models: Kaplan Meier Estimate; Cox Proportional Hazard Regression
Deep Learning: Autoencoder, word2vec, CNN, LSTM, RBM … and Deep Learning Library (DML-bodied) functions
Predict: Algorithm-specific scoring
Transformation (native): Recoding, dummy coding, binning, scaling, missing value imputation
PMML models: lm, kmeans, svm, glm, mlogit
11. Effect of Deep Learning: ImageNet Large-Scale Visual
Recognition Challenge
[Figure labels: AlexNet, GoogLeNet, ResNet (34 layer)]
13. Layers
• Fully connected layer
• Convolution layer
• Fewer parameters than a fully connected (FC) layer
• Useful to capture local
features (spatially)
• Output #channels = #filters
Reference: Convolutional Neural Networks for Visual Recognition. http://cs231n.github.io/
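The parameter savings of a convolution layer relative to a fully connected layer can be checked with simple arithmetic. The sketch below assumes a 28x28 single-channel input, 32 output units or filters, and 5x5 filters; these sizes and the function names are illustrative assumptions, not values from the slides.

```python
def fc_params(in_h, in_w, in_c, out_units):
    """Weights plus biases of a fully connected layer on a flattened input."""
    return in_h * in_w * in_c * out_units + out_units

def conv_params(filter_h, filter_w, in_c, num_filters):
    """Weights plus biases of a convolution layer (filters are shared spatially)."""
    return filter_h * filter_w * in_c * num_filters + num_filters

def conv_output_shape(in_h, in_w, filter_h, filter_w, num_filters, stride=1, pad=0):
    """Output shape (channels, height, width); output channels equal the filter count."""
    out_h = (in_h + 2 * pad - filter_h) // stride + 1
    out_w = (in_w + 2 * pad - filter_w) // stride + 1
    return (num_filters, out_h, out_w)

# Assumed sizes: 28x28x1 input, 32 units/filters, 5x5 filters, padding 2.
fc = fc_params(28, 28, 1, 32)                              # 25120 parameters
conv = conv_params(5, 5, 1, 32)                            # 832 parameters
shape = conv_output_shape(28, 28, 5, 5, 32, stride=1, pad=2)  # (32, 28, 28)
```

The convolution layer's parameter count does not depend on the input height and width, which is why it stays small as images grow.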
14. Deep Learning Support
NN library: Reuse existing infrastructure to implement
custom DNNs like other training algorithms
Small number of DL-specific built-in functions
– e.g. convolution
NN library of layers and training optimizers to stack layers, e.g.
– Affine (fully-connected) layer is matrix multiplication
– Convolution layer invokes new convolution function
Caffe/Keras2DML to import existing DNNs
Transfer learning to continue training on different data
GPU and native BLAS libraries
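The statement that an affine (fully connected) layer is a matrix multiplication can be sketched as follows. SystemML's NN library is written in DML; this pure-Python version with nested lists is only an illustration, and the names affine_forward and relu_forward are assumptions chosen for the example.

```python
def affine_forward(X, W, b):
    """Affine layer: out = X W + b.

    X is N x D (one example per row), W is D x M, b has length M.
    """
    return [
        [sum(x_k * W[k][j] for k, x_k in enumerate(row)) + b[j]
         for j in range(len(b))]
        for row in X
    ]

def relu_forward(X):
    """Element-wise ReLU, used here only to show how layers stack."""
    return [[max(0.0, v) for v in row] for row in X]

# Stack two layers: affine -> ReLU.
X = [[1.0, 2.0]]
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -3.0]]
b = [0.5, 0.5, 0.5]
hidden = relu_forward(affine_forward(X, W, b))
```

Because the layer reduces to a matrix multiplication plus a bias, it can reuse the same compiler and runtime support as any other linear-algebra program.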
20. Code Generation for Operator Fusion
Motivation
– Ubiquitous Fusion Opportunities
– High Performance Impact
Key Ideas
– Template skeletons (Row, Cell, Outer, MultiAgg)
– Candidate exploration to identify fusion opportunities
– Candidate selection via cost-based optimizer or heuristics
– Codegen with janino / javac during compile and dynamic recompile
[Figure: fusion examples]
– Multi-Aggregate: a=sum(X^2), b=sum(X*Y), c=sum(Y^2)
– sum(X*Y*Z)
– X^T (X v): 1st pass over X computes q = X v, 2nd pass computes X^T q
– sum(X * log(U V^T)): sparsity exploitation
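The benefit of operator fusion can be seen in a small sketch of two patterns from this slide: the multi-aggregate a=sum(X^2), b=sum(X*Y), c=sum(Y^2), and the matrix-vector chain X^T (X v). This is hand-written Python over plain lists for illustration only; it is not the code that SystemML generates, and the function names are hypothetical.

```python
def multi_agg_unfused(X, Y):
    """Three separate aggregates: three scans over the inputs."""
    a = sum(x * x for x in X)
    b = sum(x * y for x, y in zip(X, Y))
    c = sum(y * y for y in Y)
    return a, b, c

def multi_agg_fused(X, Y):
    """Same three aggregates in a single scan, with no intermediates."""
    a = b = c = 0.0
    for x, y in zip(X, Y):
        a += x * x
        b += x * y
        c += y * y
    return a, b, c

def mv_chain_fused(X, v):
    """X^T (X v) in one pass over the rows of X.

    For each row r, q = r . v is computed and q * r is added to the result,
    so the intermediate vector X v is never materialized.
    """
    out = [0.0] * len(v)
    for row in X:
        q = sum(r_j * v_j for r_j, v_j in zip(row, v))
        for j, r_j in enumerate(row):
            out[j] += q * r_j
    return out

# X and Y are treated as flattened matrices here.
abc = multi_agg_fused([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
chain = mv_chain_fused([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0])
```

Both fused variants read each input element once, which matters when the operations are memory-bandwidth bound.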
21. Codegen Micro Benchmarks (FP64)
[Figure panels: sum(X ⊙ Y ⊙ Z), dense; sum(X ⊙ Y ⊙ Z), sparse; X^T (X v), dense; sum(X ⊙ log(U V^T + 1e-15))]
[Figure labels: Sparsity 0.1; Data size 20K x 20K]
– #1 Gen close to hand-coded fused ops
– #2 TF/Julia Gen only single-threaded
– #3 TF w/ very limited sparse support
– #4 Sparse Gen challenging, Gen better than hand-coded ops
– #5 TF w/ poor performance for data-intensive ops
– #6 Gen at peak mem bandwidth
– #7 Autom. sparsity exploitation across chains of ops
22. SystemML on Power Environment
Contributed native ppc64le libraries for JCuda to the mavenized JCuda project
– GPU backend on Power for SystemML
Contributed native ppc64le libraries to protoc project
– Useful for compiling Caffe proto files
Supported native BLAS operations in SystemML
– Matrix Multiplication, Convolution (forward/backward)
– OpenBLAS with OpenMP support
23. Linear Regression Conjugate Gradient
(preliminary 1/2)
[Figure: time in seconds (0 to 14) vs. number of rows of the input matrix in thousands (64, 128, 256, 512, 1024, 2048); series: PPC CPU Time, PPC GPU Time, x86 CPU Time, x86 GPU Time]
Data: random with sparsity 0.95, 1000 features
Icpt: 0, maxi: 20, tol: 0.001, reg: 0.01
Driver-memory: 100G, local[*] master
The matrix-vector multiplication chain is memory bound, but more cores help with parallelization.
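The workload measured here can be sketched as a conjugate gradient solver for regularized linear regression without intercept, using the slide's parameters (maxi: 20, tol: 0.001, reg: 0.01) as defaults. The sketch solves (X^T X + reg I) w = X^T y and is written in pure Python for illustration; it is an assumption about the general shape of the algorithm, not SystemML's DML script, and the helper names are hypothetical.

```python
def matvec(X, v):
    """X v for a row-major list-of-lists matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in X]

def t_matvec(X, q):
    """X^T q, computed row by row without forming the transpose."""
    out = [0.0] * len(X[0])
    for row, q_i in zip(X, q):
        for j, a in enumerate(row):
            out[j] += a * q_i
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linreg_cg(X, y, reg=0.01, maxi=20, tol=0.001):
    """Conjugate gradient for (X^T X + reg*I) w = X^T y, starting at w = 0."""
    w = [0.0] * len(X[0])
    r = [-g for g in t_matvec(X, y)]   # residual (gradient) at w = 0
    p = [-r_i for r_i in r]            # initial search direction
    norm_r2 = dot(r, r)
    target = norm_r2 * tol ** 2        # stop once the residual norm shrinks by tol
    i = 0
    while i < maxi and norm_r2 > target:
        # The matrix-vector multiplication chain X^T (X p) dominates the cost.
        q = [a + reg * b for a, b in zip(t_matvec(X, matvec(X, p)), p)]
        alpha = norm_r2 / dot(p, q)
        w = [w_i + alpha * p_i for w_i, p_i in zip(w, p)]
        r = [r_i + alpha * q_i for r_i, q_i in zip(r, q)]
        old_norm_r2 = norm_r2
        norm_r2 = dot(r, r)
        p = [-r_i + (norm_r2 / old_norm_r2) * p_i for r_i, p_i in zip(r, p)]
        i += 1
    return w

# Tiny consistent system: y = 1*x1 + 2*x2, solved without regularization.
w = linreg_cg([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [1.0, 2.0, 3.0], reg=0.0)
```

Each iteration touches X twice and does little arithmetic per element read, which is consistent with the memory-bound behavior noted above.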
24. Linear Regression Conjugate Gradient
(preliminary 2/2)
[Figure: time in seconds (0 to 14) vs. number of rows of the input matrix in thousands (64, 256, 1024); series: PPC GPU Time, x86 GPU Time]
Data: random with sparsity 0.95, 1000 features
Icpt: 0, maxi: 20, tol: 0.001, reg: 0.01
Driver-memory: 100G, local[*] master
[Figure: CPU-GPU Transfer Time; time in seconds (0 to 7) vs. number of rows of the input matrix in thousands (64, 256, 1024); series: PPC toDev Time, x86 toDev Time]
Most of the time is spent transferring data from host to device
– 2x performance benefit due to CPU-GPU NVLink
25. More Details
Matthias Boehm, Alexandre Evfimievski, Niketan Pansare, Berthold Reinwald, Prithviraj Sen: Declarative, Large-Scale Machine Learning with Apache SystemML, 3 hours hands-on tutorial, KDD 2017
Tarek Elgamal, Shangyu Luo, Matthias Boehm, Alexandre V. Evfimievski, Shirish Tatikonda, Berthold Reinwald, Prithviraj Sen: SPOOF: Sum-Product Optimization and Operator Fusion for Large-Scale Machine Learning. CIDR 2017
Ahmed Elgohary, Matthias Boehm, Peter J. Haas, Frederick R. Reiss, Berthold Reinwald: Compressed Linear Algebra for Large Scale
Machine Learning. VLDB 2016 (Best Paper Award)
– Extended Version to appear in VLDB Journal, 2017
– Summary Version to appear in ACM SIGMOD Record Research Highlights, 2017
Matthias Boehm, Michael W. Dusenberry, Deron Eriksson, Alexandre V. Evfimievski, Faraz Makari Manshadi, Niketan Pansare, Berthold
Reinwald, Frederick R. Reiss, Prithviraj Sen, Arvind C. Surve, Shirish Tatikonda. SystemML: Declarative Machine Learning on Spark. VLDB
2016
Botong Huang, Matthias Boehm, Yuanyuan Tian, Berthold Reinwald, Shirish Tatikonda, Frederick R. Reiss: Resource Elasticity for Large-Scale Machine Learning. SIGMOD 2015: 137-152
Arash Ashari, Shirish Tatikonda, Matthias Boehm, Berthold Reinwald, Keith Campbell, John Keenleyside, P. Sadayappan: On optimizing
machine learning workloads via kernel fusion. PPOPP 2015: 173-182
Sebastian Schelter, Juan Soto, Volker Markl, Douglas Burdick, Berthold Reinwald, Alexandre V. Evfimievski: Efficient sample generation for
scalable meta learning. ICDE 2015: 1191-1202
Matthias Boehm, Douglas R. Burdick, Alexandre V. Evfimievski, Berthold Reinwald, Frederick R. Reiss, Prithviraj Sen, Shirish
Tatikonda, Yuanyuan Tian: SystemML's Optimizer: Plan Generation for Large-Scale Machine Learning Programs. IEEE Data Eng.
Bull. 37(3): 52-62 (2014)
Matthias Boehm, Shirish Tatikonda, Berthold Reinwald, Prithviraj Sen, Yuanyuan Tian, Douglas Burdick, Shivakumar Vaithyanathan: Hybrid
Parallelization Strategies for Large-Scale Machine Learning in SystemML. PVLDB 7(7): 553-564 (2014)
Peter D. Kirchner, Matthias Boehm, Berthold Reinwald, Daby M. Sow, Michael Schmidt, Deepak S. Turaga, Alain Biem: Large Scale
Discriminative Metric Learning. IPDPS Workshop 2014: 1656-1663
Yuanyuan Tian, Shirish Tatikonda, Berthold Reinwald: Scalable and Numerically Stable Descriptive Statistics in SystemML. ICDE 2012: 1351-1359
Amol Ghoting, Rajasekar Krishnamurthy, Edwin P. D. Pednault, Berthold Reinwald, Vikas Sindhwani, Shirish Tatikonda, Yuanyuan
Tian, Shivakumar Vaithyanathan: SystemML: Declarative machine learning on MapReduce. ICDE 2011: 231-242
[Slide annotations on the papers: Hands-on Tutorial, Automatic Rewrites & Fusion, Compression, 1st paper on Spark, Resource Elasticity, GPU, Sampling, Optimizer, Task Parallelism, Custom Algorithm, Numeric Stability]
26. Summary
SystemML simplifies the life of data scientists
Custom Machine/Deep Learning Algorithms
Scale up & out
Mixed Workloads
– Memory access bound
– Compute bound
Strike Balance between
– Data transfer
– Parallelism