Big Data Meets HPC: Exploiting HPC Technologies for
Accelerating Big Data Processing
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
Talk at HPCAC Stanford Conference (Feb ‘18)
Increasing Usage of HPC, Big Data and Deep Learning
• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)
• HPC (MPI, RDMA, Lustre, etc.)
Convergence of HPC, Big Data, and Deep Learning!
Increasing need to run these applications on the Cloud!!
• Traditional HPC
– Message Passing Interface (MPI), including MPI + OpenMP
– Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)
– Exploiting Accelerators
• Deep Learning
– Caffe and CNTK
• Cloud for HPC
– Virtualization with SR-IOV and Containers
Focus of Yesterday’s Talk: HPC, Deep Learning, and Cloud
• Big Data/Enterprise/Commercial Computing
– Spark and Hadoop (HDFS, HBase, MapReduce)
– Memcached for Web 2.0
– Kafka for Stream Processing
• Deep Learning
– TensorFlow with gRPC
– Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..)
• Cloud Support for Big Data Stacks
– Virtualization with SR-IOV and Containers
– Swift Object Store
Focus of Today's Talk: HPC, Big Data, Deep Learning and Cloud
Big Velocity – How Much Data Is Generated Every Minute on the Internet?
The global Internet population grew 7.5% from 2016 and now represents
3.7 Billion People.
Courtesy: https://www.domo.com/blog/data-never-sleeps-5/
• Substantial impact on designing and utilizing data management and processing systems in multiple tiers
– Front-end data accessing and serving (Online)
• Memcached + DB (e.g. MySQL), HBase
– Back-end data analytics (Offline)
• HDFS, MapReduce, Spark
Data Management and Processing on Modern Clusters
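As a concrete illustration of the front-end/back-end split above, here is a minimal cache-aside sketch for the "Memcached + DB" online tier. It assumes a memcached server on localhost:11211 and uses the pymemcache client; query_user_from_db() is a hypothetical stand-in for the MySQL lookup.

```python
# Minimal cache-aside sketch for the "Memcached + DB" front-end tier described above.
# Assumptions: memcached on localhost:11211 (pymemcache client); query_user_from_db()
# is a hypothetical placeholder for the real MySQL back end.
import json
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def query_user_from_db(user_id):
    # Placeholder for the real database lookup (e.g., via a MySQL driver).
    return {"id": user_id, "name": "example"}

def get_user(user_id, ttl=300):
    key = f"user:{user_id}"
    cached = cache.get(key)                        # 1. try the cache first
    if cached is not None:
        return json.loads(cached)
    user = query_user_from_db(user_id)             # 2. miss: go to the database
    cache.set(key, json.dumps(user), expire=ttl)   # 3. populate the cache for next time
    return user

print(get_user(42))
```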
• Deep Learning over Big Data (DLoBD) is one of the most efficient analytics paradigms
• Typical pipeline: (1) prepare datasets @scale → (2) deep learning @scale → (3) non-deep-learning analytics @scale → (4) apply ML model @scale
• More and more deep learning tools or libraries (e.g., Caffe, TensorFlow) are starting to run over big data stacks, such as Apache Hadoop and Spark
• Benefits of the DLoBD approach
– Easily build a powerful data analytics pipeline
• E.g., Flickr DL/ML Pipeline, “How Deep Learning Powers Flickr”, http://bit.ly/1KIDfof
– Better data locality
– Efficient resource sharing and cost effective
Deep Learning over Big Data (DLoBD)
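A minimal sketch of a DLoBD-style pipeline along the lines described above: Spark prepares the data at scale and a deep-learning model is applied per partition. It assumes PySpark is available; score_image() is a hypothetical stand-in for real Caffe/TensorFlow inference, and the HDFS path is an assumption.

```python
# Sketch of a DLoBD-style pipeline: prepare data at scale with Spark, then apply a
# (hypothetical) deep-learning scoring function per partition. score_image() stands in
# for a real Caffe/TensorFlow model; paths and cluster settings are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dlobd-sketch").getOrCreate()
sc = spark.sparkContext

def score_image(path):
    # Placeholder: load the image and run DL inference here.
    return (path, 0.0)

def score_partition(paths):
    # A real pipeline would load the model once per partition, then score each record.
    return [score_image(p) for p in paths]

image_paths = sc.textFile("hdfs:///datasets/image_list.txt")   # (1) prepare datasets @scale
scores = image_paths.mapPartitions(score_partition)            # (2)/(4) apply the model @scale
print(scores.take(5))
spark.stop()
```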
• Scientific Data Management, Analysis, and Visualization
• Applications examples
– Climate modeling
– Combustion
– Fusion
– Astrophysics
– Bioinformatics
• Data Intensive Tasks
– Run large-scale simulations on supercomputers
– Dump data on parallel storage systems
– Collect experimental / observational data
– Move experimental / observational data to analysis sites
– Visual analytics – help understand data visually
Not Only in Internet Services - Big Data in Scientific Domains
Drivers of Modern HPC Cluster and Data Center Architecture
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
– Single Root I/O Virtualization (SR-IOV)
• Solid State Drives (SSDs), NVM, Parallel Filesystems, Object Storage Clusters
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
[Figure: building blocks – high-performance interconnects such as InfiniBand with SR-IOV (<1 usec latency, 200 Gbps bandwidth); multi-/many-core processors; accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); SSD, NVMe-SSD, NVRAM; cloud and HPC systems such as SDSC Comet and TACC Stampede]
Interconnects and Protocols in OpenFabrics Stack for HPC (http://openfabrics.org)
[Figure: application/middleware interface (Sockets or Verbs) over the protocol, adapter, and switch layers]
• 1/10/40/100 GigE – kernel-space TCP/IP over Ethernet adapters and switches
• 10/40 GigE-TOE – TCP/IP with hardware offload on Ethernet adapters
• IPoIB – TCP/IP over InfiniBand adapters and switches
• RSockets – user-space sockets over InfiniBand
• SDP – Sockets Direct Protocol (RDMA) over InfiniBand
• iWARP – user-space RDMA over iWARP adapters and Ethernet switches
• RoCE – user-space RDMA over RoCE adapters and Ethernet switches
• IB Native – user-space RDMA with Verbs over InfiniBand adapters and switches
How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data Applications?
Bring HPC and Big Data processing into a "convergent trajectory"!
• What are the major bottlenecks in current Big Data processing middleware (e.g. Hadoop, Spark, and Memcached)?
• Can the bottlenecks be alleviated with new designs by taking advantage of HPC technologies?
• Can RDMA-enabled high-performance interconnects benefit Big Data processing?
• Can HPC clusters with high-performance storage systems (e.g. SSD, parallel file systems) benefit Big Data applications?
• How much performance benefit can be achieved through enhanced designs?
• How to design benchmarks for evaluating the performance of Big Data middleware on HPC clusters?
Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure?
Spark Job | Hadoop Job | Deep Learning Job
Designing Communication and I/O Libraries for Big Data Systems: Challenges
• Applications
• Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) – upper-level changes?
• Programming Models (Sockets) – other protocols?
• Communication and I/O Library: point-to-point communication, threaded models and synchronization, virtualization (SR-IOV), I/O and file systems, QoS & fault tolerance, performance tuning, benchmarks
• Networking Technologies (InfiniBand, 1/10/40/100 GigE and Intelligent NICs)
• Storage Technologies (HDD, SSD, NVM, and NVMe-SSD)
• Commodity Computing System Architectures (multi- and many-core architectures and accelerators)
• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
– Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
– HDFS, Memcached, HBase, and Spark Micro-benchmarks
• http://hibd.cse.ohio-state.edu
• User base: 275 organizations from 34 countries
• More than 25,000 downloads from the project site
The High-Performance Big Data (HiBD) Project
Available for InfiniBand and RoCE
Also run on Ethernet
HiBD Release Timeline and Downloads
[Figure: cumulative number of downloads (0–30,000) over the release timeline]
Releases along the timeline: RDMA-Hadoop 1.x 0.9.0, 0.9.8, 0.9.9; RDMA-Memcached 0.9.1 & OHB 0.7.1; RDMA-Hadoop 2.x 0.9.1, 0.9.6, 0.9.7, 1.0.0, 1.3.0; RDMA-Memcached 0.9.4, 0.9.6 & OHB 0.9.3; RDMA-Spark 0.9.1, 0.9.4, 0.9.5; RDMA-HBase 0.9.1
• High-Performance Design of Hadoop over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for HDFS, MapReduce, and
RPC components
– Enhanced HDFS with in-memory and heterogeneous storage
– High performance design of MapReduce over Lustre
– Memcached-based burst buffer for MapReduce over Lustre-integrated HDFS (HHH-L-BB mode)
– Plugin-based architecture supporting RDMA-based designs for Apache Hadoop, CDH and HDP
– Support for OpenPOWER
• Current release: 1.3.0
– Based on Apache Hadoop 2.8.0
– Compliant with Apache Hadoop 2.8.0, HDP 2.5.0.3 and CDH 5.8.2 APIs and applications
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms (x86, POWER)
• Different file systems with disks and SSDs and Lustre
RDMA for Apache Hadoop 2.x Distribution
http://hibd.cse.ohio-state.edu
• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation to have better fault-tolerance as well
as performance. This mode is enabled by default in the package.
• HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in-
memory and obtain as much performance benefit as possible.
• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.
• HHH-L-BB: This mode deploys a Memcached-based burst buffer system to reduce the bandwidth bottleneck of shared file system access. The burst
buffer design is hosted by Memcached servers, each of which has a local SSD.
• MapReduce over Lustre, with/without local disks: Besides HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.
• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in different running modes (HHH, HHH-M, HHH-
L, and MapReduce over Lustre).
Different Modes of RDMA for Apache Hadoop 2.x
• High-Performance Design of Spark over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for Spark
– RDMA-based data shuffle and SEDA-based shuffle architecture
– Non-blocking and chunk-based data transfer
– Off-JVM-heap buffer management
– Support for OpenPOWER
– Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.5
– Based on Apache Spark 2.1.0
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms (x86, POWER)
• RAM disks, SSDs, and HDD
– http://hibd.cse.ohio-state.edu
RDMA for Apache Spark Distribution
• High-Performance Design of HBase over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level
for HBase
– Compliant with Apache HBase 1.1.2 APIs and applications
– On-demand connection setup
– Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.1
– Based on Apache HBase 1.1.2
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms
– http://hibd.cse.ohio-state.edu
RDMA for Apache HBase Distribution
• High-Performance Design of Memcached over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for Memcached and
libMemcached components
– High performance design of SSD-Assisted Hybrid Memory
– Non-Blocking Libmemcached Set/Get API extensions
– Support for burst-buffer mode in Lustre-integrated design of HDFS in RDMA for Apache Hadoop-2.x
– Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with
IPoIB)
• Current release: 0.9.6
– Based on Memcached 1.5.3 and libMemcached 1.0.18
– Compliant with libMemcached APIs and applications
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms
• SSD
– http://hibd.cse.ohio-state.edu
RDMA for Memcached Distribution
• Micro-benchmarks for Hadoop Distributed File System (HDFS)
– Sequential Write Latency (SWL) Benchmark, Sequential Read Latency (SRL) Benchmark, Random Read Latency
(RRL) Benchmark, Sequential Write Throughput (SWT) Benchmark, Sequential Read Throughput (SRT)
Benchmark
– Support benchmarking of
• Apache Hadoop 1.x and 2.x HDFS, Hortonworks Data Platform (HDP) HDFS, Cloudera Distribution of Hadoop (CDH)
HDFS
• Micro-benchmarks for Memcached
– Get Benchmark, Set Benchmark, and Mixed Get/Set Benchmark, Non-Blocking API Latency Benchmark, Hybrid
Memory Latency Benchmark
– Yahoo! Cloud Serving Benchmark (YCSB) Extension for RDMA-Memcached
• Micro-benchmarks for HBase
– Get Latency Benchmark, Put Latency Benchmark
• Micro-benchmarks for Spark
– GroupBy, SortBy
• Current release: 0.9.3
• http://hibd.cse.ohio-state.edu
OSU HiBD Micro-Benchmark (OHB) Suite – HDFS, Memcached, HBase, and Spark
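A conceptual sketch of how latency micro-benchmarks of this kind (e.g., Memcached Set/Get or HDFS sequential write latency) are typically structured: time a fixed operation over a range of payload sizes. This is not the OHB implementation; the operation under test is an injected callable and all sizes are illustrative.

```python
# Conceptual sketch of a latency micro-benchmark loop in the spirit of the OHB suite
# (e.g., Memcached Set/Get or HDFS sequential write latency). The operation under test
# is injected as a callable; this is not the actual OHB code.
import time
import statistics

def run_latency_benchmark(op, payload_sizes, iterations=1000):
    results = {}
    for size in payload_sizes:
        payload = b"x" * size
        samples = []
        for _ in range(iterations):
            start = time.perf_counter()
            op(payload)                                           # e.g., client.set(key, payload)
            samples.append((time.perf_counter() - start) * 1e6)   # microseconds
        results[size] = (statistics.mean(samples), statistics.median(samples))
    return results

if __name__ == "__main__":
    noop = lambda payload: None   # stand-in for a real set/get/write call
    for size, (avg, med) in run_latency_benchmark(noop, [16, 512, 4096]).items():
        print(f"{size:>6} B  avg={avg:8.2f} us  median={med:8.2f} us")
```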
Using HiBD Packages on Existing HPC Infrastructure
• RDMA for Apache Hadoop 2.x and RDMA for Apache Spark are installed and
available on SDSC Comet.
– Examples for various modes of usage are available in:
• RDMA for Apache Hadoop 2.x: /share/apps/examples/HADOOP
• RDMA for Apache Spark: /share/apps/examples/SPARK/
– Please email help@xsede.org (reference Comet as the machine, and SDSC as the
site) if you have any further questions about usage and configuration.
• RDMA for Apache Hadoop is also available on Chameleon Cloud as an
appliance
– https://www.chameleoncloud.org/appliances/17/
HiBD Packages on SDSC Comet and Chameleon Cloud
M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC
Comet, XSEDE’16, July 2016
• Basic Designs for HiBD Packages
– HDFS and MapReduce
– Spark
– Memcached
– Kafka
• Deep Learning
– TensorFlow with gRPC
– Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..)
• Cloud Support for Big Data Stacks
– Virtualization with SR-IOV and Containers
– Swift Object Store
Acceleration Case Studies and Performance Evaluation
Design Overview of HDFS with RDMA
• Enables high performance RDMA communication, while supporting traditional socket interface
• JNI Layer bridges Java based HDFS with communication library written in native code
[Figure: Applications → HDFS (Write/Others) → Java Socket Interface (1/10/40/100 GigE, IPoIB network) or Java Native Interface (JNI) → OSU Design → Verbs → RDMA-capable networks (IB, iWARP, RoCE, ..)]
• Design Features
– RDMA-based HDFS write
– RDMA-based HDFS replication
– Parallel replication support
– On-demand connection setup
– InfiniBand/RoCE support
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda , High Performance RDMA-Based Design of HDFS
over InfiniBand , Supercomputing (SC), Nov 2012
N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014
Enhanced HDFS with In-Memory and Heterogeneous Storage (Triple-H)
• Design Features
– Three modes: Default (HHH), In-Memory (HHH-M), Lustre-Integrated (HHH-L)
– Policies to efficiently utilize the heterogeneous storage devices (RAM disk, SSD, HDD, Lustre)
– Eviction/promotion based on data usage pattern
– Hybrid replication
– Lustre-Integrated mode: Lustre-based fault-tolerance
[Figure: Applications → Triple-H data placement policies (hybrid replication, eviction/promotion) over RAM Disk, SSD, HDD, and Lustre]
N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid '15, May 2015
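A conceptual sketch of heterogeneous-storage placement in the spirit of the eviction/promotion policies above. Tier names, thresholds, and capacities are illustrative assumptions only; this is not the actual Triple-H policy.

```python
# Conceptual sketch of tiered placement over heterogeneous storage: place hot data in
# the fastest tier with free capacity and demote cold data downward. Tier names and
# thresholds are illustrative assumptions, not the actual Triple-H policy.

TIERS = ["ram_disk", "ssd", "hdd", "lustre"]        # fastest to slowest

class TieredStore:
    def __init__(self, capacities):
        self.capacities = dict(capacities)           # bytes available per tier
        self.placement = {}                          # block_id -> tier

    def place(self, block_id, size, access_count):
        # Hot blocks start at the fastest tier; cold blocks start one tier lower.
        start = 0 if access_count > 10 else 1
        for tier in TIERS[start:]:
            if self.capacities[tier] >= size:
                self.capacities[tier] -= size
                self.placement[block_id] = tier
                return tier
        return None                                  # no capacity anywhere

    def evict(self, block_id, size):
        # Demote a block one tier down (promotion would be the mirror operation).
        tier = self.placement[block_id]
        lower = TIERS[min(TIERS.index(tier) + 1, len(TIERS) - 1)]
        self.capacities[tier] += size
        self.capacities[lower] -= size
        self.placement[block_id] = lower
        return lower

store = TieredStore({"ram_disk": 1 << 30, "ssd": 4 << 30, "hdd": 64 << 30, "lustre": 1 << 40})
print(store.place("blk_1", 128 << 20, access_count=42))   # hot block -> ram_disk
```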
Design Overview of MapReduce with RDMA
• Enables high performance RDMA communication, while supporting traditional socket interface
• JNI Layer bridges Java based MapReduce with communication library written in native code
[Figure: Applications → MapReduce (Job Tracker, Task Tracker, Map, Reduce) → Java Socket Interface (1/10/40/100 GigE, IPoIB network) or Java Native Interface (JNI) → OSU Design → Verbs → RDMA-capable networks (IB, iWARP, RoCE, ..)]
• Design Features
– RDMA-based shuffle
– Prefetching and caching map output
– Efficient shuffle algorithms
– In-memory merge
– On-demand shuffle adjustment
– Advanced overlapping: map, shuffle, and merge; shuffle, merge, and reduce
– On-demand connection setup
– InfiniBand/RoCE support
M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in
MapReduce over High Performance Interconnects, ICS, June 2014
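A conceptual sketch of the overlapping idea above: a producer/consumer pipeline lets the merge phase make progress while shuffle fetches are still in flight. This is only an illustration of the concept with threads and a bounded queue, not the HOMR/RDMA implementation.

```python
# Conceptual sketch of overlapping "shuffle" and "merge" with a producer/consumer
# pipeline, in the spirit of the overlapping described above (not the HOMR/RDMA code).
import queue
import threading
import time

fetched = queue.Queue(maxsize=8)          # bounded buffer between shuffle and merge
SENTINEL = None

def shuffle(num_segments):
    for i in range(num_segments):
        time.sleep(0.01)                  # stand-in for fetching a map-output segment
        fetched.put([i, i + 1, i + 2])    # pretend each segment is a small sorted run
    fetched.put(SENTINEL)

def merge():
    merged = []
    while True:
        seg = fetched.get()               # merge proceeds while later fetches continue
        if seg is SENTINEL:
            break
        merged.extend(seg)                # stand-in for an in-memory merge step
    return sorted(merged)

t = threading.Thread(target=shuffle, args=(16,))
t.start()
result = merge()
t.join()
print(len(result), "records merged while shuffle was still in flight")
```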
Performance Numbers of RDMA for Apache Hadoop 2.x – RandomWriter & TeraGen in OSU-RI2 (EDR)
[Figure: RandomWriter and TeraGen execution time (s) vs. data size (GB) – IPoIB (EDR) vs. OSU-IB (EDR); cluster with 8 nodes with a total of 64 maps]
• RandomWriter: 3x improvement over IPoIB for 80–160 GB file size (execution time reduced by 3x)
• TeraGen: 4x improvement over IPoIB for 80–240 GB file size (execution time reduced by 4x)
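A sketch of how benchmarks like these are typically launched, using the standard `hadoop jar` invocation of the MapReduce examples jar. The jar path, row count, and HDFS output paths below are assumptions; because the RDMA package is compliant with the Apache Hadoop APIs, the same commands apply unchanged.

```python
# Sketch of driving TeraGen/TeraSort the standard way ("hadoop jar" with the MapReduce
# examples jar). Jar path, row count, and HDFS paths are assumptions; with the
# API-compatible RDMA package the commands are unchanged.
import subprocess

EXAMPLES_JAR = "/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar"  # assumed path
ROWS = 10_000_000   # 100-byte rows => ~1 GB; scale up for runs like the 80-240 GB above

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["hadoop", "jar", EXAMPLES_JAR, "teragen", str(ROWS), "/benchmarks/teragen-out"])
run(["hadoop", "jar", EXAMPLES_JAR, "terasort", "/benchmarks/teragen-out", "/benchmarks/terasort-out"])
```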
Performance Numbers of RDMA for Apache Hadoop 2.x – Sort & TeraSort in OSU-RI2 (EDR)
[Figure: Sort and TeraSort execution time (s) vs. data size (GB) – IPoIB (EDR) vs. OSU-IB (EDR); 8-node cluster with a total of 64 maps, with 32 reduces in one configuration and 14 reduces in the other]
• Sort: 61% improvement over IPoIB for 80–160 GB data (execution time reduced by 61%)
• TeraSort: 18% improvement over IPoIB for 80–240 GB data (execution time reduced by 18%)
Evaluation with Spark on SDSC Gordon (HHH vs. Tachyon/Alluxio)
• For 200GB TeraGen on 32 nodes
– Spark-TeraGen: HHH has 2.4x improvement over Tachyon; 2.3x over HDFS-IPoIB (QDR)
– Spark-TeraSort: HHH has 25.2% improvement over Tachyon; 17% over HDFS-IPoIB (QDR)
[Figure: TeraGen and TeraSort execution time (s) vs. cluster size : data size (8:50, 16:100, 32:200 GB) for HDFS-IPoIB (QDR), Tachyon, and OSU-IB (QDR); TeraGen time reduced by 2.4x and TeraSort by 25.2% with OSU-IB]
N. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In-Memory File
Systems for Hadoop and Spark Applications on HPC Clusters, IEEE BigData ’15, October 2015
• Basic Designs for HiBD Packages
– HDFS and MapReduce
– Spark
– Memcached
– Kafka
• Deep Learning
– TensorFlow with gRPC
– Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..)
• Cloud Support for Big Data Stacks
– Virtualization with SR-IOV and Containers
– Swift Object Store
Acceleration Case Studies and Performance Evaluation
Design Overview of Spark with RDMA
• Enables high performance RDMA communication, while supporting traditional socket interface
• JNI Layer bridges Scala based Spark with communication library written in native code
• Design Features
– RDMA-based shuffle plugin
– SEDA-based architecture
– Dynamic connection management and sharing
– Non-blocking data transfer
– Off-JVM-heap buffer management
– InfiniBand/RoCE support
X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High
Performance Interconnects (HotI'14), August 2014
X. Lu, D. Shankar, S. Gugnani, and D. K. Panda, High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads, IEEE BigData ‘16, Dec. 2016.
[Figure: Apache Spark benchmarks/applications/libraries/frameworks → Spark Core → Shuffle Manager (Sort, Hash, Tungsten-Sort) → Block Transfer Service (Netty, NIO, RDMA-Plugin) with Netty/NIO/RDMA servers and clients → Java Socket Interface (1/10/40/100 GigE, IPoIB network) or Java Native Interface (JNI) → Native RDMA-based Comm. Engine → RDMA-capable networks (IB, iWARP, RoCE, ..)]
• InfiniBand FDR, SSD, 64 Worker Nodes, 1536 Cores, (1536M 1536R)
• RDMA vs. IPoIB with 1536 concurrent tasks, single SSD per node.
– SortBy: Total time reduced by up to 80% over IPoIB (56Gbps)
– GroupBy: Total time reduced by up to 74% over IPoIB (56Gbps)
Performance Evaluation on SDSC Comet – SortBy/GroupBy
[Figure: SortByTest and GroupByTest total time (sec) vs. data size (64, 128, 256 GB) on 64 worker nodes / 1536 cores – IPoIB vs. RDMA; SortBy reduced by up to 80%, GroupBy by up to 74%]
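A minimal PySpark sketch of SortBy/GroupBy-style shuffle-heavy jobs like those evaluated above. Record counts and partition counts are toy values; with RDMA-Spark the shuffle plugin is selected via the package's Spark configuration, so application code like this does not change.

```python
# Minimal PySpark sketch of SortBy/GroupBy-style jobs like those evaluated above.
# Sizes and partition counts are toy values; with RDMA-Spark the shuffle path is
# chosen via configuration, so this application code stays the same.
import random
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sortby-groupby-sketch").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize(range(1_000_000), 64).map(lambda i: (random.randint(0, 9999), i))

start = time.time()
pairs.sortByKey(numPartitions=64).count()                    # shuffle-heavy sort
print("SortBy-style job:", round(time.time() - start, 2), "s")

start = time.time()
pairs.groupByKey(numPartitions=64).mapValues(len).count()    # shuffle-heavy group
print("GroupBy-style job:", round(time.time() - start, 2), "s")

spark.stop()
```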
• InfiniBand FDR, SSD, 32/64 Worker Nodes, 768/1536 Cores, (768/1536M 768/1536R)
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node.
– 32 nodes/768 cores: Total time reduced by 37% over IPoIB (56Gbps)
– 64 nodes/1536 cores: Total time reduced by 43% over IPoIB (56Gbps)
Performance Evaluation on SDSC Comet – HiBench PageRank
[Figure: PageRank total time (sec) for the Huge, BigData, and Gigantic data sizes – IPoIB vs. RDMA; reduced by 37% on 32 worker nodes / 768 cores and by 43% on 64 worker nodes / 1536 cores]
Application Evaluation on SDSC Comet
• Kira Toolkit: Distributed astronomy image processing
toolkit implemented using Apache Spark
– https://github.com/BIDS/Kira
• Source extractor application, using a 65GB dataset from
the SDSS DR2 survey that comprises 11,150 image files.
[Figure: execution times (sec) for the Kira SE benchmark using the 65 GB dataset on 48 cores – RDMA Spark vs. Apache Spark (IPoIB), 21% improvement]
M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC
Comet, XSEDE’16, July 2016
• BigDL: Distributed Deep Learning Tool using Apache Spark
– https://github.com/intel-analytics/BigDL
• VGG training model on the CIFAR-10 dataset
[Figure: one-epoch time (sec) vs. number of cores (24–384) – IPoIB vs. RDMA, up to 4.58x improvement]
• Challenges
– Performance characteristics of RDMA-based Hadoop and Spark over OpenPOWER systems with IB networks
– Can RDMA-based Big Data middleware demonstrate good performance benefits on OpenPOWER systems?
– Any new accelerations for Big Data middleware on OpenPOWER systems to attain better benefits?
• Results
– For the Spark SortBy benchmark, RDMA-Opt outperforms IPoIB and RDMA-Def by 23.5% and 10.5%
Accelerating Hadoop and Spark with RDMA over OpenPOWER
[Figure: Spark SortBy latency (sec) for 32 GB, 64 GB, and 96 GB on OpenPOWER – IPoIB vs. RDMA-Def vs. RDMA-Opt; RDMA-Opt reduces latency by 23.5%]
[Figure: IBM POWER8 architecture – two sockets, each with SMT cores and a chip interconnect, memory controllers connected to DDR4 memory, and a PCIe interface connected to the IB HCA]
X. Lu, H. Shi, D. Shankar and D. K. Panda, Performance Characterization and Acceleration of Big Data Workloads on OpenPOWER System, IEEE BigData 2017.
• Basic Designs for HiBD Packages
– HDFS and MapReduce
– Spark
– Memcached
– Kafka
• Deep Learning
– TensorFlow with gRPC
– Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..)
• Cloud Support for Big Data Stacks
– Virtualization with SR-IOV and Containers
– Swift Object Store
Acceleration Case Studies and Performance Evaluation
Memcached Performance (FDR Interconnect)
Experiments on TACC Stampede (Intel SandyBridge cluster, IB: FDR)
[Figure: Memcached GET latency (us) vs. message size and Memcached throughput (thousands of transactions per second) vs. number of clients – OSU-IB (FDR) vs. IPoIB; latency reduced by nearly 20x, throughput improved by about 2x]
• Memcached GET latency
– 4 bytes: OSU-IB 2.84 us vs. IPoIB 75.53 us; 2K bytes: OSU-IB 4.49 us vs. IPoIB 123.42 us
• Memcached throughput (4 bytes)
– 4080 clients: OSU-IB 556 Kops/sec vs. IPoIB 233 Kops/sec – nearly 2x improvement in throughput
J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High
Performance RDMA Capable Interconnects, ICPP’11
J. Jose, H. Subramoni, K. Kandalla, M. W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached design for InfiniBand Clusters using Hybrid
Transport, CCGrid’12
Micro-benchmark Evaluation for OLDP Workloads
• Illustration with the Read-Cache-Read access pattern using a modified mysqlslap load-testing tool
• Memcached-RDMA can
– improve query latency by up to 66% over IPoIB (32Gbps)
– improve throughput by up to 69% over IPoIB (32Gbps)
[Figure: query latency (sec) and throughput (Kq/s) vs. number of clients (64–400) – Memcached-IPoIB (32Gbps) vs. Memcached-RDMA (32Gbps); latency reduced by up to 66%]
D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, Can RDMA Benefit On-Line Data Processing Workloads with Memcached and MySQL, ISPASS'15
– Data does not fit in memory: Non-blocking Memcached Set/Get API Extensions can achieve
• >16x latency improvement vs. blocking API over RDMA-Hybrid/RDMA-Mem w/ penalty
• >2.5x throughput improvement vs. blocking API over default/optimized RDMA-Hybrid
– Data fits in memory: Non-blocking Extensions perform similar to RDMA-Mem/RDMA-Hybrid and >3.6x
improvement over IPoIB-Mem
Performance Evaluation with Non-Blocking Memcached API
[Figure: average Set/Get latency (us) broken down into miss penalty (backend DB access overhead), client wait, server response, cache update, cache check + load (memory and/or SSD read), and slab allocation (with SSD write on out-of-memory) – for IPoIB-Mem, RDMA-Mem, H-RDMA-Def, H-RDMA-Opt-Block, H-RDMA-Opt-NonB-i, and H-RDMA-Opt-NonB-b]
H = Hybrid Memcached over SATA SSD; Opt = Adaptive slab manager; Block = Default blocking API; NonB-i = Non-blocking iset/iget API; NonB-b = Non-blocking bset/bget API with buffer re-use guarantee
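A conceptual sketch of the non-blocking set/get idea above ("issue now, wait later") using a thread pool over pymemcache. The real RDMA-Memcached iset/iget and bset/bget extensions are libMemcached (C) APIs; this only illustrates how overlapping issued requests with other work can hide miss penalties.

```python
# Conceptual sketch of non-blocking set/get semantics ("issue now, wait later") using a
# thread pool over pymemcache. The real RDMA-Memcached iset/iget/bset/bget extensions
# are libMemcached (C) APIs; this only illustrates the overlapping idea.
from concurrent.futures import ThreadPoolExecutor
from pymemcache.client.base import PooledClient   # pooled client is safe across threads

class AsyncMemcached:
    def __init__(self, server=("localhost", 11211), workers=8):
        self._client = PooledClient(server, max_pool_size=workers)
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def iset(self, key, value):
        # Returns immediately with a future; completion is checked later.
        return self._pool.submit(self._client.set, key, value)

    def iget(self, key):
        return self._pool.submit(self._client.get, key)

mc = AsyncMemcached()
pending = [mc.iset(f"k{i}", b"v" * 64) for i in range(100)]   # issue 100 sets
# ... the application can overlap computation or backend I/O here ...
done = sum(1 for f in pending if f.result())                   # wait for completion
print(done, "sets completed")
```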
RDMA-Kafka: High-Performance Message Broker for Streaming Workloads
[Figure: Kafka producer benchmark (1 broker, 4 producers) – throughput (MB/s) and latency (ms) vs. record size, RDMA vs. IPoIB; up to 98% latency improvement and up to 7x higher throughput]
• Experiments run on the OSU-RI2 cluster
– 2.4 GHz, 28 cores, InfiniBand EDR, 512 GB RAM, 400 GB SSD
– Up to 98% improvement in latency compared to IPoIB
– Up to 7x increase in throughput over IPoIB
H. Javed, X. Lu , and D. K. Panda, Characterization of Big Data Stream Processing Pipeline: A Case Study using Flink and Kafka,
BDCAT’17
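A sketch of a small Kafka producer benchmark in the spirit of the experiment above (throughput and latency vs. record size), using the kafka-python client. The broker address, topic name, and record counts are assumptions.

```python
# Sketch of a Kafka producer benchmark (throughput/latency vs. record size) using
# kafka-python. Broker address, topic name, and record counts are assumptions.
import time
from kafka import KafkaProducer

BROKER = "localhost:9092"     # assumed broker endpoint
TOPIC = "bench"               # assumed topic
RECORD_SIZES = [100, 1000, 10000, 100000]
RECORDS_PER_SIZE = 10_000

producer = KafkaProducer(bootstrap_servers=BROKER, acks=1)

for size in RECORD_SIZES:
    payload = b"x" * size
    start = time.time()
    for _ in range(RECORDS_PER_SIZE):
        producer.send(TOPIC, payload)
    producer.flush()                      # wait until all records are acknowledged
    elapsed = time.time() - start
    mb = size * RECORDS_PER_SIZE / 1e6
    print(f"record size {size:>6} B: {mb / elapsed:8.1f} MB/s, "
          f"{elapsed / RECORDS_PER_SIZE * 1e3:6.3f} ms avg per send")

producer.close()
```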
• Basic Designs for HiBD Packages
– HDFS and MapReduce
– Spark
– Memcached
– Kafka
• Deep Learning
– TensorFlow with gRPC
– Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..)
• Cloud Support for Big Data Stacks
– Virtualization with SR-IOV and Containers
– Swift Object Store
Acceleration Case Studies and Performance Evaluation
Overview of RDMA-gRPC with TensorFlow
• Worker services communicate with each other using RDMA-gRPC
[Figure: client → master → workers (/job:PS/task:0 and /job:Worker/task:0), each with CPU/GPU and an RDMA gRPC server/client]
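For context, a sketch of the TensorFlow (1.x-era, matching this talk) cluster definition whose PS/worker tasks communicate over gRPC. With an RDMA-based gRPC design the transport underneath gRPC changes while this job/task layout stays the same; the host names below are assumptions.

```python
# Sketch of a TensorFlow 1.x distributed setup whose PS/worker tasks talk over gRPC.
# Host names are assumptions; an RDMA-based gRPC changes the transport, not this layout.
import tensorflow as tf   # TensorFlow 1.x API

cluster = tf.train.ClusterSpec({
    "ps":     ["node0:2222"],                 # /job:ps/task:0
    "worker": ["node1:2222", "node2:2222"],   # /job:worker/task:0, /job:worker/task:1
})

# Each process starts a server for its (job_name, task_index); the gRPC channels
# between these servers are what an RDMA-based design accelerates.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

with tf.device("/job:ps/task:0"):
    w = tf.Variable(tf.zeros([10]))           # parameters live on the PS task

with tf.device("/job:worker/task:0"):
    update = tf.assign_add(w, tf.ones([10]))  # workers push updates over gRPC

with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(update))
```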
Performance Benefits for RDMA-gRPC with Micro-Benchmark
RDMA-gRPC RPC Latency
• AR-gRPC (OSU design) Latency on SDSC-Comet-FDR
– Up to 2.7x performance speedup over IPoIB for Latency for small messages
– Up to 2.8x performance speedup over IPoIB for Latency for medium messages
– Up to 2.5x performance speedup over IPoIB for Latency for large messages
[Figure: RPC latency (us) vs. payload size for small (2 B–8 KB), medium (16 KB–512 KB), and large (1 MB–8 MB) payloads – Default gRPC (IPoIB) vs. AR-gRPC]
R. Biswas, X. Lu, and D. K. Panda, Characterizing and Accelerating TensorFlow: Can Adaptive and RDMA-based RPC Benefit It? Under Review
Performance Benefit for TensorFlow
• TensorFlow performance evaluation on an IB EDR cluster
– Up to 19% performance speedup over IPoIB for Sigmoid net (20 epochs)
– Up to 35% and 31% performance speedup over IPoIB for resnet50 and Inception3 (batch size 8)
[Figure: left – sigmoid net training time (seconds, lower is better); right – CNN throughput (images/second, higher is better) for resnet50 and inception3; 4 nodes (1 PS, 3 workers), comparing Default gRPC, gRPC + Verbs, gRPC + MPI, and AR-gRPC; speedups of 19% (sigmoid net), 35% (resnet50), and 31% (inception3)]
• Basic Designs for HiBD Packages
– HDFS and MapReduce
– Spark
– Memcached
– Kafka
• Deep Learning
– TensorFlow with gRPC
– Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..)
• Cloud Support for Big Data Stacks
– Virtualization with SR-IOV and Containers
– Swift Object Store
Acceleration Case Studies and Performance Evaluation
• Layers of DLoBD Stacks
– Deep learning application layer
– Deep learning library layer
– Big data analytics framework layer
– Resource scheduler layer
– Distributed file system layer
– Hardware resource layer
• How much performance benefit can we achieve for end deep learning applications?
Overview of DLoBD Stacks
[Figure: layered DLoBD stack; performance across the layers can be sub-optimal]
Performance Evaluation for IPoIB and RDMA with CaffeOnSpark and
TensorFlowOnSpark (IB EDR)
• CaffeOnSpark benefits from the high performance of RDMA compared to IPoIB once
communication overhead becomes significant.
• Our experiments show that the default RDMA design in TensorFlowOnSpark is not fully
optimized yet. For MNIST tests, RDMA is not showing obvious benefits.
[Figure: CaffeOnSpark end-to-end time (secs) for CIFAR-10 Quick and LeNet on MNIST (GPU and GPU-cuDNN, 2–16 GPUs) – IPoIB vs. RDMA improvements of 14.2%, 13.3%, 51.2%, and 45.6%; TensorFlowOnSpark end-to-end time for CIFAR-10 (2 GPUs/node) and MNIST (1 GPU/node) with 2/4/8 GPUs – 33.01% (CIFAR-10) and 4.9% (MNIST)]
Epoch-Level Evaluation with BigDL on SDSC Comet
[Figure: VGG on CIFAR-10 with BigDL – accuracy (%) and accumulative epoch time (secs) over epochs 1–18, IPoIB vs. RDMA]
• Epoch-level evaluation of
training VGG model using
BigDL on default Spark with
IPoIB and our RDMA-based
Spark.
• The RDMA version consistently takes less time than the IPoIB version to finish every epoch.
– RDMA reaches epoch 18 about 2.6x faster than IPoIB
• Basic Designs for HiBD Packages
– HDFS and MapReduce
– Spark
– Memcached
– Kafka
• Deep Learning
– TensorFlow with gRPC
– Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..)
• Cloud Support for Big Data Stacks
– Virtualization with SR-IOV and Containers
– Swift Object Store
Acceleration Case Studies and Performance Evaluation
Overview of RDMA-Hadoop-Virt Architecture
• Virtualization-aware modules in all the four
main Hadoop components:
– HDFS: Virtualization-aware Block Management
to improve fault-tolerance
– YARN: Extensions to Container Allocation Policy
to reduce network traffic
– MapReduce: Extensions to Map Task Scheduling
Policy to reduce network traffic
– Hadoop Common: Topology Detection Module
for automatic topology detection
• Communications in HDFS, MapReduce, and RPC
go through RDMA-based designs over SR-IOV
enabled InfiniBand
[Figure: RDMA-Hadoop-Virt architecture – Big Data applications (CloudBurst, MR-MS Polygraph, others) over HDFS (virtualization-aware block management), YARN (container allocation policy extension), MapReduce (map task scheduling policy extension), HBase, and Hadoop Common (topology detection module), running on virtual machines, containers, or bare-metal nodes]
S. Gugnani, X. Lu, D. K. Panda. Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on
SR-IOV-enabled Clouds. CloudCom, 2016.
Evaluation with Applications
– 14% and 24% improvement with Default Mode for CloudBurst and Self-Join
– 30% and 55% improvement with Distributed Mode for CloudBurst and Self-Join
[Figure: CloudBurst and Self-Join execution time in Default Mode and Distributed Mode – RDMA-Hadoop vs. RDMA-Hadoop-Virt; 30% and 55% reductions in Distributed Mode]
Will be released in the future
• Basic Designs for HiBD Packages
– HDFS and MapReduce
– Spark
– Memcached
– Kafka
• Deep Learning
– TensorFlow with gRPC
– Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..)
• Cloud Support for Big Data Stacks
– Virtualization with SR-IOV and Containers
– Swift Object Store
Acceleration Case Studies and Performance Evaluation
• Distributed Cloud-based Object Storage Service
• Deployed as part of OpenStack installation
• Can be deployed as standalone storage solution as well
• Worldwide data access via Internet
– HTTP-based
• Architecture
– Multiple Object Servers: To store data
– Few Proxy Servers: Act as a proxy for all requests
– Ring: Handles metadata
• Usage
– Input/output source for Big Data applications (most common use
case)
– Software/Data backup
– Storage of VM/Docker images
• Based on traditional TCP sockets communication
OpenStack Swift Overview
[Figure: Swift architecture – the client sends PUT/GET /v1/<account>/<container>/<object> requests to proxy servers, which use the ring (metadata) to route to object servers, each with local disks]
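A minimal PUT/GET sketch against Swift using python-swiftclient, matching the request flow above. The auth URL and credentials are placeholder assumptions; designs such as Swift-X keep this client-side API while changing the server-side communication path.

```python
# Minimal OpenStack Swift PUT/GET sketch with python-swiftclient, matching the
# PUT/GET /v1/<account>/<container>/<object> flow above. Auth URL and credentials
# are placeholder assumptions.
from swiftclient.client import Connection

conn = Connection(
    authurl="http://proxy.example.org:8080/auth/v1.0",  # assumed auth endpoint
    user="test:tester",
    key="testing",
)

conn.put_container("benchmarks")
conn.put_object("benchmarks", "object-1MB", contents=b"x" * (1 << 20))

headers, body = conn.get_object("benchmarks", "object-1MB")
print("downloaded", len(body), "bytes; etag:", headers.get("etag"))
```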
• Challenges
– Proxy server is a bottleneck for large scale deployments
– Object upload/download operations network intensive
– Can an RDMA-based approach benefit?
• Design
– Re-designed Swift architecture for improved scalability and
performance; Two proposed designs:
• Client-Oblivious Design: No changes required on the client side
• Metadata Server-based Design: Direct communication between
client and object servers; bypass proxy server
– RDMA-based communication framework for accelerating
networking performance
– High-performance I/O framework to provide maximum
overlap between communication and I/O
Swift-X: Accelerating OpenStack Swift with RDMA for Building
Efficient HPC Clouds
S. Gugnani, X. Lu, and D. K. Panda, Swift-X: Accelerating OpenStack Swift with RDMA for Building an Efficient HPC Cloud,
accepted at CCGrid’17, May 2017
Client-Oblivious Design
(D1)
Metadata Server-based
Design (D2)
Swift-X: Accelerating OpenStack Swift with RDMA for Building Efficient HPC Clouds
[Figure: time breakup of GET and PUT (communication, I/O, hashsum, other) for Swift, Swift-X (D1), and Swift-X (D2); GET latency (s) vs. object size (1 MB–4 GB) for Swift, Swift-X (D2), and Swift-X (D1), reduced by up to 66%]
• Up to 66% reduction in GET latency
• Communication time reduced by up to 3.8x for PUT and up to 2.8x for GET
• Discussed challenges in accelerating Big Data and Deep Learning stacks with HPC technologies
• Presented basic and advanced designs to take advantage of InfiniBand/RDMA for
HDFS, MapReduce, HBase, Spark, Memcached, gRPC, and TensorFlow on HPC clusters
and clouds
• Results are promising for Big Data processing and the associated Deep Learning tools
• Many other open issues need to be solved
• Will enable Big Data and Deep Learning communities to take advantage of modern HPC
technologies to carry out their analytics in a fast and scalable manner
Concluding Remarks
Funding Acknowledgments
Funding Support by
Equipment Support by
Personnel Acknowledgments
Current Students (Graduate)
– A. Awan (Ph.D.)
– R. Biswas (M.S.)
– M. Bayatpour (Ph.D.)
– S. Chakraborthy (Ph.D.)
– C.-H. Chu (Ph.D.)
– S. Guganani (Ph.D.)
Past Students
– A. Augustine (M.S.)
– P. Balaji (Ph.D.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– R. Rajachandrasekar (Ph.D.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
Past Research Scientist
– K. Hamidouche
– S. Sur
Past Post-Docs
– D. Banerjee
– X. Besseron
– H.-W. Jin
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– K. Kulkarni (M.S.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– M. Li (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– J. Hashmi (Ph.D.)
– H. Javed (Ph.D.)
– P. Kousha (Ph.D.)
– D. Shankar (Ph.D.)
– H. Shi (Ph.D.)
– J. Zhang (Ph.D.)
– J. Lin
– M. Luo
– E. Mancini
Current Research Scientists
– X. Lu
– H. Subramoni
Past Programmers
– D. Bureddy
– J. Perkins
Current Research Specialist
– J. Smith
– M. Arnold
– S. Marcarelli
– J. Vienne
– H. Wang
Current Post-doc
– A. Ruhela
Current Students (Undergraduate)
– N. Sarkauskas (B.S.)
The 4th International Workshop on
High-Performance Big Data Computing (HPBDC)
HPBDC 2018 will be held with IEEE International Parallel and Distributed Processing
Symposium (IPDPS 2018), Vancouver, British Columbia CANADA, May, 2018
http://web.cse.ohio-state.edu/~luxi/hpbdc2018
HPBDC 2017 was held in conjunction with IPDPS’17
Keynote Talk: Prof. Satoshi Matsuoka, Converging HPC and Big Data / AI Infrastructures at Scale with BYTES-Oriented Architectures
Panel Topic: Sunrise or Sunset: Exploring the Design Space of Big Data Software Stack
Six Regular Research Papers and One Short Research Paper
http://web.cse.ohio-state.edu/~luxi/hpbdc2017
HPBDC 2016 was held in conjunction with IPDPS’16
Keynote Talk: Dr. Chaitanya Baru, High Performance Computing and Which Big Data?
Panel Topic: Merge or Split: Mutual Influence between Big Data and HPC Techniques
Six Regular Research Papers and Two Short Research Papers
http://web.cse.ohio-state.edu/~luxi/hpbdc2016
panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
Thank You!
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/

 

Mais de inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Último

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 

Último (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 

Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Processing

  • 1. Big Data Meets HPC: Exploiting HPC Technologies for Accelerating Big Data Processing Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Talk at HPCAC Stanford Conference (Feb ‘18) by
  • 2. HPCAC-Stanford (Feb ’18) 2Network Based Computing Laboratory Big Data (Hadoop, Spark, HBase, Memcached, etc.) Deep Learning (Caffe, TensorFlow, BigDL, etc.) HPC (MPI, RDMA, Lustre, etc.) Increasing Usage of HPC, Big Data and Deep Learning Convergence of HPC, Big Data, and Deep Learning! Increasing Need to Run these applications on the Cloud!!
  • 3. HPCAC-Stanford (Feb ’18) 3Network Based Computing Laboratory • Traditional HPC – Message Passing Interface (MPI), including MPI + OpenMP – Support for PGAS and MPI + PGAS (OpenSHMEM, UPC) – Exploiting Accelerators • Deep Learning – Caffe and CNTK • Cloud for HPC – Virtualization with SR-IOV and Containers Focus of Yesterday’s Talk: HPC, Deep Learning, and Cloud
  • 4. HPCAC-Stanford (Feb ’18) 4Network Based Computing Laboratory • Big Data/Enterprise/Commercial Computing – Spark and Hadoop (HDFS, HBase, MapReduce) – Memcached for Web 2.0 – Kafka for Stream Processing • Deep Learning – TensorFlow with gRPC – Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..) • Cloud Support for Big Data Stacks – Virtualization with SR-IOV and Containers – Swift Object Store Focus of Today’s Talk: HPC, BigData, Deep Learning and Cloud
  • 5. HPCAC-Stanford (Feb ’18) 5Network Based Computing Laboratory Big Velocity – How Much Data Is Generated Every Minute on the Internet? The global Internet population grew 7.5% from 2016 and now represents 3.7 Billion People. Courtesy: https://www.domo.com/blog/data-never-sleeps-5/
  • 6. HPCAC-Stanford (Feb ’18) 6Network Based Computing Laboratory • Substantial impact on designing and utilizing data management and processing systems in multiple tiers – Front-end data accessing and serving (Online) • Memcached + DB (e.g. MySQL), HBase – Back-end data analytics (Offline) • HDFS, MapReduce, Spark Data Management and Processing on Modern Clusters
  • 7. HPCAC-Stanford (Feb ’18) 7Network Based Computing Laboratory (1) Prepare Datasets @Scale (2) Deep Learning @Scale (3) Non-deep learning analytics @Scale (4) Apply ML model @Scale • Deep Learning over Big Data (DLoBD) is one of the most efficient analyzing paradigms • More and more deep learning tools or libraries (e.g., Caffe, TensorFlow) start running over big data stacks, such as Apache Hadoop and Spark • Benefits of the DLoBD approach – Easily build a powerful data analytics pipeline • E.g., Flickr DL/ML Pipeline, “How Deep Learning Powers Flickr”, http://bit.ly/1KIDfof – Better data locality – Efficient resource sharing and cost effective Deep Learning over Big Data (DLoBD)
  • 8. HPCAC-Stanford (Feb ’18) 8Network Based Computing Laboratory • Scientific Data Management, Analysis, and Visualization • Applications examples – Climate modeling – Combustion – Fusion – Astrophysics – Bioinformatics • Data Intensive Tasks – Runs large-scale simulations on supercomputers – Dump data on parallel storage systems – Collect experimental / observational data – Move experimental / observational data to analysis sites – Visual analytics – help understand data visually Not Only in Internet Services - Big Data in Scientific Domains
  • 9. HPCAC-Stanford (Feb ’18) 9Network Based Computing Laboratory Drivers of Modern HPC Cluster and Data Center Architecture • Multi-core/many-core technologies • Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE) – Single Root I/O Virtualization (SR-IOV) • Solid State Drives (SSDs), NVM, Parallel Filesystems, Object Storage Clusters • Accelerators (NVIDIA GPGPUs and Intel Xeon Phi) High Performance Interconnects – InfiniBand (with SR-IOV) <1usec latency, 200Gbps Bandwidth> Multi-/Many-core Processors Cloud CloudSDSC Comet TACC Stampede Accelerators / Coprocessors high compute density, high performance/watt >1 TFlop DP on a chip SSD, NVMe-SSD, NVRAM
  • 10. HPCAC-Stanford (Feb ’18) 10 Network Based Computing Laboratory Interconnects and Protocols in OpenFabrics Stack for HPC (http://openfabrics.org) [Diagram: for each path, the application/middleware interface (Sockets or Verbs), protocol, adapter, and switch – 1/10/40/100 GigE (kernel TCP/IP over Ethernet), 10/40 GigE-TOE (hardware-offloaded TCP/IP), IPoIB (kernel IPoIB over InfiniBand), RSockets and SDP (user-space sockets over InfiniBand), iWARP (user-space RDMA over offloaded TCP/IP on Ethernet), RoCE (user-space RDMA over Ethernet), and IB Native (user-space RDMA verbs over InfiniBand)]
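The protocol choices above determine whether an application ever touches native verbs. As a quick, hedged illustration (not part of the HiBD packages), the sketch below uses the rdma-core pyverbs bindings to check which verbs-capable devices a node exposes; device names and attribute fields follow libibverbs conventions, and any node without RDMA hardware would simply report no devices.

```python
# Minimal sketch, assuming the rdma-core "pyverbs" bindings and an RDMA-capable NIC.
# It enumerates verbs devices and prints basic attributes -- a quick way to confirm
# that native IB/RoCE/iWARP paths (rather than only sockets/IPoIB) are available.
import pyverbs.device as d

def list_rdma_devices():
    devices = d.get_device_list()      # all verbs-capable adapters visible to this node
    if not devices:
        print("No RDMA devices found; workloads would fall back to sockets/IPoIB.")
        return
    for dev in devices:
        ctx = d.Context(name=dev.name.decode())
        attr = ctx.query_device()      # device capabilities (ports, max QPs, ...)
        print(f"{dev.name.decode()}: ports={attr.phys_port_cnt}, max_qp={attr.max_qp}")
        ctx.close()

if __name__ == "__main__":
    list_rdma_devices()
```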
  • 11. HPCAC-Stanford (Feb ’18) 11Network Based Computing Laboratory How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data Applications? Bring HPC and Big Data processing into a “convergent trajectory”! What are the major bottlenecks in current Big Data processing middleware (e.g. Hadoop, Spark, and Memcached)? Can the bottlenecks be alleviated with new designs by taking advantage of HPC technologies? Can RDMA-enabled high-performance interconnects benefit Big Data processing? Can HPC Clusters with high-performance storage systems (e.g. SSD, parallel file systems) benefit Big Data applications? How much performance benefits can be achieved through enhanced designs? How to design benchmarks for evaluating the performance of Big Data middleware on HPC clusters?
  • 12. HPCAC-Stanford (Feb ’18) 12Network Based Computing Laboratory Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure?
  • 13. HPCAC-Stanford (Feb ’18) 13Network Based Computing Laboratory Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure?
  • 14. HPCAC-Stanford (Feb ’18) 14Network Based Computing Laboratory Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure? Spark Job Hadoop Job Deep Learning Job
  • 15. HPCAC-Stanford (Feb ’18) 15Network Based Computing Laboratory Designing Communication and I/O Libraries for Big Data Systems: Challenges Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) Networking Technologies (InfiniBand, 1/10/40/100 GigE and Intelligent NICs) Storage Technologies (HDD, SSD, NVM, and NVMe- SSD) Programming Models (Sockets) Applications Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators) Other Protocols? Communication and I/O Library Point-to-Point Communication QoS & Fault Tolerance Threaded Models and Synchronization Performance TuningI/O and File Systems Virtualization (SR-IOV) Benchmarks Upper level Changes?
  • 16. HPCAC-Stanford (Feb ’18) 16Network Based Computing Laboratory • RDMA for Apache Spark • RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x) – Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions • RDMA for Apache HBase • RDMA for Memcached (RDMA-Memcached) • RDMA for Apache Hadoop 1.x (RDMA-Hadoop) • OSU HiBD-Benchmarks (OHB) – HDFS, Memcached, HBase, and Spark Micro-benchmarks • http://hibd.cse.ohio-state.edu • Users Base: 275 organizations from 34 countries • More than 25,000 downloads from the project site The High-Performance Big Data (HiBD) Project Available for InfiniBand and RoCE Also run on Ethernet
  • 17. HPCAC-Stanford (Feb ’18) 17 Network Based Computing Laboratory HiBD Release Timeline and Downloads [Chart: Number of Downloads vs. Timeline, annotated with releases – RDMA-Hadoop 1.x 0.9.0/0.9.8/0.9.9, RDMA-Hadoop 2.x 0.9.1/0.9.6/0.9.7/1.0.0/1.3.0, RDMA-Spark 0.9.1/0.9.4/0.9.5, RDMA-HBase 0.9.1, RDMA-Memcached 0.9.1/0.9.4/0.9.6, OHB 0.7.1/0.9.3]
  • 18. HPCAC-Stanford (Feb ’18) 18Network Based Computing Laboratory • High-Performance Design of Hadoop over RDMA-enabled Interconnects – High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for HDFS, MapReduce, and RPC components – Enhanced HDFS with in-memory and heterogeneous storage – High performance design of MapReduce over Lustre – Memcached-based burst buffer for MapReduce over Lustre-integrated HDFS (HHH-L-BB mode) – Plugin-based architecture supporting RDMA-based designs for Apache Hadoop, CDH and HDP – Support for OpenPOWER • Current release: 1.3.0 – Based on Apache Hadoop 2.8.0 – Compliant with Apache Hadoop 2.8.0, HDP 2.5.0.3 and CDH 5.8.2 APIs and applications – Tested with • Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR) • RoCE support with Mellanox adapters • Various multi-core platforms (x86, POWER) • Different file systems with disks and SSDs and Lustre RDMA for Apache Hadoop 2.x Distribution http://hibd.cse.ohio-state.edu
  • 19. HPCAC-Stanford (Feb ’18) 19Network Based Computing Laboratory • HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation to have better fault-tolerance as well as performance. This mode is enabled by default in the package. • HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in- memory and obtain as much performance benefit as possible. • HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster. • HHH-L-BB: This mode deploys a Memcached-based burst buffer system to reduce the bandwidth bottleneck of shared file system access. The burst buffer design is hosted by Memcached servers, each of which has a local SSD. • MapReduce over Lustre, with/without local disks: Besides, HDFS based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks. • Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in different running modes (HHH, HHH-M, HHH- L, and MapReduce over Lustre). Different Modes of RDMA for Apache Hadoop 2.x
  • 20. HPCAC-Stanford (Feb ’18) 20Network Based Computing Laboratory • High-Performance Design of Spark over RDMA-enabled Interconnects – High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for Spark – RDMA-based data shuffle and SEDA-based shuffle architecture – Non-blocking and chunk-based data transfer – Off-JVM-heap buffer management – Support for OpenPOWER – Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB) • Current release: 0.9.5 – Based on Apache Spark 2.1.0 – Tested with • Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR) • RoCE support with Mellanox adapters • Various multi-core platforms (x86, POWER) • RAM disks, SSDs, and HDD – http://hibd.cse.ohio-state.edu RDMA for Apache Spark Distribution
  • 21. HPCAC-Stanford (Feb ’18) 21Network Based Computing Laboratory • High-Performance Design of HBase over RDMA-enabled Interconnects – High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for HBase – Compliant with Apache HBase 1.1.2 APIs and applications – On-demand connection setup – Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB) • Current release: 0.9.1 – Based on Apache HBase 1.1.2 – Tested with • Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR) • RoCE support with Mellanox adapters • Various multi-core platforms – http://hibd.cse.ohio-state.edu RDMA for Apache HBase Distribution
  • 22. HPCAC-Stanford (Feb ’18) 22Network Based Computing Laboratory • High-Performance Design of Memcached over RDMA-enabled Interconnects – High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for Memcached and libMemcached components – High performance design of SSD-Assisted Hybrid Memory – Non-Blocking Libmemcached Set/Get API extensions – Support for burst-buffer mode in Lustre-integrated design of HDFS in RDMA for Apache Hadoop-2.x – Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB) • Current release: 0.9.6 – Based on Memcached 1.5.3 and libMemcached 1.0.18 – Compliant with libMemcached APIs and applications – Tested with • Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR) • RoCE support with Mellanox adapters • Various multi-core platforms • SSD – http://hibd.cse.ohio-state.edu RDMA for Memcached Distribution
  • 23. HPCAC-Stanford (Feb ’18) 23Network Based Computing Laboratory • Micro-benchmarks for Hadoop Distributed File System (HDFS) – Sequential Write Latency (SWL) Benchmark, Sequential Read Latency (SRL) Benchmark, Random Read Latency (RRL) Benchmark, Sequential Write Throughput (SWT) Benchmark, Sequential Read Throughput (SRT) Benchmark – Support benchmarking of • Apache Hadoop 1.x and 2.x HDFS, Hortonworks Data Platform (HDP) HDFS, Cloudera Distribution of Hadoop (CDH) HDFS • Micro-benchmarks for Memcached – Get Benchmark, Set Benchmark, and Mixed Get/Set Benchmark, Non-Blocking API Latency Benchmark, Hybrid Memory Latency Benchmark – Yahoo! Cloud Serving Benchmark (YCSB) Extension for RDMA-Memcached • Micro-benchmarks for HBase – Get Latency Benchmark, Put Latency Benchmark • Micro-benchmarks for Spark – GroupBy, SortBy • Current release: 0.9.3 • http://hibd.cse.ohio-state.edu OSU HiBD Micro-Benchmark (OHB) Suite – HDFS, Memcached, HBase, and Spark
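For illustration only, here is a hedged Python sketch in the spirit of the OHB sequential write/read latency tests; it is not the OHB suite itself. It goes through WebHDFS via the third-party hdfs package, so it measures the HTTP path rather than the RDMA-enhanced one, and the NameNode URL, user, path, and record size are placeholders.

```python
# Illustrative only -- not the OSU OHB suite. A tiny sequential-write/read probe
# against HDFS via WebHDFS, using the third-party "hdfs" Python package.
# The NameNode URL, user, and target path below are placeholders.
import time
from hdfs import InsecureClient

NAMENODE_URL = "http://namenode.example.com:9870"   # assumption: WebHDFS enabled here
TEST_PATH = "/tmp/ohb_like_probe.bin"
BLOCK = b"x" * (4 * 1024 * 1024)                    # 4 MB records
NUM_RECORDS = 64

client = InsecureClient(NAMENODE_URL, user="hadoop")

t0 = time.time()
with client.write(TEST_PATH, overwrite=True) as writer:
    for _ in range(NUM_RECORDS):
        writer.write(BLOCK)                         # sequential write
write_s = time.time() - t0

t0 = time.time()
with client.read(TEST_PATH) as reader:
    data = reader.read()                            # sequential read of the whole file
read_s = time.time() - t0

mb = len(data) / 1e6
print(f"write: {write_s:.2f}s ({mb / write_s:.1f} MB/s), "
      f"read: {read_s:.2f}s ({mb / read_s:.1f} MB/s)")
```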
  • 24. HPCAC-Stanford (Feb ’18) 24Network Based Computing Laboratory Using HiBD Packages on Existing HPC Infrastructure
  • 25. HPCAC-Stanford (Feb ’18) 25Network Based Computing Laboratory Using HiBD Packages on Existing HPC Infrastructure
  • 26. HPCAC-Stanford (Feb ’18) 26Network Based Computing Laboratory • RDMA for Apache Hadoop 2.x and RDMA for Apache Spark are installed and available on SDSC Comet. – Examples for various modes of usage are available in: • RDMA for Apache Hadoop 2.x: /share/apps/examples/HADOOP • RDMA for Apache Spark: /share/apps/examples/SPARK/ – Please email help@xsede.org (reference Comet as the machine, and SDSC as the site) if you have any further questions about usage and configuration. • RDMA for Apache Hadoop is also available on Chameleon Cloud as an appliance – https://www.chameleoncloud.org/appliances/17/ HiBD Packages on SDSC Comet and Chameleon Cloud M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE’16, July 2016
  • 27. HPCAC-Stanford (Feb ’18) 27Network Based Computing Laboratory • Basic Designs for HiBD Packages – HDFS and MapReduce – Spark – Memcached – Kafka • Deep Learning – TensorFlow with gRPC – Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..) • Cloud Support for Big Data Stacks – Virtualization with SR-IOV and Containers – Swift Object Store Acceleration Case Studies and Performance Evaluation
  • 28. HPCAC-Stanford (Feb ’18) 28Network Based Computing Laboratory • Enables high performance RDMA communication, while supporting traditional socket interface • JNI Layer bridges Java based HDFS with communication library written in native code Design Overview of HDFS with RDMA HDFS Verbs RDMA Capable Networks (IB, iWARP, RoCE ..) Applications 1/10/40/100 GigE, IPoIB Network Java Socket Interface Java Native Interface (JNI) WriteOthers OSU Design • Design Features – RDMA-based HDFS write – RDMA-based HDFS replication – Parallel replication support – On-demand connection setup – InfiniBand/RoCE support N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda , High Performance RDMA-Based Design of HDFS over InfiniBand , Supercomputing (SC), Nov 2012 N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014
  • 29. HPCAC-Stanford (Feb ’18) 29Network Based Computing Laboratory Triple-H Heterogeneous Storage • Design Features – Three modes • Default (HHH) • In-Memory (HHH-M) • Lustre-Integrated (HHH-L) – Policies to efficiently utilize the heterogeneous storage devices • RAM, SSD, HDD, Lustre – Eviction/Promotion based on data usage pattern – Hybrid Replication – Lustre-Integrated mode: • Lustre-based fault-tolerance Enhanced HDFS with In-Memory and Heterogeneous Storage Hybrid Replication Data Placement Policies Eviction/Promotion RAM Disk SSD HDD Lustre N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid ’15, May 2015 Applications
  • 30. HPCAC-Stanford (Feb ’18) 30Network Based Computing Laboratory Design Overview of MapReduce with RDMA MapReduce Verbs RDMA Capable Networks (IB, iWARP, RoCE ..) OSU Design Applications 1/10/40/100 GigE, IPoIB Network Java Socket Interface Java Native Interface (JNI) Job Tracker Task Tracker Map Reduce • Enables high performance RDMA communication, while supporting traditional socket interface • JNI Layer bridges Java based MapReduce with communication library written in native code • Design Features – RDMA-based shuffle – Prefetching and caching map output – Efficient Shuffle Algorithms – In-memory merge – On-demand Shuffle Adjustment – Advanced overlapping • map, shuffle, and merge • shuffle, merge, and reduce – On-demand connection setup – InfiniBand/RoCE support M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014
  • 31. HPCAC-Stanford (Feb ’18) 31 Network Based Computing Laboratory Performance Numbers of RDMA for Apache Hadoop 2.x – RandomWriter & TeraGen in OSU-RI2 (EDR) [Charts: Execution Time (s) vs. Data Size (GB), IPoIB (EDR) vs. OSU-IB (EDR); RandomWriter reduced by 3x, TeraGen reduced by 4x] Cluster with 8 Nodes with a total of 64 maps • RandomWriter – 3x improvement over IPoIB for 80-160 GB file size • TeraGen – 4x improvement over IPoIB for 80-240 GB file size
  • 32. HPCAC-Stanford (Feb ’18) 32 Network Based Computing Laboratory Performance Numbers of RDMA for Apache Hadoop 2.x – Sort & TeraSort in OSU-RI2 (EDR) [Charts: Execution Time (s) vs. Data Size (GB), IPoIB (EDR) vs. OSU-IB (EDR); Sort reduced by 61%, TeraSort reduced by 18%] Clusters with 8 Nodes with a total of 64 maps and 32 reduces, and with a total of 64 maps and 14 reduces • Sort – 61% improvement over IPoIB for 80-160 GB data • TeraSort – 18% improvement over IPoIB for 80-240 GB data
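The RandomWriter, TeraGen, Sort, and TeraSort numbers above come from the stock Hadoop example jobs. A hedged way to drive and time that kind of run from Python is sketched below; the examples-jar path, HDFS paths, and row count are placeholders, and the hadoop client on the PATH must already point at the installation (RDMA-Hadoop or stock) being measured.

```python
# Hedged sketch: time the stock Hadoop TeraGen/TeraSort examples, the same kind of
# jobs behind the numbers above. Jar path, HDFS paths, and row count are placeholders.
import subprocess
import time

EXAMPLES_JAR = "/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar"  # assumption
ROWS = 10 * 1000 * 1000          # each TeraGen row is 100 bytes -> roughly 1 GB here

def run(args):
    start = time.time()
    subprocess.run(["hadoop", "jar", EXAMPLES_JAR] + args, check=True)
    return time.time() - start

gen_s = run(["teragen", str(ROWS), "/user/test/tera-in"])
sort_s = run(["terasort", "/user/test/tera-in", "/user/test/tera-out"])
print(f"TeraGen: {gen_s:.1f}s, TeraSort: {sort_s:.1f}s")
```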
  • 33. HPCAC-Stanford (Feb ’18) 33 Network Based Computing Laboratory Evaluation with Spark on SDSC Gordon (HHH vs. Tachyon/Alluxio) • For 200GB TeraGen on 32 nodes – Spark-TeraGen: HHH has 2.4x improvement over Tachyon; 2.3x over HDFS-IPoIB (QDR) – Spark-TeraSort: HHH has 25.2% improvement over Tachyon; 17% over HDFS-IPoIB (QDR) [Charts: Execution Time (s) vs. Cluster Size : Data Size (GB) at 8:50, 16:100, 32:200 for IPoIB (QDR), Tachyon, OSU-IB (QDR); TeraGen reduced by 2.4x, TeraSort reduced by 25.2%] N. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In-Memory File Systems for Hadoop and Spark Applications on HPC Clusters, IEEE BigData ’15, October 2015
  • 34. HPCAC-Stanford (Feb ’18) 34Network Based Computing Laboratory • Basic Designs for HiBD Packages – HDFS and MapReduce – Spark – Memcached – Kafka • Deep Learning – TensorFlow with gRPC – Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..) • Cloud Support for Big Data Stacks – Virtualization with SR-IOV and Containers – Swift Object Store Acceleration Case Studies and Performance Evaluation
  • 35. HPCAC-Stanford (Feb ’18) 35Network Based Computing Laboratory • Design Features – RDMA based shuffle plugin – SEDA-based architecture – Dynamic connection management and sharing – Non-blocking data transfer – Off-JVM-heap buffer management – InfiniBand/RoCE support Design Overview of Spark with RDMA • Enables high performance RDMA communication, while supporting traditional socket interface • JNI Layer bridges Scala based Spark with communication library written in native code X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014 X. Lu, D. Shankar, S. Gugnani, and D. K. Panda, High-Performance Design of Apache Spark with RDMA and Its Benefits on Various Workloads, IEEE BigData ‘16, Dec. 2016. Spark Core RDMA Capable Networks (IB, iWARP, RoCE ..) Apache Spark Benchmarks/Applications/Libraries/Frameworks 1/10/40/100 GigE, IPoIB Network Java Socket Interface Java Native Interface (JNI) Native RDMA-based Comm. Engine Shuffle Manager (Sort, Hash, Tungsten-Sort) Block Transfer Service (Netty, NIO, RDMA-Plugin) Netty Server NIO Server RDMA Server Netty Client NIO Client RDMA Client
  • 36. HPCAC-Stanford (Feb ’18) 36 Network Based Computing Laboratory Performance Evaluation on SDSC Comet – SortBy/GroupBy • InfiniBand FDR, SSD, 64 Worker Nodes, 1536 Cores (1536M 1536R) • RDMA vs. IPoIB with 1536 concurrent tasks, single SSD per node – SortBy: Total time reduced by up to 80% over IPoIB (56Gbps) – GroupBy: Total time reduced by up to 74% over IPoIB (56Gbps) [Charts: SortByTest and GroupByTest total Time (sec) vs. Data Size (GB) at 64/128/256 GB, IPoIB vs. RDMA; 80% and 74% reductions]
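A minimal PySpark sketch of the kind of shuffle-heavy SortBy/GroupBy job exercised above follows; record count and partition count are placeholders. The RDMA shuffle plugin targets exactly the shuffle stage these actions trigger, so the job itself is unchanged between the IPoIB and RDMA runs.

```python
# Minimal PySpark sketch of a shuffle-heavy SortBy/GroupBy job, in the spirit of the
# tests above. Data size and parallelism are placeholders; run with spark-submit.
import random
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sortby-groupby-sketch").getOrCreate()
sc = spark.sparkContext

NUM_RECORDS = 10_000_000
PARTITIONS = 256

pairs = sc.parallelize(range(NUM_RECORDS), PARTITIONS) \
          .map(lambda i: (random.randint(0, 100_000), i))

t0 = time.time()
pairs.sortByKey().count()                   # forces a range-partitioned shuffle
print(f"sortByKey: {time.time() - t0:.1f}s")

t0 = time.time()
pairs.groupByKey().mapValues(len).count()   # forces a hash-partitioned shuffle
print(f"groupByKey: {time.time() - t0:.1f}s")

spark.stop()
```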
  • 37. HPCAC-Stanford (Feb ’18) 37 Network Based Computing Laboratory Performance Evaluation on SDSC Comet – HiBench PageRank • InfiniBand FDR, SSD, 32/64 Worker Nodes, 768/1536 Cores (768/1536M 768/1536R) • RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node – 32 nodes/768 cores: Total time reduced by 37% over IPoIB (56Gbps) – 64 nodes/1536 cores: Total time reduced by 43% over IPoIB (56Gbps) [Charts: PageRank total Time (sec) vs. data size (Huge, BigData, Gigantic), IPoIB vs. RDMA, on 32 and 64 worker nodes; 37% and 43% reductions]
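For reference, a compact PySpark PageRank in the spirit of the HiBench workload is sketched below; the edge list is a toy placeholder, whereas HiBench generates much larger synthetic graphs. Each iteration joins and re-aggregates the rank contributions, which is where the shuffle traffic the RDMA design accelerates comes from.

```python
# Compact PageRank sketch in PySpark, similar in spirit to the HiBench workload above.
# The edge list is a toy placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pagerank-sketch").getOrCreate()
sc = spark.sparkContext

edges = sc.parallelize([(1, 2), (1, 3), (2, 3), (3, 1), (4, 3)])
links = edges.groupByKey().cache()            # node -> iterable of out-neighbours
ranks = links.mapValues(lambda _: 1.0)        # initial rank of 1.0 per node

for _ in range(10):                           # fixed number of iterations
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
    ranks = contribs.reduceByKey(lambda a, b: a + b) \
                    .mapValues(lambda r: 0.15 + 0.85 * r)

print(sorted(ranks.collect()))
spark.stop()
```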
  • 38. HPCAC-Stanford (Feb ’18) 38 Network Based Computing Laboratory Application Evaluation on SDSC Comet • Kira Toolkit: Distributed astronomy image processing toolkit implemented using Apache Spark – https://github.com/BIDS/Kira • Source extractor application, using a 65GB dataset from the SDSS DR2 survey that comprises 11,150 image files [Chart: Execution times (sec) for Kira SE benchmark using 65 GB dataset, 48 cores; RDMA Spark vs. Apache Spark (IPoIB), 21% improvement] • BigDL: Distributed Deep Learning Tool using Apache Spark – https://github.com/intel-analytics/BigDL • VGG training model on the CIFAR-10 dataset [Chart: One Epoch Time (sec) vs. Number of cores (24–384), IPoIB vs. RDMA; up to 4.58x] M. Tatineni, X. Lu, D. J. Choi, A. Majumdar, and D. K. Panda, Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet, XSEDE’16, July 2016
  • 39. HPCAC-Stanford (Feb ’18) 39 Network Based Computing Laboratory Accelerating Hadoop and Spark with RDMA over OpenPOWER • Challenges – Performance characteristics of RDMA-based Hadoop and Spark over OpenPOWER systems with IB networks – Can RDMA-based Big Data middleware demonstrate good performance benefits on OpenPOWER systems? – Any new accelerations for Big Data middleware on OpenPOWER systems to attain better benefits? • Results – For the Spark SortBy benchmark, RDMA-Opt outperforms IPoIB and RDMA-Def by 23.5% and 10.5% [Chart: Spark SortBy Latency (sec) at 32/64/96 GB, IPoIB vs. RDMA-Def vs. RDMA-Opt; reduced by 23.5%] [Diagram: IBM POWER8 Architecture – two sockets of SMT cores with a chip interconnect, memory controllers to DDR4 memory, and a PCIe interface to the IB HCA] X. Lu, H. Shi, D. Shankar and D. K. Panda, Performance Characterization and Acceleration of Big Data Workloads on OpenPOWER System, IEEE BigData 2017.
  • 40. HPCAC-Stanford (Feb ’18) 40Network Based Computing Laboratory • Basic Designs for HiBD Packages – HDFS and MapReduce – Spark – Memcached – Kafka • Deep Learning – TensorFlow with gRPC – Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..) • Cloud Support for Big Data Stacks – Virtualization with SR-IOV and Containers – Swift Object Store Acceleration Case Studies and Performance Evaluation
  • 41. HPCAC-Stanford (Feb ’18) 41 Network Based Computing Laboratory Memcached Performance (FDR Interconnect) – Experiments on TACC Stampede (Intel SandyBridge Cluster, IB: FDR) • Memcached GET latency – 4 bytes OSU-IB: 2.84 us; IPoIB: 75.53 us; 2K bytes OSU-IB: 4.49 us; IPoIB: 123.42 us – latency reduced by nearly 20X • Memcached throughput (4 bytes) – 4080 clients OSU-IB: 556 Kops/sec, IPoIB: 233 Kops/sec – nearly 2X improvement in throughput [Charts: GET latency Time (us) vs. Message Size (1 B–4 KB); Thousands of Transactions per Second (TPS) vs. No. of Clients (16–4080)] J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP’11; J. Jose, H. Subramoni, K. Kandalla, M. W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached Design for InfiniBand Clusters Using Hybrid Transport, CCGrid’12
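A tiny GET-latency probe over the standard sockets path is sketched below using the pymemcache client; the server address, value size, and iteration count are placeholders. RDMA-Memcached keeps the same set/get semantics while replacing the transport underneath, which is why the client-side measurement pattern stays the same.

```python
# Tiny GET-latency probe over the traditional sockets path, using pymemcache.
# Server address, value size, and iteration count are placeholders.
import time
from pymemcache.client.base import Client

client = Client(("memcached.example.com", 11211))   # assumption: a reachable server
VALUE = b"x" * 2048                                  # 2 KB value, as in the latency plot
ITERS = 10_000

client.set("probe-key", VALUE)                       # warm the cache once

start = time.time()
for _ in range(ITERS):
    client.get("probe-key")                          # repeated GETs of the same key
elapsed = time.time() - start
print(f"avg GET latency: {elapsed / ITERS * 1e6:.1f} us")
```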
  • 42. HPCAC-Stanford (Feb ’18) 42 Network Based Computing Laboratory Micro-benchmark Evaluation for OLDP Workloads • Illustration with Read-Cache-Read access pattern using a modified mysqlslap load testing tool • Memcached-RDMA can – improve query latency by up to 66% over IPoIB (32Gbps) – improve throughput by up to 69% over IPoIB (32Gbps) [Charts: Latency (sec) and Throughput (Kq/s) vs. No. of Clients (64–400), Memcached-IPoIB (32Gbps) vs. Memcached-RDMA (32Gbps); latency reduced by 66%] D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, Can RDMA Benefit On-Line Data Processing Workloads with Memcached and MySQL, ISPASS’15
  • 43. HPCAC-Stanford (Feb ’18) 43 Network Based Computing Laboratory Performance Evaluation with Non-Blocking Memcached API – Data does not fit in memory: Non-blocking Memcached Set/Get API extensions can achieve • >16x latency improvement vs. blocking API over RDMA-Hybrid/RDMA-Mem w/ penalty • >2.5x throughput improvement vs. blocking API over default/optimized RDMA-Hybrid – Data fits in memory: Non-blocking extensions perform similar to RDMA-Mem/RDMA-Hybrid and >3.6x improvement over IPoIB-Mem [Chart: Average Latency (us) for Set/Get under IPoIB-Mem, RDMA-Mem, H-RDMA-Def, H-RDMA-Opt-Block, H-RDMA-Opt-NonB-i, H-RDMA-Opt-NonB-b; stacked components: Miss Penalty (backend DB access overhead), Client Wait, Server Response, Cache Update, Cache check + Load (memory and/or SSD read), Slab Allocation (w/ SSD write on out-of-mem)] Legend: H = Hybrid Memcached over SATA SSD; Opt = Adaptive slab manager; Block = Default blocking API; NonB-i = Non-blocking iset/iget API; NonB-b = Non-blocking bset/bget API w/ buffer re-use guarantee
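The iset/iget/bset/bget extensions are specific to the RDMA-Memcached libMemcached API, so the sketch below is only an analogy: an asyncio + aiomcache example of issuing sets and gets without blocking so the client can overlap cache traffic with other work. The server address, key count, and value size are placeholders, and the real non-blocking extensions additionally manage buffer re-use as described in the legend above.

```python
# Analogy only -- the iset/iget/bset/bget extensions belong to RDMA-Memcached.
# This hedged asyncio + aiomcache sketch shows the general non-blocking idea:
# issue set/get operations without waiting on each round trip.
import asyncio
import aiomcache

async def main():
    client = aiomcache.Client("127.0.0.1", 11211)    # placeholder server address
    values = {f"k{i}".encode(): b"v" * 1024 for i in range(100)}

    # Launch all sets concurrently instead of blocking on each one.
    await asyncio.gather(*(client.set(k, v) for k, v in values.items()))

    # Likewise overlap the gets; results arrive as responses come back.
    results = await asyncio.gather(*(client.get(k) for k in values))
    print(f"fetched {sum(r is not None for r in results)} values")

asyncio.run(main())
```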
  • 44. HPCAC-Stanford (Feb ’18) 44 Network Based Computing Laboratory RDMA-Kafka: High-Performance Message Broker for Streaming Workloads – Kafka Producer Benchmark (1 Broker, 4 Producers) • Experiments run on OSU-RI2 cluster • 2.4GHz 28 cores, InfiniBand EDR, 512 GB RAM, 400GB SSD – Up to 98% improvement in latency compared to IPoIB – Up to 7x increase in throughput over IPoIB [Charts: Throughput (MB/s) and Latency (ms) vs. Record Size (100–100000), RDMA vs. IPoIB; 98% and 7x] H. Javed, X. Lu, and D. K. Panda, Characterization of Big Data Stream Processing Pipeline: A Case Study Using Flink and Kafka, BDCAT’17
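The editor's notes below explain that producer latency here is the time from sending a record to receiving the broker's acknowledgment. A hedged kafka-python sketch of that measurement follows; the broker address, topic, record size, and iteration count are placeholders.

```python
# Hedged sketch of a producer send-to-ack latency probe with kafka-python, mirroring
# what the Kafka producer benchmark above measures. Broker, topic, and record size
# are placeholders.
import time
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="broker.example.com:9092", acks=1)
RECORD = b"x" * 1000
ITERS = 1000
latencies = []

for _ in range(ITERS):
    start = time.time()
    producer.send("bench-topic", RECORD).get(timeout=10)   # block until the broker ack
    latencies.append(time.time() - start)

producer.flush()
print(f"avg send-to-ack latency: {sum(latencies) / len(latencies) * 1e3:.2f} ms")
```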
  • 45. HPCAC-Stanford (Feb ’18) 45Network Based Computing Laboratory • Basic Designs for HiBD Packages – HDFS and MapReduce – Spark – Memcached – Kafka • Deep Learning – TensorFlow with gRPC – Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..) • Cloud Support for Big Data Stacks – Virtualization with SR-IOV and Containers – Swift Object Store Acceleration Case Studies and Performance Evaluation
  • 46. HPCAC-Stanford (Feb ’18) 46 Network Based Computing Laboratory Overview of RDMA-gRPC with TensorFlow – Worker services communicate with each other using RDMA-gRPC [Diagram: Client → Master → Worker /job:PS/task:0 and Worker /job:Worker/task:0, each with CPU/GPU and an RDMA-gRPC server/client]
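The diagram shows a parameter server and workers connected through gRPC channels. A hedged, TensorFlow 1.x-era sketch of the standard ClusterSpec/Server wiring that this channel serves (matching the talk's timeframe) is below; hostnames, ports, and task indices are placeholders, and the AR-gRPC design accelerates this same channel with RDMA underneath rather than changing the user-visible setup.

```python
# Hedged, TensorFlow-1.x-era sketch of the parameter-server/worker wiring that the
# gRPC channel in the diagram serves. Hostnames and ports are placeholders.
import tensorflow as tf  # assumes a TF 1.x installation, matching the talk's timeframe

cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Each process starts one in-process gRPC server for its role/task; for example,
# on the PS node:
server = tf.train.Server(cluster, job_name="ps", task_index=0)
server.join()   # the PS simply serves variable reads/updates over gRPC
```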
  • 47. HPCAC-Stanford (Feb ’18) 47 Network Based Computing Laboratory Performance Benefits for RDMA-gRPC with Micro-Benchmark – RDMA-gRPC RPC Latency • AR-gRPC (OSU design) latency on SDSC-Comet-FDR – Up to 2.7x speedup over IPoIB for small messages – Up to 2.8x speedup over IPoIB for medium messages – Up to 2.5x speedup over IPoIB for large messages [Charts: Latency (us) vs. payload size for small (2 B–8 KB), medium (16 KB–512 KB), and large (1 MB–8 MB) messages, Default gRPC vs. AR-gRPC] R. Biswas, X. Lu, and D. K. Panda, Characterizing and Accelerating TensorFlow: Can Adaptive and RDMA-based RPC Benefit It? Under Review
  • 48. HPCAC-Stanford (Feb ’18) 48 Network Based Computing Laboratory Performance Benefit for TensorFlow • TensorFlow performance evaluation on an IB EDR cluster – Up to 19% performance speedup over IPoIB for sigmoid net (20 epochs) – Up to 35% and 31% performance speedup over IPoIB for resnet50 and inception3 (batch size 8) [Charts: sigmoid net Time (seconds, lower is better) and CNN Images/Second (higher is better) for resnet50/inception3 on 4 Nodes (1 PS, 3 Workers), Default gRPC vs. gRPC + Verbs vs. gRPC + MPI vs. AR-gRPC; 19%, 35%, 31%]
  • 49. HPCAC-Stanford (Feb ’18) 49Network Based Computing Laboratory • Basic Designs for HiBD Packages – HDFS and MapReduce – Spark – Memcached – Kafka • Deep Learning – TensorFlow with gRPC – Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..) • Cloud Support for Big Data Stacks – Virtualization with SR-IOV and Containers – Swift Object Store Acceleration Case Studies and Performance Evaluation
  • 50. HPCAC-Stanford (Feb ’18) 50 Network Based Computing Laboratory Overview of DLoBD Stacks • Layers of DLoBD stacks – Deep learning application layer – Deep learning library layer – Big data analytics framework layer – Resource scheduler layer – Distributed file system layer – Hardware resource layer • How much performance benefit can we achieve for end deep learning applications? [Diagram note: sub-optimal performance?]
  • 51. HPCAC-Stanford (Feb ’18) 51 Network Based Computing Laboratory Performance Evaluation for IPoIB and RDMA with CaffeOnSpark and TensorFlowOnSpark (IB EDR) • CaffeOnSpark benefits from the high performance of RDMA compared to IPoIB once communication overhead becomes significant. • Our experiments show that the default RDMA design in TensorFlowOnSpark is not fully optimized yet. For MNIST tests, RDMA is not showing obvious benefits. [Charts: End-to-End Time (secs), IPoIB vs. RDMA – CaffeOnSpark (CIFAR-10 Quick and LeNet on MNIST, GPU and GPU-cuDNN, 2/4/8/16 nodes; 14.2%, 13.3%, 51.2%, 45.6% improvements) and TensorFlowOnSpark (CIFAR10 with 2 GPUs/node and MNIST with 1 GPU/node, 2/4/8 GPUs; 33.01%, 4.9%)]
  • 52. HPCAC-Stanford (Feb ’18) 52 Network Based Computing Laboratory Epoch-Level Evaluation with BigDL on SDSC Comet – VGG on CIFAR-10 with BigDL • Epoch-level evaluation of training the VGG model using BigDL on default Spark with IPoIB and our RDMA-based Spark • The RDMA version consistently takes less time than the IPoIB version to finish every epoch – RDMA finishes epoch 18 2.6x faster than IPoIB [Chart: Accumulative Epochs Time (secs) and Accuracy (%) vs. Epoch Number (1–18), IPoIB-Time/RDMA-Time and IPoIB-Accuracy/RDMA-Accuracy; 2.6X]
  • 53. HPCAC-Stanford (Feb ’18) 53Network Based Computing Laboratory • Basic Designs for HiBD Packages – HDFS and MapReduce – Spark – Memcached – Kafka • Deep Learning – TensorFlow with gRPC – Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..) • Cloud Support for Big Data Stacks – Virtualization with SR-IOV and Containers – Swift Object Store Acceleration Case Studies and Performance Evaluation
  • 54. HPCAC-Stanford (Feb ’18) 54 Network Based Computing Laboratory Overview of RDMA-Hadoop-Virt Architecture • Virtualization-aware modules in all four main Hadoop components: – HDFS: Virtualization-aware Block Management to improve fault-tolerance – YARN: Extensions to Container Allocation Policy to reduce network traffic – MapReduce: Extensions to Map Task Scheduling Policy to reduce network traffic – Hadoop Common: Topology Detection Module for automatic topology detection • Communications in HDFS, MapReduce, and RPC go through RDMA-based designs over SR-IOV enabled InfiniBand [Diagram: Big Data applications (CloudBurst, MR-MS Polygraph, others) over HDFS, YARN, MapReduce, Hadoop Common, HBase with the Topology Detection Module, Map Task Scheduling Policy Extension, Container Allocation Policy Extension, and Virtualization-Aware Block Management, running on Virtual Machines, Containers, and Bare-Metal nodes] S. Gugnani, X. Lu, D. K. Panda. Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-enabled Clouds. CloudCom, 2016.
  • 55. HPCAC-Stanford (Feb ’18) 55 Network Based Computing Laboratory Evaluation with Applications – 14% and 24% improvement with Default Mode for CloudBurst and Self-Join – 30% and 55% improvement with Distributed Mode for CloudBurst and Self-Join [Charts: Execution Time for CloudBurst and Self-Join in Default Mode and Distributed Mode, RDMA-Hadoop vs. RDMA-Hadoop-Virt; 30% and 55% reductions] Will be released in the future
  • 56. HPCAC-Stanford (Feb ’18) 56Network Based Computing Laboratory • Basic Designs for HiBD Packages – HDFS and MapReduce – Spark – Memcached – Kafka • Deep Learning – TensorFlow with gRPC – Deep Learning over Big Data (Caffe on Spark, Tensorflow on Spark, ..) • Cloud Support for Big Data Stacks – Virtualization with SR-IOV and Containers – Swift Object Store Acceleration Case Studies and Performance Evaluation
  • 57. HPCAC-Stanford (Feb ’18) 57 Network Based Computing Laboratory OpenStack Swift Overview • Distributed cloud-based object storage service • Deployed as part of an OpenStack installation • Can also be deployed as a standalone storage solution • Worldwide data access via the Internet – HTTP-based • Architecture – Multiple Object Servers: store data – A few Proxy Servers: act as a proxy for all requests – Ring: handles metadata • Usage – Input/output source for Big Data applications (most common use case) – Software/data backup – Storage of VM/Docker images • Based on traditional TCP sockets communication [Diagram: Swift Architecture – clients send PUT/GET /v1/<account>/<container>/<object> requests to a Proxy Server, which uses the Ring to route them to Object Servers with local disks]
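The PUT/GET path shown in the architecture maps directly onto plain HTTP. A hedged sketch of uploading and fetching an object against that path with the requests library follows; the storage URL, auth token, container, and object names are placeholders (in a real deployment they come from the auth step), and the container is assumed to already exist.

```python
# Hedged sketch: object PUT/GET against the Swift path format shown above, over
# plain HTTP. Storage URL, token, container, and object names are placeholders.
import requests

STORAGE_URL = "http://proxy.example.com:8080/v1/AUTH_test"   # .../v1/<account>
TOKEN = "placeholder-auth-token"                             # from the auth step
HEADERS = {"X-Auth-Token": TOKEN}

obj_url = f"{STORAGE_URL}/mycontainer/myobject"              # <container>/<object>

# PUT: upload an object; the proxy server routes it to object servers via the ring.
resp = requests.put(obj_url, data=b"hello swift", headers=HEADERS)
resp.raise_for_status()

# GET: download it back through the same path.
resp = requests.get(obj_url, headers=HEADERS)
resp.raise_for_status()
print(resp.content)
```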
  • 58. HPCAC-Stanford (Feb ’18) 58Network Based Computing Laboratory • Challenges – Proxy server is a bottleneck for large scale deployments – Object upload/download operations network intensive – Can an RDMA-based approach benefit? • Design – Re-designed Swift architecture for improved scalability and performance; Two proposed designs: • Client-Oblivious Design: No changes required on the client side • Metadata Server-based Design: Direct communication between client and object servers; bypass proxy server – RDMA-based communication framework for accelerating networking performance – High-performance I/O framework to provide maximum overlap between communication and I/O Swift-X: Accelerating OpenStack Swift with RDMA for Building Efficient HPC Clouds S. Gugnani, X. Lu, and D. K. Panda, Swift-X: Accelerating OpenStack Swift with RDMA for Building an Efficient HPC Cloud, accepted at CCGrid’17, May 2017 Client-Oblivious Design (D1) Metadata Server-based Design (D2)
  • 59. HPCAC-Stanford (Feb ’18) 59 Network Based Computing Laboratory Swift-X: Accelerating OpenStack Swift with RDMA for Building Efficient HPC Clouds • Up to 66% reduction in GET latency • Communication time reduced by up to 3.8x for PUT and up to 2.8x for GET [Charts: time breakup of GET and PUT – Latency (s) split into Communication, I/O, Hashsum, Other for Swift, Swift-X (D1), Swift-X (D2); GET latency evaluation – Latency (s) vs. object size (1MB–4GB) for Swift, Swift-X (D1), Swift-X (D2); reduced by 66%]
  • 60. HPCAC-Stanford (Feb ’18) 60Network Based Computing Laboratory • Discussed challenges in accelerating BigData and Deep Learning Stacks with HPC technologies • Presented basic and advanced designs to take advantage of InfiniBand/RDMA for HDFS, MapReduce, HBase, Spark, Memcached, gRPC, and TensorFlow on HPC clusters and clouds • Results are promising for Big Data processing and the associated Deep Learning tools • Many other open issues need to be solved • Will enable Big Data and Deep Learning communities to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner Concluding Remarks
  • 61. HPCAC-Stanford (Feb ’18) 61Network Based Computing Laboratory Funding Acknowledgments Funding Support by Equipment Support by
  • 62. HPCAC-Stanford (Feb ’18) 62Network Based Computing Laboratory Personnel Acknowledgments Current Students (Graduate) – A. Awan (Ph.D.) – R. Biswas (M.S.) – M. Bayatpour (Ph.D.) – S. Chakraborthy (Ph.D.) – C.-H. Chu (Ph.D.) – S. Guganani (Ph.D.) Past Students – A. Augustine (M.S.) – P. Balaji (Ph.D.) – S. Bhagvat (M.S.) – A. Bhat (M.S.) – D. Buntinas (Ph.D.) – L. Chai (Ph.D.) – B. Chandrasekharan (M.S.) – N. Dandapanthula (M.S.) – V. Dhanraj (M.S.) – T. Gangadharappa (M.S.) – K. Gopalakrishnan (M.S.) – R. Rajachandrasekar (Ph.D.) – G. Santhanaraman (Ph.D.) – A. Singh (Ph.D.) – J. Sridhar (M.S.) – S. Sur (Ph.D.) – H. Subramoni (Ph.D.) – K. Vaidyanathan (Ph.D.) – A. Vishnu (Ph.D.) – J. Wu (Ph.D.) – W. Yu (Ph.D.) Past Research Scientist – K. Hamidouche – S. Sur Past Post-Docs – D. Banerjee – X. Besseron – H.-W. Jin – W. Huang (Ph.D.) – W. Jiang (M.S.) – J. Jose (Ph.D.) – S. Kini (M.S.) – M. Koop (Ph.D.) – K. Kulkarni (M.S.) – R. Kumar (M.S.) – S. Krishnamoorthy (M.S.) – K. Kandalla (Ph.D.) – M. Li (Ph.D.) – P. Lai (M.S.) – J. Liu (Ph.D.) – M. Luo (Ph.D.) – A. Mamidala (Ph.D.) – G. Marsh (M.S.) – V. Meshram (M.S.) – A. Moody (M.S.) – S. Naravula (Ph.D.) – R. Noronha (Ph.D.) – X. Ouyang (Ph.D.) – S. Pai (M.S.) – S. Potluri (Ph.D.) – J. Hashmi (Ph.D.) – H. Javed (Ph.D.) – P. Kousha (Ph.D.) – D. Shankar (Ph.D.) – H. Shi (Ph.D.) – J. Zhang (Ph.D.) – J. Lin – M. Luo – E. Mancini Current Research Scientists – X. Lu – H. Subramoni Past Programmers – D. Bureddy – J. Perkins Current Research Specialist – J. Smith – M. Arnold – S. Marcarelli – J. Vienne – H. Wang Current Post-doc – A. Ruhela Current Students (Undergraduate) – N. Sarkauskas (B.S.)
  • 63. HPCAC-Stanford (Feb ’18) 63Network Based Computing Laboratory The 4th International Workshop on High-Performance Big Data Computing (HPBDC) HPBDC 2018 will be held with the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2018), Vancouver, British Columbia, Canada, May 2018 http://web.cse.ohio-state.edu/~luxi/hpbdc2018 HPBDC 2017 was held in conjunction with IPDPS’17 Keynote Talk: Prof. Satoshi Matsuoka, Converging HPC and Big Data / AI Infrastructures at Scale with BYTES-Oriented Architectures Panel Topic: Sunrise or Sunset: Exploring the Design Space of Big Data Software Stack Six Regular Research Papers and One Short Research Paper http://web.cse.ohio-state.edu/~luxi/hpbdc2017 HPBDC 2016 was held in conjunction with IPDPS’16 Keynote Talk: Dr. Chaitanya Baru, High Performance Computing and Which Big Data? Panel Topic: Merge or Split: Mutual Influence between Big Data and HPC Techniques Six Regular Research Papers and Two Short Research Papers http://web.cse.ohio-state.edu/~luxi/hpbdc2016
  • 64. HPCAC-Stanford (Feb ’18) 64Network Based Computing Laboratory panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Thank You! Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/ The High-Performance Big Data Project http://hibd.cse.ohio-state.edu/

Editor's Notes

  1. The DLoBD approach can easily integrate deep learning components with other big data processing components in the whole workflow: (1) prepare the dataset for training, (2) DL at scale: extract high-level features, (3) non-DL at scale: traditional classification analytics, (4) apply the trained models to billions of pictures.
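A toy PySpark sketch of how these four stages compose is shown below. The feature extractor and classifier are stand-in functions rather than real deep learning models, and the dataset, threshold, and application name are made up for illustration.

```python
# Illustrative composition of the four DLoBD stages on Spark; not a real model.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dlobd-pipeline-sketch").getOrCreate()
sc = spark.sparkContext

# (1) Prepare the dataset at scale (here: a toy in-memory collection).
images = sc.parallelize([("img-%04d" % i, [float(i)] * 8) for i in range(1000)])

# (2) DL at scale: extract high-level features (stand-in for Caffe/TensorFlow).
def extract_features(pixels):
    return [sum(pixels) / len(pixels)]  # pretend embedding

features = images.mapValues(extract_features)

# (3) Non-DL at scale: traditional classification analytics on the features.
def classify(feature_vec):
    return "cat" if feature_vec[0] > 500.0 else "other"

labeled = features.mapValues(classify)

# (4) Apply the model at scale; here we just sample a few results.
print(labeled.take(5))
spark.stop()
```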
  2. (1) Modern processors have hardware-based virtualization support. (2) Multi-core processors and large-memory nodes have enabled a large number of VMs to be deployed on a single node. (3) HPC clouds are often deployed with InfiniBand with SR-IOV support. (4) They also have SSDs and object storage clusters such as OpenStack Swift, which often use SSDs for backend storage. (5) Many large-scale cloud deployments, such as Microsoft Azure, SoftLayer (an IBM company), Oracle Cloud, and Chameleon Cloud, provide support for InfiniBand and SR-IOV. (6) In fact, all our evaluations are done on Chameleon Cloud.
  3. OmniPath can be added into the figure!
  4. So far, by default it is hybrid replication: two replicas in RAM and one on SSD/disk.
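As a purely hypothetical illustration of that default policy (the real placement logic lives inside the RDMA-enhanced HDFS code, which is not shown here), a replica-placement helper might look like this:

```python
# Hypothetical sketch of a "2 replicas in RAM, 1 on SSD/disk" placement policy.
def hybrid_placement(replication_factor=3, ram_replicas=2):
    """Return the storage tier assigned to each replica of a block."""
    return ["RAM" if i < ram_replicas else "SSD_OR_DISK"
            for i in range(replication_factor)]

print(hybrid_placement())  # ['RAM', 'RAM', 'SSD_OR_DISK']
```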
  5. new
  6. new
  7. Tachyon-Cache mode: Write-Through-Cache does not work well for TeraSort.
  8. High performance and ecosystem compatibility.
  9. The latency graph uses a 1-client, 1-server configuration, and the scalability results use 4 servers. All these experiments were run with the default memory configuration (64 MB). The TPS benchmark was used for the scalability experiments.
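For reference, a rough sketch of what such a 1-client/1-server latency measurement looks like is given below. It uses pymemcache with an assumed local server address, value size, and iteration count; it is not the actual benchmark behind the numbers above.

```python
# Rough latency/TPS measurement sketch against a single memcached server.
import time
from pymemcache.client.base import Client

client = Client(("127.0.0.1", 11211))   # assumes a local memcached instance
value = b"x" * 4096                     # assumed value size

set_lat, get_lat = [], []
for i in range(1000):
    key = "key-%d" % i

    t0 = time.perf_counter()
    client.set(key, value)
    set_lat.append(time.perf_counter() - t0)

    t0 = time.perf_counter()
    client.get(key)
    get_lat.append(time.perf_counter() - t0)

print("avg SET latency: %.1f us" % (1e6 * sum(set_lat) / len(set_lat)))
print("avg GET latency: %.1f us" % (1e6 * sum(get_lat) / len(get_lat)))
print("approx GET TPS:  %.0f" % (len(get_lat) / sum(get_lat)))
```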
  10. Up to 40 nodes, 10 concurrent threads per node, 20 queries per client
  11. Numbers were taken using the provided Kafka Producer Performance benchmark, which emits records to Kafka as fast as it possibly can. The latency of a record is calculated as the time from when the record is sent by the Producer to the Broker until its acknowledgment is sent by the Broker and received at the Producer.
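The sketch below illustrates that latency definition with kafka-python, using an assumed broker address, topic name, and record size. Unlike the real producer performance benchmark, which sends asynchronously as fast as it can, this sketch sends one record at a time purely to make the send-to-acknowledgment interval visible.

```python
# Measure per-record send-to-ack latency at the producer (illustrative only).
import time
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092", acks=1)
record = b"x" * 100                      # assumed record size

latencies = []
for _ in range(1000):
    t0 = time.perf_counter()
    future = producer.send("perf-test", record)  # hypothetical topic name
    future.get(timeout=10)               # blocks until the broker's ack arrives
    latencies.append(time.perf_counter() - t0)

producer.flush()
print("avg record latency: %.2f ms" % (1e3 * sum(latencies) / len(latencies)))
```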
  12. On CaffeOnSpark, the performance of 16 nodes with GPU and GPU + cuDNN over RDMA is improved by 14.2% and 13.3%, respectively, in training the CIFAR-10 Quick model, and by 51.2% and 45.6%, respectively, in training the LeNet model. On TensorFlowOnSpark, RDMA outperforms IPoIB by 53.8% for 8 GPUs in training the CIFAR-10 model. However, in training MNIST, RDMA is 4.9% faster for 2 GPUs and worse than IPoIB for 4 GPUs. For the TensorFlowOnSpark CIFAR experiments the batch size is set to 32, whereas for MNIST it is set to 128.
  13. For the epoch-level evaluation, the total number of CPU cores used is 192 and the batch size is 768.