Leveraging Open Source for Large Scale
Analytics on HPC Systems
Rob Vesse, Software Engineer, Cray Inc
C O M P U T E | S T O R E | A N A L Y Z E
Overview
● Background
● Challenges
● Packaging and Deployment
● Input/Output
● Scaling Analytics
● Python Data Science
● Machine Learning
Slides: https://cray.box.com/v/sw-data-july-2018
Copyright Cray Inc 2018
Legal Disclaimer
Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to
any intellectual property rights is granted by this document.
Cray Inc. may make changes to specifications and product descriptions at any time, without notice.
All products, dates and figures specified are preliminary based on current expectations, and are subject to
change without notice.
Cray hardware and software products may contain design defects or errors known as errata, which may cause
the product to deviate from published specifications. Current characterized errata are available on request.
Cray uses codenames internally to identify products that are in development and not yet publicly announced
for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising,
promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user.
Performance tests and ratings are measured using specific systems and/or components and reflect the
approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware
or software design or configuration may affect actual performance.
The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and
design, SONEXION, and URIKA. The following are trademarks of Cray Inc.: APPRENTICE2, CHAPEL,
CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, REVEAL,
THREADSTORM. The following system family marks, and associated model number marks, are trademarks of
Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a
sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other
trademarks used in this document are the property of their respective owners.
Background
● About Me
● Software Engineer in the Analytics R&D Group
● Develop hardware and software solutions across Cray's product portfolio
● Primarily focused on integrating open source software into a coherent, user-friendly product
● Involved in open source for ~15 years, committer at Apache Software Foundation
since 2012, and member since 2015
● Definition - High Performance Computing (HPC)
● Any sufficiently large high performance computer
● Typically $500,000 or more
● From tens of nodes up to tens of thousands of nodes
● Creates some interesting scaling and implementation challenges for analytics
● Why analytics on HPC Systems?
● Scale
● Productivity
● Utilization
Packaging and Deployment
● Challenges
● HPC Systems are highly controlled environments
● Users are granted the minimum permissions possible
● Many open source packages have extensive dependencies or expect users to bring in their own
Solution - Containers
● An easy solution, right?
● HPC sysadmins are really paranoid
● Docker is still considered insecure by many
● NERSC Shifter
● An HPC-centric containerizer, used on our top end systems
● Designed to scale out massively
● Forces the containerized process to run as the launching user's UID
● Can consume Docker images but has its own image gateway and format
● Docker
● Currently used for our cluster systems
● Eventually will be used on our next generation supercomputers
Containers - Shifter vs Docker
● Both are open source so why choose Docker?
● https://github.com/NERSC/shifter
● https://github.com/docker
● Docker has a far more vibrant community
● Many of its shortcomings for HPC have been or are being addressed
● E.g. Container access to hardware devices like GPUs
● NVIDIA Docker - https://github.com/NVIDIA/nvidia-docker
● It's Open Container Initiative (OCI) compliant
● Docker can be used with other key technologies e.g.
Kubernetes
Orchestration
● For distributed applications we need something to tie the
containers together
● Also want to support multi-tenant isolation
● Kubernetes
● Fastest growing container orchestrator out there
● Open APIs and highly extensible
● Declaratively specify complex applications and self-service
configuration via APIs
● E.g. Deploying Apache Spark on Kubernetes using Bloomberg's
Kerberos support mods
● Biggest problem for us is networking!
Kubernetes Cluster Networking
● Kubernetes has a networking model that supports
customizable network providers
● Differing capabilities, bare networking through to network
traffic policy management
● E.g. isolate Tenant A from Tenant B
● Different providers use different approaches e.g.
● Flannel and Weave use VXLAN
● Cilium uses eBPF
● Calico and Romana use static routing
● Our Aries network doesn't support VLANs and our kernel
doesn't support eBPF!
● Therefore we chose Romana
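The tenant isolation mentioned above can be sketched as a Kubernetes NetworkPolicy. This is a minimal illustration, not the configuration used in the talk: the namespace `tenant-a` and the policy name are hypothetical, and whether such a policy is actually enforced depends on the network provider in use.

```python
import json


def same_namespace_only_policy(namespace: str) -> dict:
    """Build a NetworkPolicy manifest that admits ingress only from pods
    in the same namespace, so traffic from other tenants is refused."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        # The policy name is a hypothetical choice for this sketch.
        "metadata": {"name": "same-namespace-only", "namespace": namespace},
        "spec": {
            # An empty selector applies the policy to every pod in the namespace.
            "podSelector": {},
            "policyTypes": ["Ingress"],
            # Allow traffic only from pods that live in this same namespace.
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }


if __name__ == "__main__":
    # "tenant-a" is a hypothetical namespace standing in for Tenant A.
    print(json.dumps(same_namespace_only_policy("tenant-a"), indent=2))
```

The printed JSON is a regular manifest, so it could be applied with `kubectl apply -f -` once per tenant namespace.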
Input/Output Challenges
● Lots of analytics frameworks, e.g. Apache Hadoop Map/Reduce and Apache Spark, rely on local storage
● E.g. temporary scratch space
● BUT many HPC systems have no local storage
[Figure: Spark shuffle data flow. Diagram labels: map task thread, block manager, disk, reduce task thread, request, TCP, Spark scheduler, shuffle write, shuffle read, metadata]
Virtual Local Storage
● tmpfs/ramfs
● Standard temporary file system for *nix OSes
● Stored in RAM
● tmpfs is preferred as it can be given a maximum size
● BUT competes with your analytics frameworks for memory
● Use the system's parallel file system e.g. Lustre
● Unfortunately these aren't designed for small file IO
● Deadlocks the metadata servers causing significant slowdown for
everyone!
● Using Linux loopback mounts to solve this
● Short lived files never leave OS disk cache i.e. still in memory
● OS can flush OS disk cache as needed
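The loopback approach above might look like the following sketch. It only builds the commands and does not run them. The image path, mount point and size are hypothetical, ext4 is an assumed choice of file system, and the mount step needs root privileges on a real system.

```python
import shlex


def loopback_scratch_commands(image_path: str, mount_point: str, size_gb: int) -> list:
    """Return the shell commands that create a file system image on the
    parallel file system and mount it as node-local scratch space."""
    return [
        # One large sparse file, so the parallel file system sees a single file.
        ["truncate", "-s", f"{size_gb}G", image_path],
        # -F lets mkfs.ext4 format a regular file rather than a block device.
        ["mkfs.ext4", "-q", "-F", image_path],
        ["mkdir", "-p", mount_point],
        # Files created under the mount point are handled by this node's kernel.
        ["mount", "-o", "loop", image_path, mount_point],
    ]


if __name__ == "__main__":
    # Both paths below are hypothetical examples.
    for cmd in loopback_scratch_commands("/lus/scratch/node0042.img", "/mnt/scratch", 64):
        print(shlex.join(cmd))
```

A framework's scratch directory (for example Spark's `spark.local.dir`) could then point at the mount point.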
Python Data Science
● Challenges
● Managing dependencies
● Compute nodes typically have no external network connectivity
● Distributed computation
● Maximising hardware utilization for performance
Dependency Management
● Using Anaconda to solve this
● Have to resolve the environments up front
● Compute nodes can't access external network
● Also need to project environments onto compute nodes
as needed
● For containers use volume mounts and environment variable
injection into the container
● For standard jobs need to store environments on a file system
visible to compute nodes
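As a rough illustration of environment variable injection, the sketch below builds the variables that make a pre-resolved Conda environment visible to a process. It assumes the environment has been mounted at a hypothetical prefix, and it only approximates what `conda activate` does (activation scripts shipped inside the environment are not run).

```python
import os


def conda_env_vars(prefix: str, base_env: dict) -> dict:
    """Return a copy of base_env with the environment at `prefix` first on PATH."""
    env = dict(base_env)
    env["CONDA_PREFIX"] = prefix
    bin_dir = os.path.join(prefix, "bin")
    old_path = env.get("PATH", "")
    # Prepend so that the environment's python shadows the system one.
    env["PATH"] = bin_dir + (os.pathsep + old_path if old_path else "")
    return env


if __name__ == "__main__":
    # /mnt/envs/analytics is a hypothetical volume mount point.
    env = conda_env_vars("/mnt/envs/analytics", dict(os.environ))
    print(env["PATH"].split(os.pathsep)[0])
```

The resulting dictionary could be passed as `env=` to `subprocess.Popen`, or translated into the container runtime's own mechanism for setting environment variables.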
Distributed Computation - Dask
● Distributed work scheduling library for Python
● Integrates with common data science libraries
● NumPy, Pandas, scikit-learn
● Familiar Pythonic API for scaling out workloads
● Can be installed as part of the Conda environment
>>> from dask.distributed import Client
>>> # Connect via the file the scheduler wrote its address to
>>> client = Client(scheduler_file='/path/to/scheduler.json')
>>> def square(x):
...     return x ** 2
...
>>> def neg(x):
...     return -x
...
>>> A = client.map(square, range(10))  # one future per input
>>> B = client.map(neg, A)             # futures can feed other tasks
>>> total = client.submit(sum, B)
>>> total  # Function hasn't yet completed
<Future: status: waiting, key: sum-58999c52e0fa35c7d7346c098f5085c7>
>>> total.result()  # blocks until the result is ready
-285
>>> client.gather(A)
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Dask - Scheduler & Environment Setup
● Using Dask requires running scheduler and worker
processes on our compute resources
● We don't necessarily know the set of physical nodes we will get
ahead of time
● Dask provides a scheduler file mechanism for this
● Need to start a scheduler and worker on each physical
node
● We use the entry point scripts of our container images to do this
● Also need to integrate with the user's Conda environment
● MUST activate the volume mounted environments prior to
starting Dask
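The scheduler file mechanism can be pictured with the stdlib-only sketch below: a worker or client polls a shared path until the scheduler has written its address there. In practice `Client(scheduler_file=...)` and the `dask-worker --scheduler-file` option handle this themselves. The helper name is hypothetical, and the assumption that the file is JSON with an `address` key should be checked against your Dask version.

```python
import json
import os
import tempfile
import time


def wait_for_scheduler_address(path: str, timeout: float = 60.0, poll: float = 0.5) -> str:
    """Poll `path` until it holds a JSON document, then return its "address" field."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(path) as f:
                return json.load(f)["address"]
        except (OSError, ValueError, KeyError, TypeError):
            # File missing, still being written, or not in the assumed shape: retry.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no usable scheduler file at {path} after {timeout}s")
        time.sleep(poll)


if __name__ == "__main__":
    # Stand-in for the scheduler: write a file with a made-up address.
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "scheduler.json")
        with open(path, "w") as f:
            json.dump({"address": "tcp://10.0.0.1:8786"}, f)
        print(wait_for_scheduler_address(path, timeout=5))
```

Because the path sits on a file system visible to every node, no process needs to know the scheduler's host name before the job starts.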
Maximising Performance
● To fully take advantage of HPC hardware need to use
appropriately optimized libraries
● Option 1 - Custom Anaconda Channels
● E.g. Intel Distribution for Python
● Uses Intel AVX and MKL (Math Kernel Library) underneath popular
libraries
● Option 2 - ABI Injection
● Where a library uses a defined ABI e.g. mpi4py ensure it is
compiled against the generic ABI
● At runtime use volume mounts to mount the platform specific
ABI implementation at the appropriate location
● E.g. Cray MPICH, Open MPI, Intel MPI
Machine Learning
● Challenges
● How do we take advantage of both GPUs and CPUs?
● Efficiently scale out onto distributed systems
GPUs vs CPUs
● GPUs typically best suited to training models
● More time and resource intensive
● CPUs typically best suited to inference
● i.e. Make predictions using a trained model
● Need different hardware optimisations for each
● Don't necessarily know where our code will run ahead of time
● Therefore compile separately for each environment and
select desired build via container entry point script
● This requires a container runtime that supports GPUs e.g. Shifter or NVIDIA Docker
● NB - We're trading off image size for performance
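The entry point logic described above could be sketched as follows. The directory layout (`/opt/app/gpu`, `/opt/app/cpu`) and the use of `/dev/nvidia0` as the GPU check are assumptions made for illustration, not details from the talk.

```python
import os


def select_build(root: str = "/opt/app", gpu_device: str = "/dev/nvidia0") -> str:
    """Pick the GPU build when an NVIDIA device node is visible, else the CPU build."""
    variant = "gpu" if os.path.exists(gpu_device) else "cpu"
    return os.path.join(root, variant)


if __name__ == "__main__":
    # Prints the directory whose binaries the entry point would launch.
    print(select_build())
```

Shipping both builds in one image is what makes the image larger: the choice is deferred until the container starts on a particular node.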
Distributed Training
● Framework support for distributed training is not well optimized
● Typically TCP/IP based protocols e.g. gRPC
● Esoteric to configure
● Want to utilize full capabilities of the network
● Uber's Horovod
● https://github.com/uber/horovod
● Uses MPI to better leverage the network (InfiniBand/RoCE)
● Minor changes needed to your ML scripts
● Interleaves computation and communication
● Uses more efficient MPI collectives where possible
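One collective that suits gradient exchange is allreduce, which sums the values held by every worker and leaves the total on all of them. The sketch below simulates a ring allreduce inside a single process to show the communication pattern. It is not Horovod's code and uses no MPI; the function name and the list-of-lists representation of workers are choices made for this illustration.

```python
def ring_allreduce(worker_data):
    """Simulate a ring allreduce (sum) over equally sized per-worker vectors."""
    n = len(worker_data)
    length = len(worker_data[0])
    # Split each vector into n contiguous chunks.
    bounds = [(i * length) // n for i in range(n + 1)]
    data = [list(v) for v in worker_data]

    def chunk(c):
        return slice(bounds[c], bounds[c + 1])

    # Phase 1, reduce-scatter: after n-1 steps worker r holds the complete
    # sum for chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot all messages first so the sends are effectively simultaneous.
        sends = [((r - step) % n, data[r][chunk((r - step) % n)]) for r in range(n)]
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            data[dst][chunk(c)] = [x + y for x, y in zip(data[dst][chunk(c)], payload)]

    # Phase 2, allgather: pass the completed chunks around the ring.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, data[r][chunk((r + 1 - step) % n)]) for r in range(n)]
        for r, (c, payload) in enumerate(sends):
            data[(r + 1) % n][chunk(c)] = payload
    return data


if __name__ == "__main__":
    grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    for row in ring_allreduce(grads):
        print(row)  # each worker ends with [111, 222, 333, 444]
```

Each worker only ever talks to its ring neighbour and sends one chunk per step, so the per-worker traffic stays roughly constant as workers are added.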
Horovod vs gRPC Performance
https://www.slideshare.net/AlexanderSergeev4/horovod-distributed-tensorflow-made-easy#slide15
Conclusions
● Scaling open source analytics has some non-obvious
gotchas
● Often assumes a traditional cluster environment
● Most challenges revolve around IO and Networking
● There are some promising open source efforts to solve these more thoroughly
● Our Roadmap
● Looking to have stock Docker running on next generation
systems
● Leverage more of Kubernetes' features to provide a cloud-like self-service HPC model
Questions?
rvesse@cray.com
https://cray.box.com/v/sw-data-july-2018
References - Containers
Tool             Project Homepage/Repository
NERSC Shifter    https://github.com/NERSC/shifter
Docker           https://docker.com
NVidia Docker    https://github.com/NVIDIA/nvidia-docker
Kubernetes       https://kubernetes.io
Flannel          https://coreos.com/flannel
Weave            https://www.weave.works
Cilium           https://cilium.io
Calico           https://www.projectcalico.org
Romana           https://romana.io
References - Analytics & Data Science
Tool                            Project Homepage/Repository
Apache Hadoop                   https://hadoop.apache.org
Anaconda                        https://conda.io/docs/
Dask                            http://dask.pydata.org/en/latest/
NumPy                           http://www.numpy.org
xarray                          http://xarray.pydata.org/en/stable/
SciPy                           https://www.scipy.org
Pandas                          https://pandas.pydata.org
mpi4py                          http://mpi4py.scipy.org/docs/
Intel Distribution of Python    https://software.intel.com/en-us/distribution-for-python
References - Machine Learning
Tool          Project Homepage/Repository
TensorFlow    https://www.tensorflow.org
gRPC          https://grpc.io
Horovod       https://github.com/uber/horovod

Bristol is Open: Exploring Open Data in the CityBristol is Open: Exploring Open Data in the City
Bristol is Open: Exploring Open Data in the City
 
Ask bigger questions
Ask bigger questionsAsk bigger questions
Ask bigger questions
 

Último

FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 

Último (20)

FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 

Leveraging open source for large scale analytics

  • 1. Leveraging Open Source for Large Scale Analytics on HPC Systems Rob Vesse, Software Engineer, Cray Inc
  • 2. Overview
    ● Background
    ● Challenges
    ● Packaging and Deployment
    ● Input/Output
    ● Scaling Analytics
    ● Python Data Science
    ● Machine Learning
    Slides: https://cray.box.com/v/sw-data-july-2018
  • 3. Legal Disclaimer
    Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document.
    Cray Inc. may make changes to specifications and product descriptions at any time, without notice.
    All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
    Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
    Cray uses codenames internally to identify products that are in development and not yet publicly announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user.
    Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
    The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, and URIKA. The following are trademarks of Cray Inc.: APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, REVEAL, THREADSTORM. The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used in this document are the property of their respective owners.
  • 4. Background
    ● About Me
      ● Software Engineer in the Analytics R&D Group
      ● Develop hardware and software solutions across Cray's product portfolio
      ● Primarily focused on integrating open source software into a coherent, user-friendly product
      ● Involved in open source for ~15 years; committer at the Apache Software Foundation since 2012 and member since 2015
    ● Definition - High Performance Computing (HPC)
      ● Any sufficiently large high performance computer
      ● Typically $500,000 plus
      ● As small as 10s of nodes up to 10,000s of nodes
      ● Creates some interesting scaling and implementation challenges for analytics
    ● Why analytics on HPC systems?
      ● Scale
      ● Productivity
      ● Utilization
  • 5. Packaging and Deployment
    ● Challenges
      ● HPC systems are highly controlled environments
      ● Users are granted the minimum permissions possible
      ● Many open source packages have extensive dependencies or expect users to bring in their own
  • 6. Solution - Containers
    ● An easy solution, right?
      ● HPC sysadmins are really paranoid
      ● Docker is still considered insecure by many
    ● NERSC Shifter
      ● An HPC-centric containerizer, used on our top end systems
      ● Designed to scale out massively
      ● Forces the containerized process to run as the launching user's UID
      ● Can consume Docker images but has its own image gateway and format
    ● Docker
      ● Currently used for our cluster systems
      ● Eventually will be used on our next generation supercomputers
  • 7. Containers - Shifter vs Docker
    ● Both are open source, so why choose Docker?
      ● https://github.com/NERSC/shifter
      ● https://github.com/docker
    ● Docker has a far more vibrant community
    ● Many of its shortcomings for HPC have been or are being addressed
      ● E.g. container access to hardware devices like GPUs
      ● NVidia Docker - https://github.com/NVIDIA/nvidia-docker
    ● It's Open Container Initiative (OCI) compliant
    ● Docker can be used with other key technologies e.g. Kubernetes
  • 8. Orchestration
    ● For distributed applications we need something to tie the containers together
    ● Also want to support multi-tenant isolation
    ● Kubernetes
      ● Fastest growing container orchestrator out there
      ● Open APIs and highly extensible
      ● Declaratively specify complex applications and self-service configuration via APIs
      ● E.g. deploying Apache Spark on Kubernetes using Bloomberg's Kerberos support mods
    ● Biggest problem for us is networking!
  • 9. Kubernetes Cluster Networking
    ● Kubernetes has a networking model that supports customizable network providers
      ● Differing capabilities, from bare networking through to network traffic policy management
      ● E.g. isolate Tenant A from Tenant B
    ● Different providers use different approaches e.g.
      ● Flannel and Weave use VXLAN
      ● Cilium uses eBPF
      ● Calico and Romana use static routing
    ● Our Aries network doesn't support VLANs and our kernel doesn't support eBPF!
      ● Therefore we chose Romana
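The elimination reasoning on slide 9 can be sketched as a simple filter. This is an illustration only: the provider-to-mechanism table copies the slide, while the `choose_providers` function and the capability labels are hypothetical. It also assumes that the slide's "doesn't support VLANs" limitation is what rules out the VXLAN-based providers.

```python
# Mechanism each network provider relies on, as listed on the slide.
PROVIDER_MECHANISM = {
    "Flannel": "vxlan",
    "Weave": "vxlan",
    "Cilium": "ebpf",
    "Calico": "static-routing",
    "Romana": "static-routing",
}


def choose_providers(supported_mechanisms):
    """Return the providers whose mechanism the platform supports, sorted by name."""
    return sorted(
        name
        for name, mechanism in PROVIDER_MECHANISM.items()
        if mechanism in supported_mechanisms
    )


if __name__ == "__main__":
    # Assumed capability set for the system described: no VXLAN, no eBPF.
    print(choose_providers({"static-routing"}))
```

With only static routing available, the filter leaves Calico and Romana. The slide states that Romana was chosen but does not say what separated it from Calico.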
  • 10. Input/Output Challenges
    ● Lots of analytics frameworks e.g. Apache Hadoop Map/Reduce, Apache Spark rely on local storage
      ● E.g. temporary scratch space
    ● BUT many HPC systems have no local storage
    [Diagram: Spark shuffle. Labels: map task thread, block manager, disk, reduce task thread, request (TCP), Spark scheduler, shuffle write, shuffle read, metadata]
  • 11. Virtual Local Storage
    ● tmpfs/ramfs
      ● Standard temporary file system for *nix OSes
      ● Stored in RAM
      ● tmpfs is preferred as it can be specified with a max size
      ● BUT competes with your analytics frameworks for memory
    ● Use the system's parallel file system e.g. Lustre
      ● Unfortunately these aren't designed for small file IO
      ● Deadlocks the metadata servers, causing significant slowdown for everyone!
    ● Using Linux loopback mounts to solve this
      ● Short lived files never leave the OS disk cache i.e. still in memory
      ● OS can flush the OS disk cache as needed
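A rough sketch of what a loopback-mounted scratch area could look like, assuming the approach is a single large backing file on the parallel file system that holds an ordinary local file system. The slide gives no details, so the ext4 choice, the size, the paths, and the `loopback_scratch_commands` helper are all assumptions made for illustration. The function only builds the command lines and does not run them.

```python
import shlex


def loopback_scratch_commands(image_path, mount_point, size="16G"):
    """Build (but do not run) the commands for a loopback-mounted scratch area."""
    return [
        # Sparse backing file stored on the parallel file system (e.g. Lustre).
        ["truncate", "-s", size, image_path],
        # -F: the target is a regular file rather than a block device; -q: quiet.
        ["mkfs.ext4", "-F", "-q", image_path],
        ["mkdir", "-p", mount_point],
        # Mounting normally requires root privileges.
        ["mount", "-o", "loop", image_path, mount_point],
    ]


if __name__ == "__main__":
    # Hypothetical paths, one backing file per compute node.
    for cmd in loopback_scratch_commands("/lus/scratch/node0042.img", "/mnt/local-scratch"):
        print(shlex.join(cmd))
```

Under this assumption, the many small scratch files created by a framework become block reads and writes against one large file, so the parallel file system's metadata servers see a single file rather than thousands.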
  • 12. Python Data Science
    ● Challenges
      ● Managing dependencies
        ● Compute nodes typically have no external network connectivity
      ● Distributed computation
      ● Maximising hardware utilization for performance
  • 13. Dependency Management
    ● Using Anaconda to solve this
    ● Have to resolve the environments up front
      ● Compute nodes can't access external network
    ● Also need to project environments onto compute nodes as needed
      ● For containers use volume mounts and environment variable injection into the container
      ● For standard jobs need to store environments on a file system visible to compute nodes
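The "volume mounts and environment variable injection" step could look roughly like the sketch below. It builds `docker run` arguments (`-v` and `-e` are standard Docker flags) that mount a pre-resolved Conda environment read-only and put it first on `PATH`. The helper names, the `/opt/conda-env` mount point, and the default container `PATH` are hypothetical, and setting `PATH` and `CONDA_PREFIX` only approximates what a full Conda activation does.

```python
import posixpath


def conda_env_vars(env_prefix, base_env):
    """Return a copy of base_env with the Conda environment first on PATH."""
    env = dict(base_env)
    bin_dir = posixpath.join(env_prefix, "bin")
    existing = env.get("PATH", "")
    env["PATH"] = bin_dir + (":" + existing if existing else "")
    env["CONDA_PREFIX"] = env_prefix
    return env


def docker_env_args(host_env_path, mount_point="/opt/conda-env",
                    container_path="/usr/local/bin:/usr/bin:/bin"):
    """Build docker run arguments: a read-only volume mount plus injected variables."""
    injected = conda_env_vars(mount_point, {"PATH": container_path})
    args = ["-v", f"{host_env_path}:{mount_point}:ro"]
    for name in ("PATH", "CONDA_PREFIX"):
        args += ["-e", f"{name}={injected[name]}"]
    return args


if __name__ == "__main__":
    # Hypothetical location of an environment resolved up front on a login node.
    print(docker_env_args("/lus/envs/demo"))
```

For non-container jobs the same `conda_env_vars` idea applies, with the environment prefix pointing at a shared file system path that the compute nodes can see.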
  • 14. Distributed Computation - Dask
    ● Distributed work scheduling library for Python
    ● Integrates with common data science libraries
      ● NumPy, Pandas, scikit-learn
    ● Familiar Pythonic API for scaling out workloads
    ● Can be installed as part of the Conda environment

        >>> from dask.distributed import Client
        >>> client = Client(scheduler_file='/path/to/scheduler.json')
        >>> def square(x): return x ** 2
        >>> def neg(x): return -x
        >>> A = client.map(square, range(10))
        >>> B = client.map(neg, A)
        >>> total = client.submit(sum, B)
        >>> total  # Function hasn't yet completed
        <Future: status: waiting, key: sum-58999c52e0fa35c7d7346c098f5085c7>
        >>> total.result()
        -285
        >>> client.gather(A)
        [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
  • 15. Dask - Scheduler & Environment Setup
    ● Using Dask requires running scheduler and worker processes on our compute resources
    ● We don't necessarily know the set of physical nodes we will get ahead of time
      ● Dask provides a scheduler file mechanism for this
    ● Need to start a scheduler and worker on each physical node
      ● We use the entry point scripts of our container images to do this
    ● Also need to integrate with the user's Conda environment
      ● MUST activate the volume mounted environments prior to starting Dask
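A minimal sketch of the decision an entry point script might make on each node, using Dask's `--scheduler-file` option for both `dask-scheduler` and `dask-worker`. The function name, the convention that node index 0 hosts the scheduler, and the use of `source activate` are assumptions for illustration; the slides do not show the actual scripts.

```python
import shlex


def dask_entrypoint_commands(node_index, scheduler_file, conda_env):
    """Return the shell command lines an entry point would run on one node."""
    # The Conda environment must be activated before any Dask process starts.
    activate = f"source activate {shlex.quote(conda_env)}"
    sched_file = shlex.quote(scheduler_file)
    commands = []
    if node_index == 0:
        # Assumption: the first node also hosts the scheduler, which writes its
        # address into the shared scheduler file.
        commands.append(f"{activate} && dask-scheduler --scheduler-file {sched_file}")
    # Every node runs a worker, which reads the scheduler address from the file.
    commands.append(f"{activate} && dask-worker --scheduler-file {sched_file}")
    return commands


if __name__ == "__main__":
    for index in range(2):
        for line in dask_entrypoint_commands(index, "/shared/scheduler.json", "/opt/conda-env"):
            print(f"node {index}: {line}")
```

Because workers and clients find the scheduler through the shared file, nothing needs to know the node names in advance. The file has to sit on storage visible to every node and to the client.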
  • 16. Maximising Performance
    ● To take full advantage of HPC hardware we need to use appropriately optimized libraries
    ● Option 1 - Custom Anaconda Channels
      ● E.g. Intel Distribution for Python
      ● Uses Intel AVX and MKL (Math Kernel Library) underneath popular libraries
    ● Option 2 - ABI Injection
      ● Where a library uses a defined ABI e.g. mpi4py, ensure it is compiled against the generic ABI
      ● At runtime use volume mounts to mount the platform specific ABI implementation at the appropriate location
      ● E.g. Cray MPICH, Open MPI, Intel MPI
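ABI injection via volume mounts might be expressed as below: each platform-specific library on the host is mounted read-only over the generic copy inside the container. This is a sketch under stated assumptions. The `abi_injection_mounts` helper, the directory paths, and the library file name are placeholders, not values from the slides.

```python
def abi_injection_mounts(host_lib_dir, container_lib_dir, library_names):
    """Build docker -v arguments that overlay host libraries onto the container's copies."""
    args = []
    for name in library_names:
        # host path : container path : read-only
        args += ["-v", f"{host_lib_dir}/{name}:{container_lib_dir}/{name}:ro"]
    return args


if __name__ == "__main__":
    # Placeholder paths and library name, chosen only for illustration.
    print(abi_injection_mounts("/opt/platform/mpi/lib", "/usr/lib64", ["libmpi.so"]))
```

The container image stays generic, and the same image picks up whichever MPI implementation the host platform provides at launch time.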
  • 17. Machine Learning
    ● Challenges
      ● How do we take advantage of both GPUs and CPUs?
      ● Efficiently scale out onto distributed systems
  • 18. GPUs vs CPUs
    ● GPUs typically best suited to training models
      ● More time and resource intensive
    ● CPUs typically best suited to inference
      ● i.e. make predictions using a trained model
    ● Need different hardware optimisations for each
      ● Don't necessarily know where our code will run ahead of time
      ● Therefore compile separately for each environment and select desired build via container entry point script
      ● This requires a container runtime that supports GPUs e.g. Shifter or NVidia Docker
    ● NB - We're trading off image size for performance
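Selecting a build at container start could be as simple as the following sketch. The `select_build` function, the `gpu`/`cpu` directory layout, and the use of `/dev/nvidia0` as the GPU probe are assumptions; the slides only say that the entry point script picks the desired build.

```python
import os
import posixpath


def select_build(build_root, gpu_available=None):
    """Pick the GPU or CPU build directory for the current node."""
    if gpu_available is None:
        # Assumption: a visible NVIDIA device node means the GPU build is usable.
        gpu_available = os.path.exists("/dev/nvidia0")
    return posixpath.join(build_root, "gpu" if gpu_available else "cpu")


if __name__ == "__main__":
    # Hypothetical image layout with one build per hardware target.
    print(select_build("/opt/framework"))
```

Shipping both builds in one image is the size-for-performance trade-off the slide mentions: the image grows, but the same image runs well on either kind of node.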
  • 19. Distributed Training
    ● Framework support for distributed training is not well optimized
      ● Typically TCP/IP based protocols e.g. gRPC
      ● Esoteric to configure
      ● Want to utilize full capabilities of the network
    ● Uber's Horovod
      ● https://github.com/uber/horovod
      ● Uses MPI to better leverage the network (InfiniBand/RoCE)
      ● Minor changes needed to your ML scripts
      ● Interleaves computation and communication
      ● Uses more efficient MPI collectives where possible
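To illustrate what an efficient collective looks like, here is a pure-Python simulation of a ring allreduce, the pattern commonly associated with MPI-style gradient averaging. It is not Horovod's implementation and uses no MPI. The function names are hypothetical and the "workers" are just lists in one process.

```python
def ring_allreduce(vectors):
    """Sum equal-length vectors held by p simulated workers arranged in a ring."""
    p = len(vectors)
    n = len(vectors[0])
    if any(len(v) != n for v in vectors) or n % p != 0:
        raise ValueError("vectors must share a length divisible by the worker count")
    size = n // p
    # data[r][c] is worker r's copy of chunk c.
    data = [[list(v[c * size:(c + 1) * size]) for c in range(p)] for v in vectors]

    # Phase 1 (scatter-reduce): after p - 1 steps, worker r holds the complete
    # sum for chunk (r + 1) % p.
    for step in range(p - 1):
        msgs = [(r, (r - step) % p, list(data[r][(r - step) % p])) for r in range(p)]
        for src, chunk, payload in msgs:
            dst = (src + 1) % p
            data[dst][chunk] = [a + b for a, b in zip(data[dst][chunk], payload)]

    # Phase 2 (allgather): circulate the completed chunks around the ring.
    for step in range(p - 1):
        msgs = [(r, (r + 1 - step) % p, list(data[r][(r + 1 - step) % p])) for r in range(p)]
        for src, chunk, payload in msgs:
            data[(src + 1) % p][chunk] = payload

    return [[x for chunk in worker for x in chunk] for worker in data]


def average_gradients(gradients):
    """Average per-worker gradients so every worker ends up with the same result."""
    p = len(gradients)
    return [[x / p for x in summed] for summed in ring_allreduce(gradients)]


if __name__ == "__main__":
    print(average_gradients([[2.0, 4.0], [4.0, 8.0]]))
```

Each worker only ever talks to its ring neighbour and sends one chunk per step, so the data sent per worker stays roughly constant as workers are added. That property is what makes this kind of collective attractive on a fast interconnect.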
  • 20. Horovod vs gRPC Performance
    https://www.slideshare.net/AlexanderSergeev4/horovod-distributed-tensorflow-made-easy#slide15
  • 21. Conclusions
    ● Scaling open source analytics has some non-obvious gotchas
      ● Often assumes a traditional cluster environment
      ● Most challenges revolve around IO and networking
      ● There are some promising open source efforts to solve these more thoroughly
    ● Our Roadmap
      ● Looking to have stock Docker running on next generation systems
      ● Leverage more of Kubernetes' features to provide a cloud-like self-service HPC model
  • 22. Questions?
    rvesse@cray.com
    https://cray.box.com/v/sw-data-july-2018
  • 23. References - Containers
    Tool             Project Homepage/Repository
    NERSC Shifter    https://github.com/NERSC/shifter
    Docker           https://docker.com
    NVidia Docker    https://github.com/NVIDIA/nvidia-docker
    Kubernetes       https://kubernetes.io
    Flannel          https://coreos.com/flannel
    Weave            https://www.weave.works
    Cilium           https://cilium.io
    Calico           https://www.projectcalico.org
    Romana           https://romana.io
  • 24. References - Analytics & Data Science
    Tool                            Project Homepage/Repository
    Apache Hadoop                   https://hadoop.apache.org
    Anaconda                        https://conda.io/docs/
    Dask                            http://dask.pydata.org/en/latest/
    NumPy                           http://www.numpy.org
    xarray                          http://xarray.pydata.org/en/stable/
    SciPy                           https://www.scipy.org
    Pandas                          https://pandas.pydata.org
    mpi4py                          http://mpi4py.scipy.org/docs/
    Intel Distribution of Python    https://software.intel.com/en-us/distribution-for-python
  • 25. References - Machine Learning
    Tool          Project Homepage/Repository
    TensorFlow    https://www.tensorflow.org
    gRPC          https://grpc.io
    Horovod       https://github.com/uber/horovod