The GPU Open Analytics Initiative (GOAI) is accelerating data science like never before. CPUs are not improving at the same rate as networking and storage; by leveraging GPUs, data scientists can analyze more data than ever with less hardware. Learn more about how GPUs are accelerating data science (not just deep learning), and how to get started.
2. 2
SPARK ECOSYSTEM
The Glue of Big Data
• Spark has become almost synonymous with Hadoop and Big Data
• It’s the interface/API for app-to-app communication in big data
• The processing layer for big data and the leading ML framework
3. 3
DATA PROCESSING EVOLUTION
Faster Data Access Less Data Movement
[Diagram: Hadoop Processing, Reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train]
4. 4
DATA PROCESSING EVOLUTION
Faster Data Access Less Data Movement
[Diagram comparing two pipelines:
• Hadoop Processing, Reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
• Spark In-Memory Processing: HDFS Read → Query → ETL → ML Train (25-100x improvement: less code, language flexible, primarily in-memory)]
5. 5
SPARK ECOSYSTEM
Lacks Full GPU Integration
• 4 core parts: SQL, Streaming (micro-batched Spark functions), Machine Learning, & Graph
• Spark is currently optimizing its existing code base and adding usability; GPU support is not there yet
6. 6
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
[Diagram comparing three pipelines:
• Hadoop Processing, Reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
• Spark In-Memory Processing: HDFS Read → Query → ETL → ML Train (25-100x improvement: less code, language flexible, primarily in-memory)
• GPU/Spark In-Memory Processing: HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train (5-10x improvement: more code, language rigid, substantially on GPU)]
7. 7
Pre-GPU DATA FRAME
CURRENT
[Diagram: H2O.ai, Graphistry, Anaconda, Gunrock, BlazingDB, and MapD as App A/App B on the CPU, with a Copy & Convert step on every hop between apps]
Too Much Glue Code & Lack of Standards
• For GPU applications to talk to each other, data must be copied and converted up to three times
• Each company has to build and maintain connectors to copy and convert
• Some products wanted direct connectors to other products
• Fewer hops, but more for them to maintain and develop
• A standard was needed
• ISVs were always starting from scratch
• Barrier to entry and integration
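The connector explosion above is easy to quantify: with N applications and no shared format, every ordered pair of apps needs its own copy-and-convert path, while a shared standard needs only one adapter per application. A minimal sketch (pure Python, illustrative only):

```python
def connectors_without_standard(n_apps: int) -> int:
    # Every ordered pair of apps needs its own copy & convert path.
    return n_apps * (n_apps - 1)

def connectors_with_standard(n_apps: int) -> int:
    # Each app maintains a single adapter to the shared format.
    return n_apps

# The six vendors on the slide: H2O.ai, Graphistry, Anaconda,
# Gunrock, BlazingDB, MapD
print(connectors_without_standard(6))  # 30 pairwise connectors
print(connectors_with_standard(6))     # 6 adapters
```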
8. 8
GPU Data Frame
Data Movement Kills Performance
[Chart: number of data handoffs vs volume of data, Pre-GPU Data Frame]
9. 9
GPU-ACCELERATED ARCHITECTURE THEN
Too much data movement and too many different data formats
[Diagram: App A and App B each keep their own copy of the data on the GPU; H2O.ai, Anaconda, Gunrock, Graphistry, BlazingDB, and MapD read and load data on the CPU, then Copy & Convert it between CPU and GPU and between apps]
11. 11
INTEROPERABILITY IN BIG DATA
Lessons Learned From Apache Arrow & Parquet
• Both Apache Arrow and Apache Parquet are columnar storage formats
• Arrow resides in memory, whereas Parquet resides (compressed) on disk
• There is a major push in the big data world to remove the bottleneck of copying & converting data between systems, which was also a major issue in the GPU world
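The row-versus-column distinction behind Arrow and Parquet can be sketched in plain Python: scanning one field of row-oriented records touches every record, while a columnar layout keeps each field contiguous. A toy illustration (not the actual Arrow memory format):

```python
# Row-oriented: one dict per record (how many apps exchange data).
rows = [
    {"src_ip": "10.0.0.1", "bytes": 512},
    {"src_ip": "10.0.0.2", "bytes": 2048},
    {"src_ip": "10.0.0.1", "bytes": 128},
]

# Column-oriented (Arrow-style): one contiguous array per field.
columns = {
    "src_ip": ["10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "bytes": [512, 2048, 128],
}

# Summing one column touches a single contiguous array...
total = sum(columns["bytes"])
# ...instead of hopping through every record object.
assert total == sum(r["bytes"] for r in rows)
print(total)  # 2688
```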
12. 12
GPU-ACCELERATED ARCHITECTURE NOW
Single data format and shared access to data on GPU
[Diagram: H2O.ai, Anaconda, Gunrock, Graphistry, BlazingDB, and MapD read and load data once; a single GPU Data Frame in GPU memory, based on Apache Arrow, is shared by all apps]
13. 13
GPU OPEN ANALYTICS INITIATIVE
github.com/gpuopenanalytics
GPU Data Frame (GDF)
Pipeline: Ingest/Parse → Exploratory Analysis → Feature Engineering → ML/DL Algorithms → Grid Search → Scoring → Model Export
@gpuoai
Apache Arrow
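The GDF pipeline above can be sketched as a chain of stages over one shared table of columns. The snippet below is a toy stand-in for the real Arrow-backed GPU Data Frame (the dict-of-columns structure and stage names are illustrative, not the PyGDF API):

```python
# Hypothetical stand-in for a GPU Data Frame: a dict of columns that
# every stage shares without copying or converting.
def ingest():
    return {"event": ["login", "login", "scan"], "fails": [1, 7, 0]}

def explore(gdf):
    # Exploratory analysis: inspect a column in place.
    print("max fails:", max(gdf["fails"]))
    return gdf

def engineer(gdf):
    # Feature engineering: add a derived column; no copy of the others.
    gdf["suspicious"] = [f > 5 for f in gdf["fails"]]
    return gdf

def train(gdf):
    # ML stage placeholder: count positive labels.
    return sum(gdf["suspicious"])

gdf = engineer(explore(ingest()))
print(train(gdf))  # 1 suspicious event
```

Every stage receives the same object, which is the point of the shared format: no Copy & Convert between stages.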
19. 19
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
[Diagram comparing four pipelines:
• Hadoop Processing, Reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
• Spark In-Memory Processing: HDFS Read → Query → ETL → ML Train (25-100x improvement: less code, language flexible, primarily in-memory)
• GPU/Spark In-Memory Processing: HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train (5-10x improvement: more code, language rigid, substantially on GPU)
• End to End GPU Processing (GOAI): Arrow Read → Query → ETL → ML Train (25-100x improvement: same code, language flexible, primarily on GPU)]
20. 20
Expand GPU Usage
More Data, Less Hardware
[Chart: peak double precision TFLOPS by year, 2008-2017, NVIDIA GPU vs x86 CPU, y-axis 0.0-8.0 TFLOPS]
Scaling up and out with GPU co-processors
21. 21
ANACONDA
Python ETL for GPU
Numba: a Python open-source just-in-time optimizing compiler that uses LLVM to produce native machine instructions. Primary contributor to PyGDF.
Dask: a flexible parallel computing library for analytic computing with dynamic task scheduling and big data collections. Primary contributor to Dask_GDF.
Jeremy Howard
Deep learning researcher & educator. Founder: fast.ai; Faculty: USF & Singularity University; Previously CEO: Enlitic; President: Kaggle; CEO: Fastmail
“Rewrote @scikit_learn PolynomialFeatures in @ContinuumIO Numba. Got a 40x speedup (would be bigger with more data!) 12 lines of code”
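The PolynomialFeatures rewrite in that tweet is exactly the kind of tight numeric loop Numba accelerates. The pure-Python version below shows the underlying computation for one sample (decorating such a loop with Numba's @njit is what produces the speedup; Numba itself is not shown here):

```python
from itertools import combinations_with_replacement

def polynomial_features(row, degree=2):
    """Polynomial features of one sample, scikit-learn style:
    bias term, the features themselves, then all products up to degree."""
    feats = [1.0]  # bias term
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(range(len(row)), d):
            prod = 1.0
            for i in combo:
                prod *= row[i]
            feats.append(prod)
    return feats

# [1, x, y, x^2, x*y, y^2] for x=2, y=3
print(polynomial_features([2.0, 3.0]))  # [1.0, 2.0, 3.0, 4.0, 6.0, 9.0]
```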
29. 29
MAPD
MapD Core MapD Immerse
LLVM Backend: LLVM creates one custom function that runs at speeds approaching hand-written functions. LLVM enables generic targeting of different architectures + running simultaneously on CPU/GPU.
Streaming: Speed eliminates the need to pre-index or aggregate data. Compute resides on GPUs, freeing CPUs to parse + ingest. Finally, the newest data can be combined with billions of rows of “near historical” data.
Rendering: Data goes from the compute (CUDA) to the graphics (OpenGL) pipeline without a copy and comes back as a compressed PNG (~100 KB) rather than raw data (>1 GB).
30. 30
MAPD ARCHITECTURE
Visualization Libraries: JavaScript libraries, based on DC.js, that allow users to build custom web-based visualization apps powered by a MapD Core database.
LLVM: MapD Core SQL queries are compiled with a just-in-time (JIT) LLVM-based compiler and run as NVIDIA GPU machine code.
Distributed Scale-out: MapD Core has native distributed scale-out capabilities. MapD Core users can query and visualize larger datasets with much smaller cluster sizes than traditional solutions.
High Availability: MapD Core has high availability functionality that provides durability and redundancy. Ingest and queries are load balanced across servers for additional throughput.
Open Source Commercial
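MapD's LLVM approach, compiling each SQL query into one native function, can be mimicked in miniature with Python's own compile machinery. This toy "JIT" turns a filter expression into a reusable callable (Python bytecode standing in for LLVM machine code; the expression and data are made up):

```python
def jit_filter(expression):
    """Compile a filter expression string once, then run it per row.
    A toy analogue of compiling a SQL WHERE clause into one function."""
    code = compile(expression, "<query>", "eval")
    def predicate(row):
        # Evaluate with the row's columns as the only visible names.
        return eval(code, {"__builtins__": {}}, row)
    return predicate

rows = [
    {"fare": 8.5, "passengers": 1},
    {"fare": 52.0, "passengers": 4},
    {"fare": 31.0, "passengers": 2},
]

where = jit_filter("fare > 30 and passengers >= 2")
print([r["fare"] for r in rows if where(r)])  # [52.0, 31.0]
```

The one-time compile cost is paid per query, not per row, which is why this style pays off on billions of rows.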
32. 32
FIRST PRINCIPLES OF CYBER SECURITY
Where the industry must go
1. Indication of compromise needs to improve as attacks become more sophisticated, subtle, and hidden in the massive volume and velocity of data. Combining machine learning, graph analysis, and applied statistics, and integrating these methods with deep learning, is essential to reduce false positives, detect threats faster, and empower analysts to be more efficient.
2. Event management is an accelerated analytics problem; the volume and velocity of data from devices requires a new approach that combines all data sources to allow for more intelligent/advanced threat hunting and exploration at scale across machine data.
3. Visualization will be a key part of daily operations, which will allow analysts to label and train deep learning models faster, and validate machine learning predictions.
33. 33
RULES & PEOPLE DON’T SCALE
Right now, financial services firms report that it takes an average of 98 days to detect an Advanced Threat, but retailers say it can take about seven months.
Once the security community moves beyond the mantras “encrypt everything” and “secure the
perimeter,” it can begin developing intelligent prioritization and response plans to various kinds
of breaches – with a strong focus on integrity.
The challenge lies in efficiently scaling these technologies for practical deployment, and making
them reliable for large networks. This is where the security community should focus its efforts.
http://www.wired.com/2015/12/the-cia-secret-to-cybersecurity-that-no-one-seems-to-get/
Current methods are too slow
34. 34
ATTACKS ARE MORE SOPHISTICATED
How Hackers Hijacked a Bank’s Entire Online Operation
https://www.wired.com/2017/04/hackers-hijacked-banks-entire-online-operation/
35. 35
FIRST PRINCIPLES OF CYBER SECURITY
Where the industry must go
36. 36
MULTI MODEL APPROACH
No Silver Bullet In Cyber Security
[Diagram: ensemble of nvGRAPH and H2O4GPU (https://github.com/h2oai/h2o4gpu); graph sized at # edges = E * 2^S ~34M]
37. 37
FIRST PRINCIPLES OF CYBER SECURITY
Where the industry must go
38. 38
GPU ACCELERATION
Accelerate the Pipeline, Not Just Deep Learning
• GPUs for deep learning = proven
• Where else and how else can we use
GPU acceleration?
• Dashboards
• Accelerating data pipeline
• Stream processing
• Building better models faster
• First: GPU databases
[Diagram: pipeline stages: Data Ingestion, Data Processing, Visualization, Model Training, Inferencing]
39. 39
MOVING TO BIG DATA IS A START
Spark outperforms traditional SIEM
Big Data Solution (Spark, 10 node cluster, ~$60k in hardware) vs SIEM
Data: production SIEM data of a Fortune 500 enterprise; 450+ columns, ~250 million events per day
Spark vs SIEM Benchmarks from Accenture Labs - Strata NY, Bsides LV
40. 40
MOVING TO BIG DATA IS A START
Spark outperforms traditional SIEM
Typical Scenario | Time Period | SIEM | Big Data | Speed Up
1. Show all network communication from one host (IP) to multiple hosts (IPs) | 1 Day | 3h 20m 13s | 1m 44s | 114 Times Faster
   | 1 Week | Not Feasible* | 4m 05s |
2. Retrieve failed logon attempts in Active Directory | 1 Day | 18m 26s | 1m 37s | 10 Times Faster
   | 1 Week | 2h 13m 45s | 3m 10s | 41 Times Faster
3. Search for malware (exe) in Symantec logs | 1 Day | 3h 24m 36s | 1m 37s | 125 Times Faster
   | 1 Week | Not Feasible* | 3m 22s |
4. View all proxy logs for a specific domain | 1 Day | 4h 30m 13s | 2m 54s | 92 Times Faster
   | 1 Week | Not Feasible* | 1m 09s** |
Spark vs SIEM Benchmarks from Accenture Labs - Strata NY, Bsides LV
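The speed-up column in the table is just the ratio of the two wall-clock times; a small helper makes the arithmetic checkable (illustrative code, not part of the Accenture benchmark):

```python
import re

def to_seconds(duration):
    """Parse durations like '3h 20m 13s' into total seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u]
               for n, u in re.findall(r"(\d+)\s*([hms])", duration))

siem = to_seconds("3h 20m 13s")       # 12013 s
spark = to_seconds("1m 44s")          # 104 s
print(f"{siem / spark:.0f}x faster")  # roughly the 114x reported on the slide
```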
41. 41
GPU DATABASES ARE EVEN FASTER
1.1 Billion Taxi Ride Benchmarks
[Benchmark chart, query time in milliseconds (lower is better):
Query | MapD DGX-1 | MapD 4 x P100 | Redshift 6-node | Spark 11-node
Query 1 | 21 | 30 | 1560 | 10190
Query 2 | 80 | 99 | 1250 | 8134
Query 3 | 150 | 269 | 2250 | 19624
Query 4 | 372 | 696 | 2970 | 85942]
Source: MapD Benchmarks on DGX from internal NVIDIA testing following guidelines of Mark Litwintschik’s blogs: Redshift, 6-node ds2.8xlarge cluster & Spark 2.1, 11 x m3.xlarge cluster w/ HDFS @marklit82
42. 42
FIRST PRINCIPLES OF CYBER SECURITY
Where the industry must go
44. 44
DATA PLATFORM-AS-A-SERVICE
• Handles 1M events/second
• Auto-scales the cluster
SCALE
• Offers HA with no data-loss
• Always-on architecture
• Data replication
HIGH AVAILABILITY
• Data platform security has
been implemented with
VPCs in AWS
• Dashboard access using
NVIDIA LDAP
SECURITY
• Log-to-analytics
• Kibana, JDBC access
• Accessing data using BI tools
SELF SERVICE
50. 50
VISUALIZATION WITH GPU
Less hardware, more performance, more scale
1/10th the hardware
1-2 orders of magnitude more performance
51. 51
VISUALIZATION WITH GPU
Less hardware, more performance, more scale
Real time visualization of 100K+ nodes 1M+ Edges
50-100x faster clustering than other solutions
52. 52
LISTS DO NOT VISUALLY SCALE
Text search is a great starting point, but it does not scale: you do not see the 30K+ events, nor the IPs and users, nor how they relate…
54. 54
GRAPHS: A KEY MISSING VIEW
Unified Model
• Shows entities, events, and relationships
• Multipurpose: connect, see, interact
Visual
• Inspect individual items
• See behavior, patterns, and outliers
• Scale to enterprise workloads
55. 55
DIFFERENT GRAPHS, DIFFERENT QUESTIONS
Uni
Ex: Network mapping. “Is it safe to reboot this?”
[diagram: a single edge, ip to ip]
Hyper
Ex: Incident response. “Did this escalate?”
[diagram: one hyperedge linking ip, user, and event nodes]
Multi
Ex: SSH trails. “Is a user crossing zones?”
[diagram: parallel edges among ip and user nodes]
57. 57
CYBERWORKS
CYBERWORKS SIEM SDK
Goals
• Open Source Ecosystem & Select ISVs
• Integration Points w/ leading security vendors
• FireEye
• Splunk
• Palo Alto Networks
Purpose
A platform that lets analysts hunt and analyze data faster and at greater scale than traditional big data, to find unknown and zero-day threats. It will accelerate the threat detection ecosystem and harden cyber defense utilizing GPU ISVs and deep learning frameworks.
Purpose Built SDK For SIEM Analytics
58. 58
CYBERWORKS ACTIVITIES
Continuous Improvement
• Use GPU-accelerated databases to analyze data to improve hunting today, as well as enrich and label data for deep learning.
• Connect accelerated DBs to Splunk for event management, hunting, and exploration. Use Graphistry and MapD to visualize the data for anomaly and threat detection in new ways. The goal is to GPU-accelerate parts of Splunk through partnership and connect/bolt on GPU DBs/Graphistry.
• Use ML and graph analytics for feature extraction and behavioral analytics, an ensemble approach to detection. Expand deep learning training as more data is labeled/classified and threats are caught faster, building off DL techniques used in GFN, other groups, and external ISVs.
• Generalize deep learning for supervised and unsupervised anomaly and threat detection (insider, APT, DDoS, etc.) while building our own cyber security deep learning accelerator. Use best practices from DriveWorks and other accelerators and SDKs as a reference architecture. Leverage DL from other parts of the firm to accelerate development as well.
• While using Splunk Cloud to protect NVIDIA, we create a redundant path of data to enable R&D.