1
GOAI: GPU-ACCELERATED DATA SCIENCE
Joshua Patterson | Director of Applied Solutions Engineering | DataSciCon 2017
@datametrician
2
SPARK ECOSYSTEM
The Glue of Big Data
• Spark has almost become synonymous with Hadoop and Big Data
• It’s the interface/API for big data app to app communication
• The processing layer for big data and leading ML framework
3
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
Hadoop Processing, Reading from disk:
HDFS Read -> Query -> HDFS Write -> HDFS Read -> ETL -> HDFS Write -> HDFS Read -> ML Train
4
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
Hadoop Processing, Reading from disk:
HDFS Read -> Query -> HDFS Write -> HDFS Read -> ETL -> HDFS Write -> HDFS Read -> ML Train
Spark In-Memory Processing:
HDFS Read -> Query -> ETL -> ML Train
25-100x Improvement; Less code; Language flexible; Primarily In-Memory
5
SPARK ECOSYSTEM
Lacks Full GPU Integration
• 4 Core Parts: SQL, Streaming (Spark functions micro batched), Machine Learning, & Graph
• Spark is currently optimizing its existing code base, adding more usability, not GPU support yet
6
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
Hadoop Processing, Reading from disk:
HDFS Read -> Query -> HDFS Write -> HDFS Read -> ETL -> HDFS Write -> HDFS Read -> ML Train
Spark In-Memory Processing:
HDFS Read -> Query -> ETL -> ML Train
25-100x Improvement; Less code; Language flexible; Primarily In-Memory
GPU/Spark In-Memory Processing:
HDFS Read -> GPU Read -> Query -> CPU Write -> GPU Read -> ETL -> CPU Write -> GPU Read -> ML Train
5-10x Improvement; More code; Language rigid; Substantially on GPU
7
Pre-GPU DATA FRAME
Too Much Glue Code & Lack Of Standards
[Diagram labels: CURRENT; CPU; APP A; APP B; Copy & Convert (repeated between products); products: H2O.ai, Graphistry, Anaconda, Gunrock, BlazingDB, MapD]
• For GPU applications to talk to each other, data must be copied and converted up to three times
• Each company has to build and maintain connectors to copy and convert
• Some products wanted direct connectors to other products
  • Reduced hops but more for them to maintain and develop
• A standard was needed
• ISVs always starting from scratch
• Barrier to entry and integration
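The maintenance burden in the bullets above can be made concrete with a little arithmetic. The sketch below counts connectors for the six products named on the slide under two assumed models: a direct connector for every ordered pair of products, versus every product reading and writing one shared format. The counting model is an illustration only and is not stated in the talk.

```python
# Hypothetical counting model: how many copy-and-convert connectors are needed
# with pairwise integrations versus one shared data format.

PRODUCTS = ["H2O.ai", "Graphistry", "Anaconda", "Gunrock", "BlazingDB", "MapD"]


def pairwise_connectors(n: int) -> int:
    """One connector per ordered pair (A -> B and B -> A are separate)."""
    return n * (n - 1)


def shared_format_connectors(n: int) -> int:
    """Each product implements one reader and one writer for the shared format."""
    return 2 * n


n = len(PRODUCTS)
print(f"{n} products, pairwise: {pairwise_connectors(n)} connectors")
print(f"{n} products, shared format: {shared_format_connectors(n)} connectors")
```

Under these assumptions the pairwise count grows quadratically with the number of products while the shared-format count grows linearly, which is one way to read "a standard was needed".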
8
GPU Data Frame
Data Movement Kills Performance
[Chart axes: Volume of data; Number of data handoffs. Labels: Handoff; Pre-GPU DATA FRAME]
9
GPU-ACCELERATED ARCHITECTURE THEN
Too much data movement and too many different data formats
[Diagram labels: CPU; GPU; Read Data; Copy & Convert (x3); Load Data; APP A GPU Data; APP B GPU Data; products: H2O.ai, Anaconda, Gunrock, Graphistry, BlazingDB, MapD]
11
INTEROPERABILITY IN BIG DATA
Lessons Learned From Apache Arrow & Parquet
• Both Apache Arrow and Apache Parquet are columnar data formats
• Arrow resides in memory, whereas Parquet resides on disk
• There is a major push in the big data world to remove the bottleneck of copying and converting data between systems, which was also a major issue in the GPU world
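To illustrate why a shared columnar layout avoids the copy-and-convert step, here is a minimal sketch using only the Python standard library. It is not Arrow or Parquet code: the `array` and `memoryview` types stand in for a shared columnar buffer, and the record fields are made up for the example.

```python
from array import array

# Row-oriented records (hypothetical fields): a consumer must walk every
# row and build a new list to get at a single column.
rows = [
    {"bytes_sent": 120, "duration_ms": 15},
    {"bytes_sent": 300, "duration_ms": 42},
    {"bytes_sent": 75, "duration_ms": 8},
]
copied_column = [r["bytes_sent"] for r in rows]  # copy & convert step

# Column-oriented layout: one contiguous typed buffer per column.
bytes_sent = array("q", [120, 300, 75])  # "q" = signed 64-bit integers

# A second consumer gets a view onto the same memory instead of a copy.
shared_view = memoryview(bytes_sent)

# A write through the producer's buffer is visible through the view,
# which shows that no copy was made.
bytes_sent[0] = 999
print(copied_column[0])  # the copied list is unaffected by the write
print(shared_view[0])    # the view reads the updated buffer
```

The same idea, applied to GPU memory with an agreed format, is what lets several applications work on one data frame without each building its own converter.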
12
GPU-ACCELERATED ARCHITECTURE NOW
Single data format and shared access to data on GPU
[Diagram labels: CPU; GPU; GPU MEM; Read Data; Load Data; GPU Data Frame, based on Apache Arrow; products: H2O.ai, Anaconda, Gunrock, Graphistry, BlazingDB, MapD]
13
GPU OPEN ANALYTICS INITIATIVE
github.com/gpuopenanalytics
GPU Data Frame (GDF)
Ingest/Parse; Exploratory Analysis; Feature Engineering; ML/DL Algorithms; Grid Search; Scoring; Model Export
Apache Arrow
@gpuoai
14
EASY TO USE
15
EASY TO USE
16
USE GPUS IN PYTHON
17
GROWING COMMUNITY SUPPORT
Apache Arrow Apache Parquet
18
GPU ACCELERATION ACROSS THE ECOSYSTEM
19
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
Hadoop Processing, Reading from disk:
HDFS Read -> Query -> HDFS Write -> HDFS Read -> ETL -> HDFS Write -> HDFS Read -> ML Train
Spark In-Memory Processing:
HDFS Read -> Query -> ETL -> ML Train
25-100x Improvement; Less code; Language flexible; Primarily In-Memory
GPU/Spark In-Memory Processing:
HDFS Read -> GPU Read -> Query -> CPU Write -> GPU Read -> ETL -> CPU Write -> GPU Read -> ML Train
5-10x Improvement; More code; Language rigid; Substantially on GPU
End to End GPU Processing (GOAI):
Arrow Read -> Query -> ETL -> ML Train
25-100x Improvement; Same code; Language flexible; Primarily on GPU
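The "less data movement" point can be checked by counting the read and write steps in each of the four pipelines on this slide. The sketch below is a rough tally, assuming the steps interleave in the order Query, ETL, ML Train as the diagram suggests; the step lists are transcribed from the slide labels and the dictionary keys are shorthand introduced here.

```python
# Compute stages named on the slide; every other step is a data movement.
STAGES = {"Query", "ETL", "ML Train"}

# Step order is an assumption reconstructed from the slide labels.
PIPELINES = {
    "Hadoop": ["HDFS Read", "Query", "HDFS Write", "HDFS Read", "ETL",
               "HDFS Write", "HDFS Read", "ML Train"],
    "Spark": ["HDFS Read", "Query", "ETL", "ML Train"],
    "GPU/Spark": ["HDFS Read", "GPU Read", "Query", "CPU Write", "GPU Read",
                  "ETL", "CPU Write", "GPU Read", "ML Train"],
    "GOAI": ["Arrow Read", "Query", "ETL", "ML Train"],
}


def data_movements(steps):
    """Count the steps that move data rather than compute on it."""
    return sum(1 for step in steps if step not in STAGES)


for name, steps in PIPELINES.items():
    print(f"{name}: {data_movements(steps)} data movement steps")
```

On this reading, the GPU/Spark pipeline moves data more often than the Hadoop one, which is consistent with the slide rating it a smaller improvement than the end-to-end GPU pipeline.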
20
Expand GPU Usage
More Data, Less Hardware
[Chart: Peak Double Precision TFLOPS (0.0 to 8.0), NVIDIA GPU vs. x86 CPU, 2008 to 2017]
Scaling up and out with GPU co-processors
21
ANACONDA
Python ETL for GPU
Numba: a Python open-source just-in-time optimizing compiler that uses LLVM to produce native machine instructions. Primary contributor to PyGDF.
Dask: a flexible parallel computing library for analytic computing with dynamic task scheduling and big data collections. Primary contributor to Dask_GDF.
Jeremy Howard (Deep learning researcher & educator. Founder: fast.ai; Faculty: USF & Singularity University; previously CEO: Enlitic; President: Kaggle; CEO: Fastmail):
Rewrote @scikit_learn PolynomialFeatures in @ContinuumIO Numba. Got a 40x speedup (would be bigger with more data!) 12 lines of code
22
BLAZINGDB
Scale out Datawarehousing
23
GRAPHISTRY
Graph Visualization
Optimized Networking; GPU Analysis and ML; GPU Rendering
Hunting: Daily Anomalies
SecOps: Shadow IT Use
IR: Killchain Analysis
Fraud: Tracking Embezzlers
Threat Intel: Botnet Analysis
24
GUNROCK
GPU-accelerated graph analytics library
Multi-GPU optimized algorithms
Reduced cost and increased performance
Performance constantly improving
25
H2O.AI
H2O4GPU - GPU Machine Learning Library
26
27
[Chart values: 87; 51; 171 with latest solver]
28
29
MAPD
MapD Core, MapD Immerse
LLVM: LLVM creates one custom function that runs at speeds approaching hand-written functions. LLVM enables generic targeting of different architectures + run simultaneously on CPU/GPU.
Streaming: Speed eliminates need to pre-index or aggregate data. Compute resides on GPUs, freeing CPUs to parse + ingest. Finally, newest data can be combined with billions of rows of “near historical” data.
Backend Rendering: Data goes from compute (CUDA) to graphics (OpenGL) pipeline without copy and comes back as compressed PNG (~100 KB) rather than raw data (> 1 GB).
30
MAPD ARCHITECTURE
Visualization Libraries: JavaScript libraries that allow users to build custom web-based visualization apps powered by a MapD Core database based on DC.js.
LLVM: MapD Core SQL queries are compiled with a just-in-time (JIT) LLVM based compiler, and run as NVIDIA GPU machine code.
Distributed Scale-out: MapD Core has native distributed scale-out capabilities. MapD Core users can query and visualize larger datasets with much smaller cluster sizes than traditional solutions.
High Availability: MapD Core has high availability functionality that provides durability and redundancy. Ingest and queries are load balanced across servers for additional throughput.
Open Source; Commercial
31
CYBER SECURITY
An Ideal Use Case for GPU Acceleration
32
FIRST PRINCIPLES OF CYBER SECURITY
Where the industry must go
1. Indication of compromise needs to improve as attacks are becoming more sophisticated, subtle, and hidden in the massive volume and velocity of data. Combining machine learning, graph analysis, and applied statistics, and integrating these methods with deep learning, is essential to reduce false positives, detect threats faster, and empower analysts to be more efficient.
2. Event management is an accelerated analytics problem: the volume and velocity of data from devices require a new approach that combines all data sources to allow for more intelligent/advanced threat hunting and exploration at scale across machine data.
3. Visualization will be a key part of daily operations, which will allow analysts to label and train Deep Learning models faster and validate machine learning predictions.
33
RULES & PEOPLE DON’T SCALE
Current methods are too slow
Right now, financial services reports it takes an average of 98 days to detect an advanced threat, but retailers say it can be about seven months.
Once the security community moves beyond the mantras “encrypt everything” and “secure the perimeter,” it can begin developing intelligent prioritization and response plans to various kinds of breaches, with a strong focus on integrity.
The challenge lies in efficiently scaling these technologies for practical deployment, and making them reliable for large networks. This is where the security community should focus its efforts.
http://www.wired.com/2015/12/the-cia-secret-to-cybersecurity-that-no-one-seems-to-get/
34
ATTACKS ARE MORE SOPHISTICATED
How Hackers Hijacked a Bank’s Entire Online Operation
https://www.wired.com/2017/04/hackers-hijacked-banks-entire-online-operation/
35
FIRST PRINCIPLES OF CYBER SECURITY
Where the industry must go
1. Indication of compromise needs to improve as attacks are becoming more sophisticated, subtle, and hidden in the massive volume and velocity of data. Combining machine learning, graph analysis, and applied statistics, and integrating these methods with deep learning, is essential to reduce false positives, detect threats faster, and empower analysts to be more efficient.
36
MULTI MODEL APPROACH
No Silver Bullet In Cyber Security
nvGRAPH
https://github.com/h2oai/h2o4gpu
# edges = E * 2^S ~34M
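As a quick check on the edge-count formula, the sketch below evaluates E * 2^S. The slide does not give E or S. Reading S as a scale (a graph with 2^S vertices) and E as the average number of edges per vertex is an assumption, and the values E = 16 and S = 21 are chosen only because they reproduce roughly 34M edges.

```python
def edge_count(edge_factor: int, scale: int) -> int:
    """Edges in a graph with 2**scale vertices and edge_factor edges
    per vertex on average: E * 2^S."""
    return edge_factor * 2 ** scale


# Assumed values (not stated on the slide): E = 16, S = 21.
edges = edge_count(16, 21)
print(f"{edges:,} edges (~{edges / 1e6:.0f}M)")
```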
37
FIRST PRINCIPLES OF CYBER SECURITY
Where the industry must go
1. Indication of compromise needs to improve as attacks are becoming more sophisticated, subtle, and hidden in the massive volume and velocity of data. Combining machine learning, graph analysis, and applied statistics, and integrating these methods with deep learning, is essential to reduce false positives, detect threats faster, and empower analysts to be more efficient.
2. Event management is an accelerated analytics problem: the volume and velocity of data from devices require a new approach that combines all data sources to allow for more intelligent/advanced threat hunting and exploration at scale across machine data.
38
GPU ACCELERATION
Accelerate the Pipeline, Not Just Deep Learning
• GPUs for deep learning = proven
• Where else and how else can we use GPU acceleration?
  • Dashboards
  • Accelerating data pipeline
  • Stream processing
  • Building better models faster
• First: GPU databases
Pipeline stages: Data Ingestion; Data Processing; Visualization; Model Training; Inferencing
39
MOVING TO BIG DATA IS A START
Spark outperforms traditional SIEM
SIEM vs Big Data Solution
Big Data Solution: 10 node cluster - ~$60k in hardware
Data: Production SIEM of Fortune 500 Enterprise Data; 450+ columns; ~250 million events per day
Spark vs SIEM Benchmarks from Accenture Labs - Strata NY, Bsides LV
40
MOVING TO BIG DATA IS A START
Spark outperforms traditional SIEM
Typical Scenario
Time Period   SIEM            Big Data   Speed Up
1. Show all network communication from one host (IP) to multiple hosts (IPs)
1 Day         3h 20m 13s      1m 44s     114 Times Faster
1 Week        Not Feasible*   4m 05s
2. Retrieve failed logon attempts in Active Directory
1 Day         18m 26s         1m 37s     10 Times Faster
1 Week        2h 13m 45s      3m 10s     41 Times Faster
3. Search for Malware (exe) in Symantec logs
1 Day         3h 24m 36s      1m 37s     125 Times Faster
1 Week        Not Feasible*   3m 22s
4. View all proxy logs for a specific domain
1 Day         4h 30m 13s      2m 54s     92 Times Faster
1 Week        Not Feasible*   1m 09s**
Spark vs SIEM Benchmarks from Accenture Labs - Strata NY, Bsides LV
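The "Times Faster" figures on this benchmark slide can be reproduced from the listed durations. The sketch below is a small check and not part of the original benchmark. It assumes the figures were computed as (SIEM time - Big Data time) / Big Data time, rounded down, which is an inference from the numbers and not something the slide states.

```python
import re


def to_seconds(duration: str) -> int:
    """Convert a string such as '3h 20m 13s' into seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    parts = re.findall(r"(\d+)\s*([hms])", duration)
    return sum(int(value) * units[unit] for value, unit in parts)


def times_faster(siem: str, big_data: str) -> int:
    """Whole-number 'times faster' figure: (old - new) // new (assumed formula)."""
    old, new = to_seconds(siem), to_seconds(big_data)
    return (old - new) // new


# (scenario, period, SIEM time, Big Data time), copied from the slide.
measurements = [
    (1, "1 Day", "3h 20m 13s", "1m 44s"),
    (2, "1 Day", "18m 26s", "1m 37s"),
    (2, "1 Week", "2h 13m 45s", "3m 10s"),
    (3, "1 Day", "3h 24m 36s", "1m 37s"),
    (4, "1 Day", "4h 30m 13s", "2m 54s"),
]

for scenario, period, siem, big_data in measurements:
    print(f"Scenario {scenario}, {period}: {times_faster(siem, big_data)} times faster")
```

A plain ratio of the two durations would come out one higher in each row, so the distinction only matters when comparing these figures with speedups quoted elsewhere.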
41
GPU DATABASES ARE EVEN FASTER
1.1 Billion Taxi Ride Benchmarks
Time in Milliseconds
          MapD DGX-1   MapD 4 x P100   Redshift 6-node   Spark 11-node
Query 1   21           30              1560              10190
Query 2   80           99              1250              8134
Query 3   150          269             2250              19624
Query 4   372          696             2970              85942
Source: MapD Benchmarks on DGX from internal NVIDIA testing following guidelines of Mark Litwintschik’s blogs: Redshift, 6-node ds2.8xlarge cluster & Spark 2.1, 11 x m3.xlarge cluster w/ HDFS @marklit82
42
FIRST PRINCIPLES OF CYBER SECURITY
Where the industry must go
1. Indication of compromise needs to improve as attacks are becoming more sophisticated, subtle, and hidden in the massive volume and velocity of data. Combining machine learning, graph analysis, and applied statistics, and integrating these methods with deep learning, is essential to reduce false positives, detect threats faster, and empower analysts to be more efficient.
2. Event management is an accelerated analytics problem: the volume and velocity of data from devices require a new approach that combines all data sources to allow for more intelligent/advanced threat hunting and exploration at scale across machine data.
3. Visualization will be a key part of daily operations, which will allow analysts to label and train Deep Learning models faster and validate machine learning predictions.
43
44
DATA PLATFORM-AS-A-SERVICE
SCALE
• Handles 1M events/second
• Auto-scales the cluster
HIGH AVAILABILITY
• Offers HA with no data loss
• Always-on architecture
• Data replication
SECURITY
• Data platform security has been implemented with VPCs in AWS
• Dashboard access using NVIDIA LDAP
SELF SERVICE
• Log-to-analytics
• Kibana, JDBC access
• Accessing data using BI tools
45
ARCHITECTURE
V1
46
ARCHITECTURE
V2 (with MapD)
47
MAPD VS KIBANA
Dashboards Comparison + Performance Test Method
48
DASHBOARD PERFORMANCE
MapD Immerse vs Elastic Kibana
[Chart: Time to Fully Load (seconds, 0 to 300) vs. Days of Data (1 to 31) for MapD Immerse (DGX), MapD Immerse (P2), and Elastic Kibana; annotations: < 9s, < 12s]
49
VISUALIZATION WITH GPU
Less hardware, more performance, more scale
50
VISUALIZATION WITH GPU
Less hardware, more performance, more scale
1/10th the hardware
1-2 orders of magnitude more performance
51
VISUALIZATION WITH GPU
Less hardware, more performance, more scale
1/10th the hardware
1-2 orders of magnitude more performance
Real time visualization of 100K+ nodes, 1M+ edges
50-100x faster clustering than other solutions
52
LISTS DO NOT VISUALLY SCALE
Text search is a great starting point!
Does not scale
Do not see the 30K+ events nor the IPs, users, nor how they relate…
53
BAR CHARTS HIDE RELATIONSHIPS
Good for summaries!
But not: individual items
But not: behaviors, relationships, patterns, outliers, …
54
GRAPHS: A KEY MISSING VIEW
Unified Model
Shows entities, events, and relationships
Multipurpose: connect, see, interact
Visual
Inspect individual items
See behavior, patterns, and outliers
Scale to enterprise workloads
55
DIFFERENT GRAPHS, DIFFERENT QUESTIONS
Uni. Ex: Network mapping. “Is it safe to reboot this?” (nodes: ip, ip)
Hyper. Ex: Incident response. “Did this escalate?”
Multi. Ex: SSH trails. “Is a user crossing zones?”
[Diagram node labels: ip, user, event]
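One way to read the "hyper" view is that each event becomes its own node, linked to every entity it mentions, so users and IPs connect through shared events. The sketch below builds such a structure with the standard library only. The event records, field names, and the `build_hypergraph` helper are hypothetical and are not taken from any product shown in the talk.

```python
from collections import defaultdict

# Hypothetical log events; field names and values are made up for illustration.
events = [
    {"id": "e1", "user": "alice", "ip": "10.0.0.5"},
    {"id": "e2", "user": "alice", "ip": "10.0.0.9"},
    {"id": "e3", "user": "bob", "ip": "10.0.0.9"},
]


def build_hypergraph(events, entity_fields=("user", "ip")):
    """Return (nodes, edges): one node per event and per distinct entity,
    and one edge from each event to every entity it mentions."""
    nodes = set()
    edges = []
    for event in events:
        event_node = ("event", event["id"])
        nodes.add(event_node)
        for field in entity_fields:
            entity_node = (field, event[field])
            nodes.add(entity_node)
            edges.append((event_node, entity_node))
    return nodes, edges


nodes, edges = build_hypergraph(events)

# Index events by the entities they touch, so shared infrastructure stands out.
events_by_entity = defaultdict(set)
for event_node, entity_node in edges:
    events_by_entity[entity_node].add(event_node)

print(len(nodes), "nodes,", len(edges), "edges")
print(sorted(events_by_entity[("ip", "10.0.0.9")]))
```

In this toy data the address 10.0.0.9 is touched by events from two different users, which is the kind of relationship a list or bar chart would not surface.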
56
CURRENT WORK
57
CYBERWORKS
CYBERWORKS SIEM SDK
Purpose Built SDK For SIEM Analytics
Purpose
A platform that lets analysts hunt and analyze data at scale faster than traditional big data tools, to find unknown and zero-day threats. It will accelerate the threat detection ecosystem and harden cyber defense utilizing GPU ISVs and Deep Learning frameworks.
Goals
• Open Source Ecosystem & Select ISVs
• Integration Points w/ leading security vendors
  • FireEye
  • Splunk
  • Palo Alto Networks
58
CYBERWORKS ACTIVITIES
Continuous Improvement
Use GPU accelerated databases to analyze data to improve hunting today, as well as enrich and label data for Deep Learning.
Connect accelerated DBs to Splunk for event management, hunting, and exploration. Use Graphistry and MapD to visualize the data for anomaly and threat detection in new ways.
The goal is to GPU accelerate parts of Splunk through partnership and connect/bolt on GPUDBs/Graphistry.
Use ML and Graph Analytics for feature extraction and behavioral analytics, an ensemble approach to detection.
Expand Deep Learning training as more data is labeled/classified, and threats are caught faster, building off DL techniques used in GFN, other groups, and external ISVs.
Generalize Deep Learning for supervised and unsupervised anomaly and threat detection (Insider, APT, DDoS, etc.) while building our own cyber security deep learning accelerator. Use best practices from Driveworks and other accelerators and SDK as a reference architecture. Leverage DL from other parts of the firm to accelerate development as well.
While using Splunk Cloud to protect NVIDIA, we create a redundant path of data to enable R&D.
nvGRAPH
59
CYBERWORKS ARCHITECTURE
[Architecture diagram labels: SecOps; Data Sources; Ingest; Storage; Stream Processing; Batch Processing; Serving Layer; Notebook; Visualization; Graph Processing; cuSTINGER; Gunrock; Graph Visualization; Deep Learning; Machine Learning; axes: Interactivity, Query Speed]
60
CYBERWORKS HARDWARE
Accelerating your SIEM
[Hardware diagram labels: Scale out Cluster; DGX Cluster; NAS; SIEM; Notebooks; End User; 3rd Party Apps; Messaging Queue]
61
JOIN THE REVOLUTION
Everyone Can Help!
APACHE ARROW: https://arrow.apache.org/ @ApacheArrow
APACHE PARQUET: https://parquet.apache.org/ @ApacheParquet
GPU Open Analytics Initiative: http://gpuopenanalytics.com/ @Gpuoai
Integrations, feedback, documentation support, pull requests, new issues, or donations welcomed!
Joshua Patterson @datametrician
QUESTIONS?

Mais conteúdo relacionado

Mais procurados

MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...DataWorks Summit
 
Druid in Spot Instances
Druid in Spot InstancesDruid in Spot Instances
Druid in Spot InstancesImply
 
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...Kevin Mao
 
End-to-End Big Data AI with Analytics Zoo
End-to-End Big Data AI with Analytics ZooEnd-to-End Big Data AI with Analytics Zoo
End-to-End Big Data AI with Analytics ZooJason Dai
 
IoT Connected Brewery
IoT Connected BreweryIoT Connected Brewery
IoT Connected BreweryJason Hubbard
 
Listening at the Cocktail Party with Deep Neural Networks and TensorFlow
Listening at the Cocktail Party with Deep Neural Networks and TensorFlowListening at the Cocktail Party with Deep Neural Networks and TensorFlow
Listening at the Cocktail Party with Deep Neural Networks and TensorFlowDatabricks
 
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...Big Data Spain
 
Beyond Kerberos and Ranger - Tips to discover, track and manage risks in hybr...
Beyond Kerberos and Ranger - Tips to discover, track and manage risks in hybr...Beyond Kerberos and Ranger - Tips to discover, track and manage risks in hybr...
Beyond Kerberos and Ranger - Tips to discover, track and manage risks in hybr...DataWorks Summit
 
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)Jeff Hung
 
Webinar: The Modern Streaming Data Stack with Kinetica & StreamSets
Webinar: The Modern Streaming Data Stack with Kinetica & StreamSetsWebinar: The Modern Streaming Data Stack with Kinetica & StreamSets
Webinar: The Modern Streaming Data Stack with Kinetica & StreamSetsKinetica
 
Reliable Data Intestion in BigData / IoT
Reliable Data Intestion in BigData / IoTReliable Data Intestion in BigData / IoT
Reliable Data Intestion in BigData / IoTGuido Schmutz
 
Building a future-proof cyber security platform with Apache Metron
Building a future-proof cyber security platform with Apache MetronBuilding a future-proof cyber security platform with Apache Metron
Building a future-proof cyber security platform with Apache MetronDataWorks Summit
 
Big Data Day LA 2016/ Use Case Driven track - Shaping the Role of Data Scienc...
Big Data Day LA 2016/ Use Case Driven track - Shaping the Role of Data Scienc...Big Data Day LA 2016/ Use Case Driven track - Shaping the Role of Data Scienc...
Big Data Day LA 2016/ Use Case Driven track - Shaping the Role of Data Scienc...Data Con LA
 
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...Spark Summit
 
The Big Data Ecosystem at LinkedIn
The Big Data Ecosystem at LinkedInThe Big Data Ecosystem at LinkedIn
The Big Data Ecosystem at LinkedInOSCON Byrum
 
AI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat DetectionAI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat DetectionDatabricks
 
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...Data Con LA
 

Mais procurados (20)

MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
 
Druid in Spot Instances
Druid in Spot InstancesDruid in Spot Instances
Druid in Spot Instances
 
Shaping a Digital Vision
Shaping a Digital VisionShaping a Digital Vision
Shaping a Digital Vision
 
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
 
End-to-End Big Data AI with Analytics Zoo
End-to-End Big Data AI with Analytics ZooEnd-to-End Big Data AI with Analytics Zoo
End-to-End Big Data AI with Analytics Zoo
 
IoT Connected Brewery
IoT Connected BreweryIoT Connected Brewery
IoT Connected Brewery
 
Listening at the Cocktail Party with Deep Neural Networks and TensorFlow
Listening at the Cocktail Party with Deep Neural Networks and TensorFlowListening at the Cocktail Party with Deep Neural Networks and TensorFlow
Listening at the Cocktail Party with Deep Neural Networks and TensorFlow
 
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
Beyond Kerberos and Ranger - Tips to discover, track and manage risks in hybr...
Beyond Kerberos and Ranger - Tips to discover, track and manage risks in hybr...Beyond Kerberos and Ranger - Tips to discover, track and manage risks in hybr...
Beyond Kerberos and Ranger - Tips to discover, track and manage risks in hybr...
 
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)
 
Webinar: The Modern Streaming Data Stack with Kinetica & StreamSets
Webinar: The Modern Streaming Data Stack with Kinetica & StreamSetsWebinar: The Modern Streaming Data Stack with Kinetica & StreamSets
Webinar: The Modern Streaming Data Stack with Kinetica & StreamSets
 
Reliable Data Intestion in BigData / IoT
Reliable Data Intestion in BigData / IoTReliable Data Intestion in BigData / IoT
Reliable Data Intestion in BigData / IoT
 
Building a future-proof cyber security platform with Apache Metron
Building a future-proof cyber security platform with Apache MetronBuilding a future-proof cyber security platform with Apache Metron
Building a future-proof cyber security platform with Apache Metron
 
Big Data Day LA 2016/ Use Case Driven track - Shaping the Role of Data Scienc...
Big Data Day LA 2016/ Use Case Driven track - Shaping the Role of Data Scienc...Big Data Day LA 2016/ Use Case Driven track - Shaping the Role of Data Scienc...
Big Data Day LA 2016/ Use Case Driven track - Shaping the Role of Data Scienc...
 
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...
 
The Big Data Ecosystem at LinkedIn
The Big Data Ecosystem at LinkedInThe Big Data Ecosystem at LinkedIn
The Big Data Ecosystem at LinkedIn
 
AI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat DetectionAI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat Detection
 
Apache Spot
Apache SpotApache Spot
Apache Spot
 
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...
Big Data Day LA 2016/ NoSQL track - Analytics at the Speed of Light with Redi...
 

Semelhante a GOAI: GPU-Accelerated Data Science DataSciCon 2017

NVIDIA Rapids presentation
NVIDIA Rapids presentationNVIDIA Rapids presentation
NVIDIA Rapids presentationtestSri1
 
End to End Machine Learning Open Source Solution Presented in Cisco Developer...
End to End Machine Learning Open Source Solution Presented in Cisco Developer...End to End Machine Learning Open Source Solution Presented in Cisco Developer...
End to End Machine Learning Open Source Solution Presented in Cisco Developer...Manish Harsh
 
GPU-Accelerating UDFs in PySpark with Numba and PyGDF
GPU-Accelerating UDFs in PySpark with Numba and PyGDFGPU-Accelerating UDFs in PySpark with Numba and PyGDF
GPU-Accelerating UDFs in PySpark with Numba and PyGDFKeith Kraus
 
AI Scalability for the Next Decade
AI Scalability for the Next DecadeAI Scalability for the Next Decade
AI Scalability for the Next DecadePaula Koziol
 
RAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceRAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceData Works MD
 
GPU 101: The Beast In Data Centers
GPU 101: The Beast In Data CentersGPU 101: The Beast In Data Centers
GPU 101: The Beast In Data CentersRommel Garcia
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsAli Hodroj
 
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopBig Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopHazelcast
 
RAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needsRAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needsConnected Data World
 
Powering Real-Time Big Data Analytics with a Next-Gen GPU Database
Powering Real-Time Big Data Analytics with a Next-Gen GPU DatabasePowering Real-Time Big Data Analytics with a Next-Gen GPU Database
Powering Real-Time Big Data Analytics with a Next-Gen GPU DatabaseKinetica
 
How to Leverage Mainframe Data with Hadoop: Bridging the Gap Between Big Iron...
How to Leverage Mainframe Data with Hadoop: Bridging the Gap Between Big Iron...How to Leverage Mainframe Data with Hadoop: Bridging the Gap Between Big Iron...
How to Leverage Mainframe Data with Hadoop: Bridging the Gap Between Big Iron...Precisely
 
Hadoop Summit Amsterdam 2013 - Making Hadoop Ready for Prime Time - Syncsort ...
Hadoop Summit Amsterdam 2013 - Making Hadoop Ready for Prime Time - Syncsort ...Hadoop Summit Amsterdam 2013 - Making Hadoop Ready for Prime Time - Syncsort ...
Hadoop Summit Amsterdam 2013 - Making Hadoop Ready for Prime Time - Syncsort ...Steven Totman
 
Innovation with ai at scale on the edge vt sept 2019 v0
Innovation with ai at scale  on the edge vt sept 2019 v0Innovation with ai at scale  on the edge vt sept 2019 v0
Innovation with ai at scale on the edge vt sept 2019 v0Ganesan Narayanasamy
 
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization Denodo
 
Amsterdam - The Neo4j Graph Data Platform Today & Tomorrow
Amsterdam - The Neo4j Graph Data Platform Today & TomorrowAmsterdam - The Neo4j Graph Data Platform Today & Tomorrow
Amsterdam - The Neo4j Graph Data Platform Today & TomorrowNeo4j
 
The Neo4j Data Platform for Today & Tomorrow.pdf
The Neo4j Data Platform for Today & Tomorrow.pdfThe Neo4j Data Platform for Today & Tomorrow.pdf
The Neo4j Data Platform for Today & Tomorrow.pdfNeo4j
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
 
Productionizing Hadoop: 7 Architectural Best Practices
Productionizing Hadoop: 7 Architectural Best PracticesProductionizing Hadoop: 7 Architectural Best Practices
Productionizing Hadoop: 7 Architectural Best PracticesMapR Technologies
 
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019VMware Tanzu
 

GOAI: GPU-Accelerated Data Science DataSciCon 2017

  • 1. 1 GOAI: GPU-ACCELERATED DATA SCIENCE Joshua Patterson | Director of Applied Solutions Engineering | DataSciCon 2017 @datametrician
  • 2. 2 SPARK ECOSYSTEM The Glue of Big Data • Spark has almost become synonymous with Hadoop and Big Data • It’s the interface/API for big data app to app communication • The processing layer for big data and leading ML framework
  • 3. 3 DATA PROCESSING EVOLUTION Faster Data Access Less Data Movement HDFS Read HDFS Write HDFS Read HDFS Write HDFS Read Query ETL ML Train Hadoop Processing, Reading from disk
  • 4. 4 DATA PROCESSING EVOLUTION Faster Data Access Less Data Movement HDFS Read HDFS Write HDFS Read HDFS Write HDFS Read Query ETL ML Train HDFS Read Query ETL ML Train Hadoop Processing, Reading from disk 25-100x Improvement Less code Language flexible Primarily In-Memory Spark In-Memory Processing
  • 5. 5 SPARK ECOSYSTEM Lacks Full GPU Integration • 4 Core Parts: SQL, Streaming (Spark functions micro batched), Machine Learning, & Graph • Spark is currently optimizing its existing code base, adding more usability, not GPU support yet
  • 6. 6 25-100x Improvement Less code Language flexible Primarily In-Memory DATA PROCESSING EVOLUTION Faster Data Access Less Data Movement HDFS Read HDFS Write HDFS Read HDFS Write HDFS Read Query ETL ML Train HDFS Read Query ETL ML Train HDFS Read GPU Read Query CPU Write GPU Read ETL CPU Write GPU Read ML Train 5-10x Improvement More code Language rigid Substantially on GPU GPU/Spark In-Memory Processing Hadoop Processing, Reading from disk Spark In-Memory Processing
  • 7. 7 Pre-GPU DATA FRAME CURRENT H2O.ai Graphistry Anaconda Gunrock BlazingDB MapD CPU APP A APP B Copy & Convert Copy & Convert Copy & Convert Copy & Convert Copy & Convert Copy & Convert Copy & Convert Too Much Glue Code & Lack Of Standards • For GPU applications to talk to each other, data must be copied and converted up to three times • Each company has to build and maintain connectors to copy and convert • Some products wanted direct connectors to other products • Reduced hops but more for them to maintain and develop • A standard was needed • ISVs always starting from scratch • Barrier to entry and integration
  • 8. 8 GPU Data Frame Data Movement Kills Performance [Chart: number of data handoffs vs. volume of data; series: Handoff, Pre-GPU DATA FRAME]
  • 9. 9 GPU-ACCELERATED ARCHITECTURE THEN Too much data movement and too many different data formats CPU GPU APP A APP B Read Data H2O.ai Anaconda Gunrock Graphistry BlazingDB MapD Copy & Convert Copy & Convert Copy & Convert Load Data APP A GPU Data APP B GPU Data
  • 10. 10 GPU-ACCELERATED ARCHITECTURE THEN Too much data movement and too many different data formats CPU GPU APP A APP B Read Data H2O.ai Anaconda Gunrock Graphistry BlazingDB MapD Copy & Convert Copy & Convert Copy & Convert Load Data APP A GPU Data APP B GPU Data
  • 11. 11 INTEROPERABILITY IN BIG DATA Lessons Learned From Apache Arrow & Parquet • Both Apache Arrow and Apache Parquet are compressed columnar storage • Arrow resides in memory whereas Parquet resides on disk • There is a major push in the big data world to remove the bottleneck of copying and converting data between systems, which was also a major issue in the GPU world
  • 12. 12 GPU-ACCELERATED ARCHITECTURE NOW Single data format and shared access to data on GPU CPU GPU GPU MEM Read Data H2O.ai Anaconda Gunrock Graphistry BlazingDB MapD Load Data GPU Data Frame, based on Apache Arrow
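The contrast between the "then" and "now" architectures can be sketched on the CPU with the Python standard library alone. This is only an analogy, not GOAI code: `app_a_produce`, `copy_and_convert`, and `share` are hypothetical names, and a `memoryview` over an `array` stands in for a shared GPU Data Frame buffer.

```python
from array import array

# Hypothetical stand-ins for two applications; none of these names come from the slides.
def app_a_produce() -> array:
    """App A materializes one column of float64 values in a contiguous buffer."""
    return array("d", [1.0, 2.0, 3.0, 4.0])

def copy_and_convert(column: array) -> list:
    """The 'then' path: App B rebuilds the data in its own format (a full copy)."""
    return [float(x) for x in column]

def share(column: array) -> memoryview:
    """The 'now' path: App B receives a view over the same bytes, with no copy."""
    return memoryview(column)

column = app_a_produce()
copied = copy_and_convert(column)
shared = share(column)

column[0] = 99.0  # App A updates its buffer in place

print(copied[0])  # the converted copy still holds the old value
print(shared[0])  # the shared view reflects the update
```

With a common format, the second application reads the producer's buffer directly, so the cost of a handoff no longer grows with the volume of data.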
  • 13. 13 GPU OPEN ANALYTICS INITIATIVE github.com/gpuopenanalytics GPU Data Frame (GDF) Ingest/ Parse Exploratory Analysis Feature Engineering ML/DL Algorithms Grid Search Scoring Model Export @gpuoai Apache Arrow
  • 16. 16 USE GPUS IN PYTHON @gpuoai
  • 17. 17 GROWING COMMUNITY SUPPORT Apache Arrow Apache Parquet
  • 18. 18 GPU ACCELERATION ACROSS THE ECOSYSTEM
  • 19. 19 25-100x Improvement Less code Language flexible Primarily In-Memory DATA PROCESSING EVOLUTION Faster Data Access Less Data Movement HDFS Read HDFS Write HDFS Read HDFS Write HDFS Read Query ETL ML Train HDFS Read Query ETL ML Train HDFS Read GPU Read Query CPU Write GPU Read ETL CPU Write GPU Read ML Train Arrow Read Query ETL ML Train 5-10x Improvement More code Language rigid Substantially on GPU 25-100x Improvement Same code Language flexible Primarily on GPU End to End GPU Processing (GOAI) GPU/Spark In-Memory Processing Hadoop Processing, Reading from disk Spark In-Memory Processing
  • 20. 20 Expand GPU Usage More Data, Less Hardware Scaling up and out with GPU co-processors [Chart: peak double-precision TFLOPS (0.0 to 8.0), NVIDIA GPU vs. x86 CPU, 2008 to 2017]
  • 21. 21 ANACONDA Python ETL for GPU Numba: a Python open-source just-in-time optimizing compiler that uses LLVM to produce native machine instructions. Primary contributor to PyGDF. Dask: a flexible parallel computing library for analytic computing with dynamic task scheduling and big data collections. Primary contributor to Dask_GDF. Jeremy Howard (deep learning researcher & educator; founder: fast.ai; faculty: USF & Singularity University; previously CEO: Enlitic, President: Kaggle, CEO: Fastmail): "Rewrote @scikit_learn PolynomialFeatures in @ContinuumIO Numba. Got a 40x speedup (would be bigger with more data!) 12 lines of code"
  • 23. 23 GRAPHISTRY Graph Visualization Optimized Networking, GPU Analysis and ML, GPU Rendering. Hunting: Daily Anomalies. SecOps: Shadow IT Use. IR: Killchain Analysis. Fraud: Tracking Embezzlers. Threat Intel: Botnet Analysis
  • 24. 24 GPU-accelerated graph analytics library Multi-GPU optimized algorithms Reduced cost and increased performance Performance constantly improving GUNROCK
  • 25. 25 H2O.AI H2O4GPU - GPU Machine Learning Library
  • 26. 26
  • 28. 28
  • 29. 29 MAPD MapD Core MapD Immerse LLVM Backend Rendering Streaming LLVM creates one custom function that runs at speeds approaching hand-written functions. LLVM enables generic targeting of different architectures + run simultaneously on CPU/GPU. Speed eliminates need to pre-index or aggregate data. Compute resides on GPUs freeing CPUs to parse + ingest. Finally, newest data can be combined with billions of rows of “near historical” data. Data goes from compute (CUDA) to graphics (OpenGL) pipeline without copy and comes back as compressed PNG (~100 KB) rather than raw data (> 1GB).
  • 30. 30 MAPD ARCHITECTURE Visualization Libraries JavaScript libraries that allow users to build custom web-based visualization apps powered by a MapD Core database based on DC.js. LLVM MapD Core SQL queries are compiled with a just-in-time (JIT) LLVM based compiler, and run as NVIDIA GPU machine code. Distributed Scale-out MapD Core has native distributed scale-out capabilities. MapD Core users can query and visualize larger datasets with much smaller cluster sizes than traditional solutions. High Availability MapD Core has high availability functionality that provides durability and redundancy. Ingest and queries are load balanced across servers for additional throughput. Open Source Commercial
  • 31. 31 CYBER SECURITY An Ideal Use Case for GPU Acceleration
  • 32. 32 FIRST PRINCIPLES OF CYBER SECURITY Where the industry must go 1. Indication of compromise needs to improve as attacks are becoming more sophisticated, subtle, and hidden in the massive volume and velocity of data. Combining machine learning, graph analysis, and applied statistics, and integrating these methods with deep learning is essential to reduce false positives, detect threats faster, and empower analysts to be more efficient. 2. Event management is an accelerated analytics problem; the volume and velocity of data from devices requires a new approach that combines all data sources to allow for more intelligent/advanced threat hunting and exploration at scale across machine data. 3. Visualization will be a key part of daily operations, which will allow analysts to label and train Deep Learning models faster, and validate machine learning predictions.
  • 33. 33 RULES & PEOPLE DON’T SCALE Current methods are too slow Right now, financial services reports it takes an average of 98 days to detect an Advanced Threat but retailers say it can be about seven months. Once the security community moves beyond the mantras “encrypt everything” and “secure the perimeter,” it can begin developing intelligent prioritization and response plans to various kinds of breaches – with a strong focus on integrity. The challenge lies in efficiently scaling these technologies for practical deployment, and making them reliable for large networks. This is where the security community should focus its efforts. http://www.wired.com/2015/12/the-cia-secret-to-cybersecurity-that-no-one-seems-to-get/
  • 34. 34 ATTACKS ARE MORE SOPHISTICATED How Hackers Hijacked a Bank’s Entire Online Operation https://www.wired.com/2017/04/hackers-hijacked-banks-entire-online-operation/
  • 35. 35 FIRST PRINCIPLES OF CYBER SECURITY Where the industry must go 1. Indication of compromise needs to improve as attacks are becoming more sophisticated, subtle, and hidden in the massive volume and velocity of data. Combining machine learning, graph analysis, and applied statistics, and integrating these methods with deep learning is essential to reduce false positives, detect threats faster, and empower analysts to be more efficient.
  • 36. 36 MULTI MODEL APPROACH No Silver Bullet In Cyber Security nvGRAPH https://github.com/h2oai/h2o4gpu # edges = E * 2^S ~34M
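The edge-count formula on slide 36 can be checked with a line of arithmetic. The slide does not say which values of E and S were used; the sketch below assumes E = 16 and S = 21 purely for illustration, since those values land close to the quoted ~34M. `edge_count` is a hypothetical helper name.

```python
def edge_count(edge_factor: int, scale: int) -> int:
    """Number of edges under the slide's formula: E * 2^S."""
    return edge_factor * 2 ** scale

# Assumed values, not taken from the slide.
edges = edge_count(edge_factor=16, scale=21)
print(f"{edges:,}")  # 33,554,432
```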
  • 37. 37 FIRST PRINCIPLES OF CYBER SECURITY Where the industry must go 1. Indication of compromise needs to improve as attacks are becoming more sophisticated, subtle, and hidden in the massive volume and velocity of data. Combining machine learning, graph analysis, and applied statistics, and integrating these methods with deep learning is essential to reduce false positives, detect threats faster, and empower analysts to be more efficient. 2. Event management is an accelerated analytics problem; the volume and velocity of data from devices requires a new approach that combines all data sources to allow for more intelligent/advanced threat hunting and exploration at scale across machine data.
  • 38. 38 GPU ACCELERATION Accelerate the Pipeline, Not Just Deep Learning • GPUs for deep learning = proven • Where else and how else can we use GPU acceleration? • Dashboards • Accelerating data pipeline • Stream processing • Building better models faster • First: GPU databases Data Ingestion Data Processing Visualization Model Training Inferencing
  • 39. 39 MOVING TO BIG DATA IS A START Spark outperforms traditional SIEM. SIEM vs Big Data Solution: production SIEM of a Fortune 500 enterprise vs a 10 node Spark cluster (~$60k in hardware). Data: 450+ columns, ~250 million events per day. Spark vs SIEM Benchmarks from Accenture Labs - Strata NY, Bsides LV
  • 40. 40 MOVING TO BIG DATA IS A START Spark outperforms traditional SIEM
      Typical Scenario | Time Period | SIEM | Big Data | Speed Up
      1. Show all network communication from one host (IP) to multiple hosts (IPs) | 1 Day | 3h 20m 13s | 1m 44s | 114 Times Faster
      1. (same scenario) | 1 Week | Not Feasible* | 4m 05s |
      2. Retrieve failed logon attempts in Active Directory | 1 Day | 18m 26s | 1m 37s | 10 Times Faster
      2. (same scenario) | 1 Week | 2h 13m 45s | 3m 10s | 41 Times Faster
      3. Search for Malware (exe) in Symantec logs | 1 Day | 3h 24m 36s | 1m 37s | 125 Times Faster
      3. (same scenario) | 1 Week | Not Feasible* | 3m 22s |
      4. View all proxy logs for a specific domain | 1 Day | 4h 30m 13s | 2m 54s | 92 Times Faster
      4. (same scenario) | 1 Week | Not Feasible* | 1m 09s** |
    Spark vs SIEM Benchmarks from Accenture Labs - Strata NY, Bsides LV
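The Speed Up column on slide 40 can be recomputed from the two timing columns. The slide does not say how the figures were derived; the sketch below assumes "times faster" means (SIEM time - Spark time) / Spark time, truncated to a whole number, which is one reading that happens to match the five rows with both timings. The helper names and the row labels are illustrative.

```python
def seconds(h: int = 0, m: int = 0, s: int = 0) -> int:
    """Convert an h/m/s duration from the slide into seconds."""
    return h * 3600 + m * 60 + s

def times_faster(siem: int, spark: int) -> int:
    """Assumed definition: relative time saved, truncated to an integer."""
    return (siem - spark) // spark

# (SIEM seconds, Spark seconds) for the rows where both timings are given.
rows = {
    "host to hosts, 1 day": (seconds(3, 20, 13), seconds(m=1, s=44)),
    "failed logons, 1 day": (seconds(m=18, s=26), seconds(m=1, s=37)),
    "failed logons, 1 week": (seconds(2, 13, 45), seconds(m=3, s=10)),
    "malware search, 1 day": (seconds(3, 24, 36), seconds(m=1, s=37)),
    "proxy logs, 1 day": (seconds(4, 30, 13), seconds(m=2, s=54)),
}

speedups = {name: times_faster(siem, spark) for name, (siem, spark) in rows.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor} times faster")
```

The plain ratio SIEM / Spark would come out one higher in each row (for example about 115 instead of 114 for the first scenario), so the difference between the two definitions is small at these magnitudes.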
  • 41. 41 GPU DATABASES ARE EVEN FASTER 1.1 Billion Taxi Ride Benchmarks (time in milliseconds)
      Query | MapD DGX-1 | MapD 4 x P100 | Redshift 6-node | Spark 11-node
      Query 1 | 21 | 30 | 1560 | 10190
      Query 2 | 80 | 99 | 1250 | 8134
      Query 3 | 150 | 269 | 2250 | 19624
      Query 4 | 372 | 696 | 2970 | 85942
    Source: MapD Benchmarks on DGX from internal NVIDIA testing following guidelines of Mark Litwintschik’s blogs: Redshift, 6-node ds2.8xlarge cluster & Spark 2.1, 11 x m3.xlarge cluster w/ HDFS @marklit82
  • 42. 42 FIRST PRINCIPLES OF CYBER SECURITY Where the industry must go 1. Indication of compromise needs to improve as attacks are becoming more sophisticated, subtle, and hidden in the massive volume and velocity of data. Combining machine learning, graph analysis, and applied statistics, and integrating these methods with deep learning is essential to reduce false positives, detect threats faster, and empower analysts to be more efficient. 2. Event management is an accelerated analytics problem; the volume and velocity of data from devices requires a new approach that combines all data sources to allow for more intelligent/advanced threat hunting and exploration at scale across machine data. 3. Visualization will be a key part of daily operations, which will allow analysts to label and train Deep Learning models faster, and validate machine learning predictions.
  • 43. 43
  • 44. 44 DATA PLATFORM-AS-A-SERVICE SCALE: • Handles 1M events/second • Auto-scales the cluster HIGH AVAILABILITY: • Offers HA with no data loss • Always-on architecture • Data replication SECURITY: • Data platform security has been implemented with VPCs in AWS • Dashboard access using NVIDIA LDAP SELF SERVICE: • Log-to-analytics • Kibana, JDBC access • Accessing data using BI tools
  • 47. 47 MAPD VS KIBANA Dashboards Comparison + Performance Test Method
  • 48. 48 DASHBOARD PERFORMANCE MapD Immerse vs Elastic Kibana [Chart: time to fully load (seconds, 0 to 300) vs. days of data (1 to 31); series: MapD Immerse (DGX), MapD Immerse (P2), Elastic Kibana; annotations: < 9s, < 12s]
  • 49. 49 VISUALIZATION WITH GPU Less hardware, more performance, more scale
  • 50. 50 VISUALIZATION WITH GPU Less hardware, more performance, more scale 1/10th the hardware 1-2 orders of magnitude more performance
  • 51. 51 VISUALIZATION WITH GPU Less hardware, more performance, more scale 1/10th the hardware 1-2 orders of magnitude more performance Real time visualization of 100K+ nodes 1M+ Edges 50-100x faster clustering than other solutions
  • 52. 52 LISTS DO NOT VISUALLY SCALE Text search is a great starting point! Does not scale Do not see the 30K+ events nor the IPs, users, nor how they relate…
  • 53. 53 BAR CHARTS HIDE RELATIONSHIPS Good for summaries! But not: individual items But not: behaviors, relationships, patterns, outliers, … ?
  • 54. 54 GRAPHS: A KEY MISSING VIEW Unified Model Shows entities, events, and relationships Multipurpose: connect, see, interact Visual Inspect individual items See behavior, patterns, and outliers Scale to enterprise workloads
  • 55. 55 DIFFERENT GRAPHS, DIFFERENT QUESTIONS Uni Ex: Network mapping “Is it safe to reboot this?” ip ip Hyper Ex: Incident response “Did this escalate?” Multi Ex: SSH trails “Is a user crossing zones?” ip user userip ip user event event user ip
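The three graph shapes on slide 55 can be sketched as three ways of turning the same event log into edge lists. This is one reading of the slide's uni / multi / hyper labels, not Graphistry's API; the event records, field names, and function names are all hypothetical.

```python
# Hypothetical event log; the field names are illustrative only.
events = [
    {"id": "e1", "src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "user": "alice"},
    {"id": "e2", "src_ip": "10.0.0.2", "dst_ip": "10.0.0.3", "user": "alice"},
]

def uni_graph(events):
    """One node type: ip -> ip edges, as in network mapping."""
    return [(e["src_ip"], e["dst_ip"]) for e in events]

def multi_graph(events):
    """Several node types linked directly: user -> ip edges, as in SSH trails."""
    return [(f"user:{e['user']}", f"ip:{e['dst_ip']}") for e in events]

def hyper_graph(events):
    """Each event becomes its own node, linked to every entity it mentions."""
    kinds = {"src_ip": "ip", "dst_ip": "ip", "user": "user"}
    return [
        (f"event:{e['id']}", f"{kind}:{e[field]}")
        for e in events
        for field, kind in kinds.items()
    ]

uni = uni_graph(events)
multi = multi_graph(events)
hyper = hyper_graph(events)
print(len(uni), len(multi), len(hyper))
```

In the hypergraph form, `ip:10.0.0.2` is attached to both events, which is what lets an analyst follow an incident from one event to the next.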
  • 57. 57 CYBERWORKS SIEM SDK Purpose Built SDK For SIEM Analytics Goals • Open Source Ecosystem & Select ISVs • Integration Points w/ leading security vendors • FireEye • Splunk • Palo Alto Networks Purpose A platform that lets analysts hunt and analyze data faster and at larger scale than traditional big data tools, to find unknown and zero-day threats. It will accelerate the threat detection ecosystem and harden cyber defense utilizing GPU ISVs and Deep Learning Frameworks.
  • 58. 58 CYBERWORKS ACTIVITIES Continuous Improvement Use GPU-accelerated databases to analyze data to improve hunting today, as well as enrich and label data for Deep Learning. Connect accelerated DBs to Splunk for event management, hunting, and exploration. Use Graphistry and MapD to visualize the data for anomaly and threat detection in new ways. The goal is to GPU-accelerate parts of Splunk through partnership and connect/bolt on GPU DBs/Graphistry. Use ML and Graph Analytics for feature extraction and behavioral analytics, an ensemble approach to detection. Expand Deep Learning training as more data is labeled/classified and threats are caught faster, building off DL techniques used in GFN, other groups, and external ISVs. Generalize Deep Learning for supervised and unsupervised anomaly and threat detection (Insider, APT, DDoS, etc.) while building our own cyber security deep learning accelerator. Use best practices from Driveworks and other accelerators and SDKs as a reference architecture. Leverage DL from other parts of the firm to accelerate development as well. While using Splunk Cloud to protect NVIDIA, we create a redundant path of data to enable R&D. nvGRAPH
  • 59. 59 CYBERWORKS ARCHITECTURE SecOps Data Sources Ingest Storage Stream Processing Batch Processing Serving Layer Notebook Visualization Graph Processing cuSTINGER Graph Visualization Interactivity Query Speed Gunrock Deep Learning Machine Learning
  • 60. 60 CYBERWORKS HARDWARE Scale out Cluster DGX Cluster NAS SIEM Notebooks End User 3rd Party Apps Messaging Queue Accelerating your SIEM
  • 61. 61 JOIN THE REVOLUTION Everyone Can Help! APACHE ARROW APACHE PARQUET GPU Open Analytics Initiative https://arrow.apache.org/ @ApacheArrow https://parquet.apache.org/ @ApacheParquet http://gpuopenanalytics.com/ @Gpuoai Integrations, feedback, documentation support, pull requests, new issues, or donations welcomed!