Pulsar Virtual Summit North
America 2021
Kiran Matty
Director of Product Management
Aerospike
whoami
▪ Director of Product for Ecosystem @ Aerospike
▪ Domain experience spans Big Data Infrastructure and Data Security @ Visa, Hortonworks, and Cisco
▪ Interests include large-scale distributed systems and AI/ML
▪ Lego builder in spare time
Training can take Forever…
TRAINING TIME
▪ Minutes – hours
▪ 1 - 4 Days
▪ 1 - 4 Weeks
▪ > 1 month
Source: Google I/O 2018
AI/ML needs Hybrid Storage
▪ Traditional HDD-based systems are not suitable for training
▪ Models need to be retrained to address data/model drift
Source: Micron
AI/ML needs memory-like access at petabyte scale with lower TCO
Why do other databases fall short?
Aerospike drives data-driven decisioning use cases
▪ High Frequency Trading
▪ IIoT / Predictive Maintenance
▪ Fraud Detection
▪ Personalization / Customer 360°
▪ AdTech Real-Time Bidding
A Blueprint for AI/ML
[Diagram layers: NOTEBOOK & ML PACKAGES; COMPUTE; STORAGE; CONTAINER PLATFORM; CLOUD / ON-PREM. Connectors: CONNECT for Spark; CONNECT for Pulsar; Python Client]
Why Pulsar?
▪ Durability
▪ Scalability
▪ Geo-Replication
▪ Multi-Tenancy
▪ Unified Messaging Model
Mapping Aerospike <> Pulsar Data models
Aerospike        RDBMS      Pulsar
Namespace        Database   Topic
Set (optional)   Table      Topic
Record           Row        Record
Bin              Column     Fields (based on schema)
Key              Key        Key
Mapping is via YAML files.
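The namespace/set-to-topic rows of the mapping can be pictured with a small routing sketch. This is only an illustration: the slides do not show the connector's YAML schema, so the `ROUTES` dictionary, the `resolve_topic` helper, and the topic names are all hypothetical stand-ins for whatever the YAML files actually express.

```python
from typing import Optional

# Hypothetical routing table standing in for the YAML mapping files.
# Keys are Aerospike namespaces; each may carry a namespace-level topic
# and optional per-set overrides. Topic names are made up for illustration.
ROUTES = {
    "device": {
        "topic": "persistent://public/default/device",
        "sets": {
            "streaming_write_set": "persistent://public/default/device-writes",
        },
    },
}


def resolve_topic(namespace: str, set_name: Optional[str] = None) -> Optional[str]:
    """Pick a Pulsar topic for an Aerospike namespace and (optional) set.

    A set-level entry wins over the namespace-level one; unmapped
    namespaces return None.
    """
    ns_routes = ROUTES.get(namespace)
    if ns_routes is None:
        return None
    if set_name is not None and set_name in ns_routes.get("sets", {}):
        return ns_routes["sets"][set_name]
    return ns_routes.get("topic")
```

Because the set is optional in Aerospike, the sketch falls back to the namespace-level topic when no set-specific route exists.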
Aerospike Connect for Pulsar
[Diagram labels: Publisher; IOT/edge devices; Aerospike Source Connector; Aerospike Sink Connector*; Pulsar; Pub/Sub API; Reader and Batch API; Pulsar IO/Connectors (Prebuilt Connectors, Custom Connectors); Schema Registry; Stream Processor Applications; Microservices or Event-Driven Architecture; Subscriber; Change Notifications]
Change Notification:
{"metadata":{"namespace":"device","set":"streaming_write_set","digest":"SH0QwiJxdW5Wkf/hAVJGn7Sw37U=","msg":"write","gen":38,"lut":0,"exp":0},"three":37089,"two":"two_89","one":37089}
*Not GA’d
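A subscriber would typically split a change notification into its metadata and its bins. The sketch below does that for the sample message from this slide using only the Python standard library. The helper name `split_notification` is hypothetical, and reading `gen`, `lut`, and `exp` as generation, last-update time, and expiration is an assumption based on common Aerospike terminology, not something the slide states.

```python
import base64
import json

# Sample change notification copied from the slide.
RAW = (
    '{"metadata":{"namespace":"device","set":"streaming_write_set",'
    '"digest":"SH0QwiJxdW5Wkf/hAVJGn7Sw37U=","msg":"write","gen":38,'
    '"lut":0,"exp":0},"three":37089,"two":"two_89","one":37089}'
)


def split_notification(raw: str):
    """Return (metadata, bins) from a JSON change notification.

    Everything outside the "metadata" object is treated as a bin.
    """
    doc = json.loads(raw)
    metadata = doc.pop("metadata")
    return metadata, doc


metadata, bins = split_notification(RAW)

# The digest is base64 text; decoding yields the raw record digest bytes.
digest = base64.b64decode(metadata["digest"])

print(metadata["namespace"], metadata["set"], metadata["msg"])
print(sorted(bins))
print(len(digest), "digest bytes")
```

In a real pipeline the raw string would come from a Pulsar consumer rather than a constant; that part is left out here so the sketch runs without a broker.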
Speeding up Training Pipeline (Conceptual View)
[Diagram, steps 1 to 4. Labels: Third Party Data; Aerospike Database, System of Record; CONNECT for Spark; AI/ML Platform; Data Preparation; Exploratory Data Analysis; Model Training; Parameter Tuning; Model Validation; Data Scientist; MODEL SERVING; ML Application; HTTP]
Real-time Inference (Conceptual View)
[Diagram labels: Edge Systems across Datacenters; Aerospike Database at Edge Location 1 through Edge Location n; XDR; Aerospike Database, Core System; Streaming Source; CONNECT for Pulsar; Pulsar Spark Connector; CONNECT for Spark; Data Preparation; Model Serving; Predictions; ML Application; Application Specialist; HTTP; API]
Massive Parallelism w/ Aerospike and Spark
CASE STUDY: GLOBAL AD TECH COMPANY
Massive Parallelization
✔ 80% reduction in Spark job execution time
✔ Reduced training time
✔ Increased frequency of retraining
Operational reliability at extreme scale
✔ 13B objects
✔ 150 TB unique data – multiple times a day
Increased ROI
✔ Only 33 Aerospike servers
✔ Increased utilization of Spark cluster (300 nodes and 7,500 cores)
“We were using custom code before which led to data quality issues and a complex data infrastructure. With Aerospike, we are processing Spark jobs that used to take 12 hours now in just 2.4.”
Senior Director, Data Science and Engineering, Top Global Ad Tech company
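As a quick consistency check, the 80% figure and the quoted job times line up. The snippet below is only arithmetic on the two numbers from the quote, assuming "2.4" means 2.4 hours; the variable names are illustrative.

```python
# Job durations taken from the customer quote (2.4 assumed to be hours).
before_hours = 12.0
after_hours = 2.4

# Fractional reduction in execution time and the equivalent speedup factor.
reduction = (before_hours - after_hours) / before_hours
speedup = before_hours / after_hours

print(f"reduction: {reduction:.0%}, speedup: {speedup:.1f}x")
```

Under that assumption the reduction works out to about 80%, which matches the first bullet, and corresponds to roughly a fivefold speedup.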
The Aerospike Difference for AI/ML
1. Reduce Training Time
▪ Execute Spark jobs faster with massive parallelism
2. Maximize ROI
▪ Eliminate compliance headaches by removing the need to copy data into multiple systems
▪ Aerospike data platform connects readily to Spark and Pulsar
3. Increase Frequency of Re-Training
▪ Conduct in-place data exploration
▪ Create low latency and high throughput streaming pipeline
“Aerospike is second to none for ingesting and persisting millions of events per second… (Aerospike) allows me to do near-instantaneous machine learning on the data as it lands.”
Theresa Melvin, Chief Architect of AI-Driven Big Data Solutions, HPE
Thank you
We are hiring for our India and US offices.
https://aerospike.com/solutions/use-cases/ai-ml/

Add Horsepower to AI/ML streaming Pipeline - Pulsar Summit NA 2021
