This document discusses optimizing an Apache Pulsar cluster to handle 10 PB of data per day for a financial customer. Initial estimates showed the cluster would need over 1,000 VMs using HDD storage. Various optimizations were implemented, including eliminating the journal, using direct I/O, compression, and C++ client optimizations. These reduced the estimate to roughly 200 VMs using local SSD (L-SSD) storage. The optimized cluster meets the customer's requirements: processing 10 PB of data per day with 3 hours of retention and protection against zone failure.
Scaling Apache Pulsar to 10 Petabytes/Day
1. Scaling Apache Pulsar to 10 PB/day
Karthik Ramasamy
Senior Director of Engineering at Splunk
2. Karthik Ramasamy
Senior Director of Engineering
@karthikz
streaming @splunk | ex-CEO of @streamlio | co-creator of @heronstreaming | ex @Twitter | Ph.D
5. Splunk Data Stream Processor
[Diagram: DSP turns raw data into high-value information (filter, enhance, aggregate, format, normalize, transform), detects data patterns or conditions, protects and masks sensitive data, and distributes data to Splunk or other destinations such as a data warehouse, public cloud, or message bus.]
A real-time stream processing solution that collects, processes, and delivers data to Splunk and other destinations in milliseconds.
6. DSP - Bird's Eye View
[Diagram: data sources (HEC, S2S, batch, REST client, forwarders) feed Apache Pulsar, which feeds the stream processing engine; results are delivered to Splunk indexers and external systems.]
Apache Pulsar is at the core of DSP
8. Use Cases
■ Marquee customer is in finance and payments
■ Microservices and applications emit logs
■ Logs contain rich information
■ Process these logs and extract monitoring & tracing information
■ Filter these logs based on volume and whether their value justifies retention
■ Compute real-time business metrics
9. Data Requirements
■ Environment - Google Cloud Platform
■ Use of n1-standard-32 VMs
■ Raw data ingestion of 10 PB/day, which translates to ~120 GB/s
■ Data retention of 3 hours
■ Need to handle the entire traffic load when a zone fails
11. DSP Deployment
■ Separation of ingestion and computation
■ Pipeline isolation and no noisy-neighbor issues
■ Troubleshooting a single pipeline gets easier
■ Might not need overprovisioning beyond peak load plus a fudge factor (as compared to deploying a single cluster)
12. VM Configuration - n1-standard-32
■ 32 vCPUs
■ 120 GB of memory
■ Max number of PDs (EBS equivalent) - 128
■ Max total PD size - 257 TB
■ Max egress network bandwidth - 32 Gbps (4 GB/s)
■ Max of 24 local SSDs (L-SSDs) for a total of 9 TB
15. Apache Pulsar Requirements
■ Replica factor of 3
■ Need to handle 120 GB/s of raw traffic
■ Need to handle 360 GB/s of storage write bandwidth
■ With the journal, required write bandwidth is 720 GB/s
■ Total storage required for retention - 3.9 PB
■ Total ingress network bandwidth - 480 GB/s
■ Total egress network bandwidth - 1200 GB/s
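For reference, the arithmetic behind these figures: 10 PB/day ÷ 86,400 s/day ≈ 120 GB/s of raw traffic; a replica factor of 3 turns that into 120 × 3 = 360 GB/s of storage writes; and because each bookie writes every entry twice (once to the journal, once to the entry log), the journal doubles it to 720 GB/s. Retention needs 120 GB/s × 3 h × 3,600 s/h × 3 replicas ≈ 3.9 PB, and ingress is raw traffic plus replication writes: 120 + 360 = 480 GB/s.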
16. Pulsar Cluster Size Estimation
■ The size of a Pulsar cluster for a given workload depends on three parameters:
■ Storage Density - aggregate storage capacity needed in the cluster, proportional to the data retention
■ Storage Bandwidth - aggregate write and read throughput needed for data ingestion and consumption; heavily dependent on the storage media
■ Network Bandwidth - aggregate network bandwidth available in the cluster for input and output traffic
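Put together, the estimate is the binding constraint: required VMs ≈ max(total storage ÷ per-VM storage, total write bandwidth ÷ per-VM write throughput, total network traffic ÷ per-VM network bandwidth). The next three slides evaluate this for three storage media, and a short code sketch of the calculation follows them.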
17. Estimating VMs using P-HDD
■ Max of 200 MB/s write throughput per VM
■ Max of 9 TB per instance
■ Max of 4 GB/s egress and ingress bandwidth
■ Dominated by storage bandwidth
18. Estimating VMs using P-SSD
■ Max of 400 MB/s write throughput per VM
■ Max of 9 TB per instance
■ Max of 4 GB/s egress and ingress bandwidth
■ Dominated by storage bandwidth
19. Estimating VMs using L-SSD
■ Max of 850 MB/s write throughput per VM
■ Max of 9 TB per instance
■ Max of 4 GB/s egress and ingress bandwidth
■ Dominated by storage bandwidth
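The three estimates above all reduce to the same max-of-three-constraints calculation. Below is a minimal C++ sketch of it, using the n1-standard-32 limits from slide 12 and the workload figures from slide 15; the counts it prints are pre-optimization (journal included, no compression), and a real deployment adds headroom for zone failure and peak load.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>

// Back-of-the-envelope cluster sizing: the VM count is whichever of the
// three constraints (storage bandwidth, storage density, network) binds
// first. Per-VM limits are the n1-standard-32 figures quoted earlier.
int vmsNeeded(double writeGBps, double retainedPB, double networkGBps,
              double perVmWriteGBps) {
    const double perVmStorageTB = 9.0;   // 24 L-SSDs per VM
    const double perVmNetGBps   = 4.0;   // 32 Gbps egress/ingress cap
    double byBandwidth = writeGBps / perVmWriteGBps;
    double byStorage   = retainedPB * 1000.0 / perVmStorageTB;
    double byNetwork   = networkGBps / perVmNetGBps;
    return static_cast<int>(
        std::ceil(std::max({byBandwidth, byStorage, byNetwork})));
}

int main() {
    // Pre-optimization workload from slide 15: 720 GB/s of storage writes
    // (the journal doubles the 360 GB/s replicated stream), 3.9 PB retained,
    // and 1200 GB/s as the larger of the two network figures.
    printf("P-HDD (200 MB/s/VM): %d VMs\n", vmsNeeded(720, 3.9, 1200, 0.20));
    printf("P-SSD (400 MB/s/VM): %d VMs\n", vmsNeeded(720, 3.9, 1200, 0.40));
    printf("L-SSD (850 MB/s/VM): %d VMs\n", vmsNeeded(720, 3.9, 1200, 0.85));
    // Storage bandwidth dominates in every case, which is the point of the
    // three slides above; eliminating the journal and compressing the data
    // shrink exactly that term.
    return 0;
}
```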
22. Optimization #1 - Eliminating Journal
Since all the data is machine logs, we implemented replicated durability.
■ Different types of durability:
■ Persistent Durability - No data loss in the presence of node failures or entire cluster failure
■ Replicated Durability - No data loss in the presence of a limited number of node failures
■ Transient Durability - Data loss in the presence of failures
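The implementation here removes the journal from the bookie write path entirely. As a rough, illustrative approximation in stock Apache BookKeeper (not the change described in the talk), the journal's fsync-per-ack can be relaxed so that acknowledgment relies on the replica set rather than on stable storage; verify the flag against your BookKeeper version:

```
# bookkeeper.conf sketch (illustrative): acknowledge writes without
# fsyncing the journal, relying on the 3-replica write quorum instead.
journalSyncData=false
```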
24. Optimization #2 - Direct I/O
■ Overhead of the page cache in a container environment is quite high
■ The kernel needs to keep track of the page-cache usage quota per container
■ This translates into maintaining additional data structures and lookups (older kernels had n^2 lookup time for getting pages in & out)
■ Bypassed the page cache for the BookKeeper entry log, using JNI:
■ We already have in-memory caches (write and read-ahead)
■ We have better control over what to cache and when to evict
■ Avoids double buffering
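As an illustration of the mechanism (not the actual BookKeeper patch), here is a minimal Linux C++ sketch of a direct-I/O write, the kind of call the JNI layer ends up issuing. O_DIRECT requires the buffer address, transfer size, and file offset to be block-aligned, hence posix_memalign:

```cpp
#include <fcntl.h>      // open, O_DIRECT (g++ defines _GNU_SOURCE)
#include <unistd.h>     // write, close
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main() {
    const size_t kAlign = 4096;     // logical block size on most devices
    const size_t kChunk = 1 << 20;  // write entry-log data in 1 MiB chunks

    // O_DIRECT rejects unaligned buffers, so allocate with posix_memalign.
    void* buf = nullptr;
    if (posix_memalign(&buf, kAlign, kChunk) != 0) return 1;
    std::memset(buf, 0, kChunk);

    // Bypass the page cache entirely: no per-container quota tracking,
    // no double buffering with the application-level write cache.
    int fd = open("entrylog.0", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); std::free(buf); return 1; }

    if (write(fd, buf, kChunk) != static_cast<ssize_t>(kChunk))
        perror("write");

    close(fd);
    std::free(buf);
    return 0;
}
```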
30. Surviving Zone Failure
[Diagram: segments are replicated by brokers across storage bookies in Zone A, Zone B, and Zone C, so every segment has copies in multiple zones.]
■ Zone/Rack Failures
■ Bookies provide rack awareness
■ Brokers replicate data to different racks/zones
■ In the presence of a zone/rack failure, data is available in other zones
■ One zone failure means the two remaining zones must be capable of handling the entire traffic
■ Requires 50% additional VMs
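Rack awareness is driven from the broker-side BookKeeper client. A minimal broker.conf sketch, treating each GCP zone as a "rack" (bookies are then mapped to zones, e.g. with pulsar-admin bookies set-bookie-rack):

```
# broker.conf sketch: place the replicas of each segment in different
# zones by enabling the rack-aware placement policy.
bookkeeperClientRackawarePolicyEnabled=true
```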
32. Optimization #4 - C++ Client CPU & Memory Usage
■ Better round robin across partitions - maximizing the batch size per partition
■ Having bigger batches reduces CPU usage for the client, brokers, and bookies
■ Increases the compression factor
■ Reduced client memory usage
■ Optimizations to minimize memory allocation overhead
■ Implemented a memory limit in the C++ producer
■ Simplifies the user configuration - one single setting instead of multiple queue sizes and complex math
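As a sketch of what these knobs look like in the Pulsar C++ client API (service URL and topic are placeholders; setMemoryLimit is the single memory setting referred to above and is available in newer client releases):

```cpp
#include <pulsar/Client.h>

using namespace pulsar;

int main() {
    // One client-wide memory cap instead of sizing per-partition queues.
    ClientConfiguration clientConf;
    clientConf.setMemoryLimit(64 * 1024 * 1024);  // 64 MiB for all producers

    Client client("pulsar://localhost:6650", clientConf);

    // Bigger batches amortize per-message CPU on client, broker and bookies,
    // and give the compressor more redundancy to exploit.
    ProducerConfiguration producerConf;
    producerConf.setBatchingEnabled(true);
    producerConf.setBatchingMaxMessages(1000);
    producerConf.setBatchingMaxPublishDelayMs(10);
    producerConf.setCompressionType(CompressionLZ4);

    Producer producer;
    if (client.createProducer("persistent://public/default/logs",
                              producerConf, producer) != ResultOk)
        return 1;

    producer.send(MessageBuilder().setContent("sample log line").build());

    producer.close();
    client.close();
    return 0;
}
```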