We have long stressed the growing need for unified messaging and streaming, and Apache Pulsar is the platform best suited to support this vision at large scale. In this talk, Matteo Merli shows how to take the unified messaging & streaming paradigm one step further and fully take advantage of their integration. The result is a drastically simplified architecture: a single system that supports data throughout its entire lifecycle, from the moment an event happens down to historical archiving. The ramifications of this shift are significant: Pulsar is in the perfect spot to enable tighter integration between the online and offline worlds.
2. Pulsar Virtual Summit Europe 2021
Matteo Merli
CTO @ StreamNative
Co-Creator and PMC Chair for Apache Pulsar
PMC Member Apache BookKeeper
Prev: Splunk, Streamlio, Yahoo
3. Pulsar and the data in motion
Messaging
Message passing between components, applications, and services
Streaming
Analyze events that just happened
4. Use Cases
Messaging
● OLTP, integration
● Main challenges:
○ Latency
○ Availability
○ Data durability
○ High-level features: routing, DLQ, delays, individual acks
Streaming
● Real-time analytics
● Main challenges:
○ Throughput
○ Ordering
○ Stateful processing
○ Batch + real-time
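As one sketch of these high-level messaging features: delayed delivery can be enabled per namespace with pulsar-admin. The namespace name and tick time below are illustrative, and a running Pulsar cluster (2.5+) is assumed; DLQ and individual acks are configured on the consumer side rather than by the admin tool.

```shell
# Enable delayed message delivery for a namespace
# (illustrative namespace and tick time; requires a running cluster).
bin/pulsar-admin namespaces set-delayed-delivery public/default \
  --enable --time 1s
```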
5. How can Pulsar support both?
Scalable Log Storage
+
Flexible messaging semantics
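One way to see the flexible semantics on top of the shared log: the same topic can back both styles through different subscription types. A minimal sketch with the pulsar-client CLI, assuming a running cluster; the topic and subscription names are illustrative.

```shell
# Messaging style: a Shared subscription lets many consumers compete
# for messages, with individual acknowledgments.
bin/pulsar-client consume persistent://public/default/events \
  -s app-workers -t Shared -n 0

# Streaming style: an Exclusive subscription preserves ordering for a
# single stream-processing consumer of the same topic.
bin/pulsar-client consume persistent://public/default/events \
  -s analytics -t Exclusive -n 0
```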
6. Why is messaging + streaming so important?
1. They get applied in different stages of handling the same data
12. Interactions are complex
1. Who is responsible for getting a feed of events?
2. Where is the data stored?
3. How can we feed updates back to online data stores?
4. What happens when systems are not available?
5. How is the schema of data enforced?
6. What is the security model?
13. If we could just share a single data platform…
15. How to tackle integration points
1. Different kinds of services can all share the same Pulsar cluster
2. Decouple services through topics
3. Provide isolation & availability guarantees
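A hedged sketch of this layout with pulsar-admin, assuming a running cluster: each service gets its own namespace in a shared tenant, and services are decoupled through the topics inside those namespaces. The tenant and namespace names are illustrative.

```shell
# Illustrative tenant/namespace layout for a shared cluster.
bin/pulsar-admin tenants create commerce
bin/pulsar-admin namespaces create commerce/orders
bin/pulsar-admin namespaces create commerce/analytics
```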
16. Using Pulsar as the single data platform
1. Single tooling
2. Supports many different APIs
3. Unified AuthN & AuthZ for access to data
4. Supports end-to-end schema validation
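Sketching the last two points with pulsar-admin, under the assumption of a running cluster with authorization enabled; the namespace and role names are illustrative.

```shell
# Grant a role consume-only access to a namespace.
bin/pulsar-admin namespaces grant-permission commerce/orders \
  --role analytics-svc --actions consume

# Require that clients connect with a schema that validates
# against the topic's registered schema.
bin/pulsar-admin namespaces set-schema-validation-enforce \
  --enable commerce/orders
```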
17. Removing "data ownership"
1. Keep a single copy of the data
2. Single source of truth
3. No single component is the "owner" of the data
4. Consumer components can get access directly to the source
5. There is no need for additional ad-hoc integrations
18. The data life-cycle
The data can reside in Pulsar for its entire life-cycle, from its inception to long-term storage.
19. The data life-cycle
1. Events are happening — (real-time)
2. Streaming Analytics — (< 1 second)
3. Data replay — (1 hour / days)
4. Long-term storage and batch processing — (days / months)
20. Managing the data life-cycle
1. Store the data only once
2. Make it available to all interested parties
3. Able to hold the data for an extended time
21. Managing the data life-cycle
Pulsar provides an infinite stream-storage abstraction:
a. Low-latency writes
b. Isolation between tail reads and catch-up reads
c. Long-term tiered storage
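Tiered storage can be sketched with pulsar-admin as follows, assuming a running cluster with an offload driver (e.g. S3) already configured on the brokers; namespace, topic, and threshold values are illustrative.

```shell
# Offload ledger segments beyond 10 GB to long-term tiered storage.
bin/pulsar-admin namespaces set-offload-threshold commerce/orders \
  --size 10G

# Or trigger offload manually for one topic and check its progress.
bin/pulsar-admin topics offload --size-threshold 10G \
  persistent://commerce/orders/events
bin/pulsar-admin topics offload-status \
  persistent://commerce/orders/events
```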
22. Isolating the different workloads
Every aspect of Pulsar is designed for multi-tenancy and multiple workloads:
● IO isolation
● Limits on access to resources: throttling, quotas, etc.
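As a hedged sketch of those resource limits with pulsar-admin, assuming a running cluster; the namespace and rate values below are illustrative, not recommendations.

```shell
# Cap publish and dispatch rates per namespace.
bin/pulsar-admin namespaces set-publish-rate commerce/orders \
  --msg-publish-rate 10000 --byte-publish-rate 10485760
bin/pulsar-admin namespaces set-dispatch-rate commerce/orders \
  --msg-dispatch-rate 10000 --dispatch-rate-period 1

# Bound how much backlog subscriptions may accumulate.
bin/pulsar-admin namespaces set-backlog-quota commerce/orders \
  --limit 10G --policy producer_request_hold
```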