Virtual Flink Forward 2020: Build your next-generation stream platform based on Apache Pulsar - Neng Lu

Build your next-generation stream platform based on Apache Pulsar
Neng Lu (@nlu90)

Who am I
❏ StreamNative Software Engineer
❏ Ex-Twitter
❏ Interested in event streaming technologies

“Flexible Pub/Sub Messaging
Backed by durable log storage”

A brief history of Apache Pulsar
❏ 2012: Pulsar idea started
❏ 5+ years in production, 100+ applications, 10+ data centers
❏ 2016/09 Yahoo open sourced Pulsar
❏ 2017/06 Yahoo donated Pulsar to ASF
❏ 2018/09 Pulsar graduated as a Top-Level project
❏ 25+ committers, 253 contributors, 1.4K+ forks, 5.4K+ stars
❏ Yahoo!, Yahoo! Japan, Tencent, Zhaopin, ...

Pulsar Use Cases
❏ Unified Event Center/Bus (Queuing + Streaming)
❏ Billing Service
❏ Push Notification
❏ Worker Queue
❏ Logging Pipeline
❏ IoT
❏ Streaming-first, unified data processing
❏ ...

Data Processing with Apache Pulsar

Data Processing Categories
❏ Interactive
  ❏ Time critical
  ❏ Medium data size
  ❏ Rerun on failures
❏ Batch
  ❏ The amount of data is huge
  ❏ Can run on a huge cluster
  ❏ Fine-grained fault tolerance
❏ Streaming
  ❏ Long running jobs
  ❏ Time critical
  ❏ Need scalability as well as resilience to failures
❏ Serverless
  ❏ Simple, light-weight processing
  ❏ Processing data with high velocity

Streaming-First
Batch processing is a special case of stream processing
A Flink view on computing

Infinite segmented streams
(pub/sub + segment)
A Pulsar view on data

Pulsar + Flink = Streaming-first, unified data processing

Pulsar - A cloud-native architecture
Stateless Serving
Durable Storage

Pulsar - Segment Centric Storage
❏ Topic Partition (Managed Ledger)
❏ The storage layer for a single topic partition
❏ Segment (Ledger)
❏ Single writer, append-only
❏ Replicated to multiple bookies
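
As a rough illustration of the layout above, the following in-memory Java model treats a topic partition as an ordered list of append-only segments and rolls over to a new segment once the open one is full. The class name `ToyManagedLedger` and the `maxEntriesPerSegment` rollover rule are assumptions made for this sketch; it is not the Pulsar or BookKeeper API, and it leaves out replication to bookies.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: a topic partition stored as a sequence of append-only segments.
class ToyManagedLedger {
    private final int maxEntriesPerSegment;                 // hypothetical rollover threshold
    private final List<List<String>> segments = new ArrayList<>();

    ToyManagedLedger(int maxEntriesPerSegment) {
        if (maxEntriesPerSegment <= 0) {
            throw new IllegalArgumentException("maxEntriesPerSegment must be positive");
        }
        this.maxEntriesPerSegment = maxEntriesPerSegment;
        segments.add(new ArrayList<>());                    // the first open segment
    }

    // Appends to the last (open) segment, opening a new one when it is full.
    // Returns {segmentIndex, entryIndex} of the stored entry.
    int[] append(String entry) {
        List<String> open = segments.get(segments.size() - 1);
        if (open.size() == maxEntriesPerSegment) {
            open = new ArrayList<>();                       // older segments are never written again
            segments.add(open);
        }
        open.add(entry);
        return new int[] {segments.size() - 1, open.size() - 1};
    }

    int segmentCount() {
        return segments.size();
    }

    String read(int segment, int entry) {
        return segments.get(segment).get(entry);
    }
}
```

Because closed segments are immutable in this model, they could be moved or copied independently of the open one, which is the property the segment-centric design relies on.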

Pulsar - Infinite stream storage

Pulsar - Stream as a unified view on data

Pulsar - Two levels of reading API
❏ Pub/Sub (Streaming)
❏ Read data from brokers
❏ Consume / Seek / Receive
❏ Subscription Mode - Failover, Shared, Key_Shared
❏ Reprocessing data by rewinding (seeking) the cursors
❏ Segment (Batch)
❏ Read data from storage (BookKeeper or tiered storage)
❏ Fine-grained Parallelism
❏ Predicate pushdown (publish timestamp)

Unified data processing on Pulsar

Flink Integration
❏ Available Connectors
❏ Streaming Source
❏ Streaming Sink
❏ Table Sink
❏ Flink 1.6.0
When Flink & Pulsar come together: https://flink.apache.org/2019/05/03/pulsar-flink.html

Flink 1.9 Integration
❏ Pulsar Schema Integration
❏ Table API as a first-class citizen
❏ Exactly-once source
❏ At-least-once sink

Pulsar Schema (1)
❏ Consensus of data at server-side
❏ Built-in schema registry
❏ Data schema on a per-topic basis
❏ Send and receive typed messages directly
❏ Validation
❏ Multi-version
❏ Schema evolution & compatibilities

Pulsar Schema (2)
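// Create a producer with a struct (Avro) schema and send a typed message
// ("user-topic" and "user-sub" are placeholder names)
Producer<User> producer = client.newProducer(Schema.AVRO(User.class))
    .topic("user-topic")
    .create();
producer.newMessage()
    .value(User.builder()
        .userName("pulsar-user")
        .userId(1L)
        .build())
    .send();
// Create a consumer with the same schema and receive a typed message;
// a consumer is created with subscribe() and needs a topic and a subscription name
Consumer<User> consumer = client.newConsumer(Schema.AVRO(User.class))
    .topic("user-topic")
    .subscriptionName("user-sub")
    .subscribe();
Message<User> message = consumer.receive();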

Pulsar Schema (3) - SchemaInfo
{
  "type": "JSON",
  "schema": "{
    "type":"record",
    "name":"User",
    "namespace":"com.foo",
    "fields":[
      {
        "name":"file1",
        "type":["null","string"],
        "default":null
      },
      {
        "name":"file2",
        "type":"string",
        "default":null
      },
      {
        "name":"file3",
        "type":["null","string"],
        "default":"dfdf"
      }
    ]
  }",
  "properties": {}
}

Pulsar Schema (6) - Compatibility Strategy

Pulsar Schema (7) - Multi versions

Pulsar-Flink (1) - Schema <-> Row
https://github.com/streamnative/pulsar-flink
● Topics without schema or with primitive schemas
○ `value` field for message payload
● Topics with struct schemas (AVRO, JSON)
○ Field names and types are kept in the row
● Metadata Fields
○ __key: Binary
○ __topic: String
○ __messageId: Binary
○ __publishTime: Timestamp
○ __eventTime: Timestamp
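
A minimal sketch of this mapping, assuming a row can be modelled as an ordered map from field name to value: payload fields (or a single `value` field for schemaless and primitive topics) come first, followed by the metadata fields listed above. `ToyRowMapper` and its parameters are hypothetical and are not the connector's actual classes.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: how one message could be laid out as a row.
class ToyRowMapper {
    // payloadFields == null models a topic without a struct schema;
    // in that case rawValue becomes the single `value` field.
    static Map<String, Object> toRow(Map<String, Object> payloadFields, Object rawValue,
                                     byte[] key, String topic, byte[] messageId,
                                     Instant publishTime, Instant eventTime) {
        Map<String, Object> row = new LinkedHashMap<>();
        if (payloadFields == null) {
            row.put("value", rawValue);
        } else {
            row.putAll(payloadFields);          // struct field names are kept as column names
        }
        // Metadata fields, using the names from the slide.
        row.put("__key", key);
        row.put("__topic", topic);
        row.put("__messageId", messageId);
        row.put("__publishTime", publishTime);
        row.put("__eventTime", eventTime);
        return row;
    }
}
```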

Pulsar-Flink (2) - Schema Examples
[Figure: schema examples in two panels, Primitive Schema and Avro Schema]

Pulsar-Flink (3) - Pulsar Source

Pulsar-Flink (4) - Streaming Tables

Pulsar-Flink (5) - Topic Partitions Discovery
● Find matching topics
● Fetch schemas for each topic
● Build schema-specific deserializer
● Each reader is responsible for one topic partition
● Each source task has a partition discovery task that checks for newly added partitions
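
The discovery step can be sketched as follows. The sketch assumes partitions are named with a trailing `-<index>` and that a partition belongs to the source task whose index equals `partitionIndex % parallelism`; both the naming and the assignment rule are assumptions for illustration, and `ToyPartitionDiscoverer` is a hypothetical class rather than part of the connector.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: per-task discovery of newly added topic partitions.
class ToyPartitionDiscoverer {
    private final int taskIndex;
    private final int parallelism;
    private final Set<String> known = new HashSet<>();   // partitions this task already reads

    ToyPartitionDiscoverer(int taskIndex, int parallelism) {
        this.taskIndex = taskIndex;
        this.parallelism = parallelism;
    }

    // Called periodically with the current partition list; returns the partitions
    // that are new since the last call and are owned by this task.
    List<String> discover(List<String> currentPartitions) {
        List<String> added = new ArrayList<>();
        for (String partition : currentPartitions) {
            int index = Integer.parseInt(partition.substring(partition.lastIndexOf('-') + 1));
            boolean ownedByThisTask = index % parallelism == taskIndex;   // assumed assignment rule
            if (ownedByThisTask && known.add(partition)) {
                added.add(partition);             // a new reader would be started for it
            }
        }
        return added;
    }
}
```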

Pulsar-Flink (6) Exactly-once Source
● Message order on partition basis
● Seek & read
● Checkpoints with MessageID
● Durable cursor to keep un-checkpointed messages alive
● Move cursor when a checkpoint is completed
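
The interplay between checkpoints and the durable cursor can be modelled in a few lines. In this sketch an integer position stands in for a MessageID, `snapshot` records the read position for a checkpoint, `checkpointComplete` advances the durable cursor, and `restore` seeks back after a failure. The class and method names are hypothetical and do not mirror the connector's or Flink's interfaces.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: a reader over one ordered partition with checkpointed positions.
class ToyCheckpointedReader {
    private final String[] log;                    // stands in for one topic partition
    private int position = 0;                      // next message to read (plays the role of a MessageID)
    private int durableCursor = 0;                 // messages before this may be discarded
    private final Map<Long, Integer> pending = new HashMap<>();

    ToyCheckpointedReader(String... log) {
        this.log = log;
    }

    String read() {
        return position < log.length ? log[position++] : null;
    }

    // Records the current position for this checkpoint and returns it as the
    // state that would be stored in the checkpoint.
    int snapshot(long checkpointId) {
        pending.put(checkpointId, position);
        return position;
    }

    // Only once the checkpoint is complete does the durable cursor move, so
    // messages read after the last completed checkpoint stay available.
    void checkpointComplete(long checkpointId) {
        Integer checkpointed = pending.remove(checkpointId);
        if (checkpointed != null && checkpointed > durableCursor) {
            durableCursor = checkpointed;
        }
    }

    // After a failure, seek back to the checkpointed position and re-read from there.
    void restore(int checkpointedPosition) {
        if (checkpointedPosition < durableCursor) {
            throw new IllegalStateException("position is behind the durable cursor");
        }
        position = checkpointedPosition;
    }

    int durableCursor() {
        return durableCursor;
    }
}
```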

Pulsar-Flink (7) - Pulsar Sink

Pulsar-Flink (8) - Write to streaming tables

Future directions
❏ Unified Source API for both batch and streaming execution
❏ FLIP-27
❏ Pulsar as a catalog
❏ Pulsar as a state backend
❏ Scale-out source parallelism
❏ Key_Shared & Sticky consumer
❏ End-to-end exactly-once
❏ Pulsar transaction in 2.5.0

Key_Shared Subscription
❏ Key based ordering
❏ Key can be the message key or a separate *order* key
❏ HashRing based routing
❏ Key based batcher
❏ Policies for messages without *keys*
https://github.com/apache/pulsar/wiki/PIP-34:-Add-new-subscribe-type-Key_shared
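
To illustrate hash-ring routing in general terms, the sketch below places several points per consumer on a ring and sends each key to the first point at or after the key's hash, so all messages with the same key reach the same consumer. This is a generic consistent-hashing model, not Pulsar's implementation; `ToyKeyRing`, the number of points per consumer, and the hash function are all assumptions.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model only: consistent-hash routing of message keys to consumers.
class ToyKeyRing {
    private static final int POINTS_PER_CONSUMER = 64;        // hypothetical setting
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addConsumer(String consumer) {
        for (int i = 0; i < POINTS_PER_CONSUMER; i++) {
            ring.put(hash(consumer + "#" + i), consumer);
        }
    }

    // The first ring point at or after the key's hash owns the key,
    // wrapping around to the start of the ring if necessary.
    String route(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no consumers");
        }
        Map.Entry<Integer, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // Simple integer mixer over String.hashCode(); not the hash Pulsar uses.
    private static int hash(String s) {
        int h = s.hashCode();
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        return h;
    }
}
```

In this model, adding a consumer only moves keys onto the new consumer; every other key keeps its previous owner, which keeps per-key ordering disruption small when the consumer set changes.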

Conclusion
❏ Apache Pulsar is a cloud-native messaging and streaming system
❏ Multi layered architecture
❏ Segment centric storage
❏ Two levels of reading API: Pub/Sub + Segment
❏ Apache Pulsar provides a unified view of data
❏ Apache Flink provides a unified view of computing
❏ Pulsar + Flink for streaming-first, unified data processing

Community
❏ Pulsar Website: https://pulsar.apache.org
❏ Twitter: @apache_pulsar / @streamnativeio
❏ Slack: https://apache-pulsar.herokuapp.com
❏ Mailing Lists
dev@pulsar.apache.org, users@pulsar.apache.org
❏ GitHub
https://github.com/apache/pulsar
❏ Medium
https://medium.com/streamnative
