Apache Pulsar and NiFi for Cloud Data Lakes

Welcome to
Apache Pulsar and Apache NiFi
for Cloud Data Lakes
In the meantime:
● (1) Use the chat to let us know
where you’re calling in from
● (2) Take part in our our poll, under
the booth “poll” tab in the right
panel of Hopin
We’ll start in 5min we’re just waiting for
People to sign in

Agenda
01
02
03
04
05
Intro to Apache Pulsar - Tim Spann
Intro to Apache NiFi - John Kuchmek
Demo
Key Takeaways + Resources
Additional Q&A

Tim Spann
Developer Advocate
● FLiP(N) Stack = Flink, Pulsar and NiFi Combined
● Streaming Systems & Data Architecture Expert
● Pulsar, Flink, Spark, NiFi, Big Data, Cloud, MXNet,
IoT, Java, Python, Sensors and more.
Tim Spann
Developer Advocate
StreamNative

Tim Spann
Developer Advocate
● Integration of OT & IT Data
● Cloudera Streaming SME
● NiFi, Spark, Flink, Kafka, Storm, Druid, Kudu,
Python, Sensors, PLCs, Private Cloud and Public
Cloud
John Kuchmek
Principal Solutions Engineer
Cloudera

Apache Pulsar is a Cloud-Native
Messaging and Event-Streaming Platform.

Why Apache Pulsar?
Uniﬁed
Messaging Platform
Guaranteed
Message Delivery Resiliency Inﬁnite
Scalability

Component Description
Value / data payload The data carried by the message. All Pulsar messages contain raw bytes, although
message data can also conform to data schemas.
Key Messages are optionally tagged with keys, used in partitioning and also is useful for
things like topic compaction.
Properties An optional key/value map of user-defined properties.
Producer name The name of the producer who produces the message. If you do not specify a producer
name, the default name is used. Message De-Duplication.
Sequence ID Each Pulsar message belongs to an ordered sequence on its topic. The sequence ID of
the message is its order in that sequence. Message De-Duplication.
Messages - the basic unit of Pulsar

Connectivity
• Libraries - (Java, Python, Go, NodeJS, WebSockets,
C++, C#, Scala, Rust,...)
• Functions - Lightweight Stream Processing (Java,
Python, Go)
• Connectors - Sources & Sinks (Cassandra, Kafka, …)
• Protocol Handlers - AoP (AMQP), KoP (Kafka), MoP
(MQTT)
• Processing Engines - Flink, Spark, Presto/Trino via
Pulsar SQL, NiFi
• Data Ofﬂoaders - Tiered Storage - (S3)
hub.streamnative.io

Apache NiFi Pulsar Connector
https://github.com/streamnative/pulsar-nifi-bundle

Apache NiFi is a GUI based Data Flow
tool that runs anywhere.

Why NiFi
• Enable easy ingestion, routing, management and delivery of any data anywhere (Edge,
cloud, data center) to any downstream system with built in end-to-end security and
provenance
ACQUIRE PROCESS DELIVER
• Over 350 Prebuilt Processors
• Easy to build your own
• Parse, Enrich & Apply Schema
• Filter, Split, Merger & Route
• Throttle & Backpressure
• Guaranteed Delivery
• Full data provenance
• Eco-system integration
Advanced tooling to industrialize ﬂow development
(Flow Development Life Cycle)
FTP
SFTP
HL7
UDP
XML
HTTP
EMAIL
HTML
IMAGE
SYSLOG
FTP
SFTP
HL7
UDP
XML
HTTP
EMAIL
HTML
IMAGE
SYSLOG
HASH
MERGE
EXTRACT
DUPLICATE
SPLIT
ROUTE TEXT
ROUTE CONTENT
ROUTE CONTEXT
CONTROL RATE
DISTRIBUTE LOAD
GEOENRICH
SCAN
REPLACE
TRANSLATE
CONVERT
ENCRYPT
TALL
EVALUATE
EXECUTE

Apache NiFi Capabilities
Data Ingest Data Transformation Data Enrichment
HTTP
Syslog
HL7
UDP
SFTP
MQTT
WS
Hash
Compress
Merge
Duplicate
Split
Encrypt
Syslog
REST
Mapcach
Enrich IP
GeoIP
XML

FLiP Stack Weekly
This week in Apache Flink, Apache Pulsar, Apache
NiFi, Apache Spark and open source friends.
https://bit.ly/32dAJft

Streaming NFTs
NFTs
NFT
Thermal
Aggregates
Status
APIs
CRYPTO
FEEDS
Weather
NFT++

Streaming NFTs
https://opensea.io/collection/tspannhw-collection

Key Takeaways
● Real-time ingest and data manipulation
● Easy to use/conﬁgure processors and
controller services
● Multiple ways to connect

Resources
Learn More about Nifi + Pulsar Integration
https://streamnative.io/apache-nifi-connector/
Github
https://github.com/tspannhw/awesome-nifi-pulsar
Blogpost on Apache Pulsar + Nifi Integration
https://hubs.ly/Q015PNMd0

● StreamNative: Pulsar-as-a-Service
● AWS Certiﬁed Associate Solutions
Architect
● Reach me at doug@streamnative.io
Doug Cohen
Head of Sales, StreamNative
Additional
Resources

Let’s Keep
in Touch!
Tim Spann
Developer Advocate
@PaaSDev
linkedin.com/in/timo
thyspann
github.com/tspannhw
John Kuchmek
Principal
Solutions Engineer
@K_Physics
linkedin.com/in/jkuch
mek
github.com/johnkuch

Pulsar Subscription Modes
Different subscription modes
have different semantics:
Exclusive/Failover - guaranteed
order, single active consumer
Shared - multiple active
consumers, no order
Key_Shared - multiple active
consumers, order for given key
Producer 1
Producer 2
Pulsar Topic
Subscription D
Consumer D-1
Consumer D-2
Key-Shared
<
K
1,
V
10
>
<
K
1,
V
11
>
<
K
1,
V
12
>
<
K
2
,V
2
0
>
<
K
2
,V
2
1>
<
K
2
,V
2
2
>
Subscription C
Consumer C-1
Consumer C-2
Shared
<
K
1,
V
10
>
<
K
2,
V
21
>
<
K
1,
V
12
>
<
K
2
,V
2
0
>
<
K
1,
V
11
>
<
K
2
,V
2
2
>
Subscription A Consumer A
Exclusive
Subscription B
Consumer B-1
Consumer B-2
In case of failure in
Consumer B-1
Failover

Schema Registry
Schema Registry
schema-1 (value=Avro/Protobuf/JSON) schema-2 (value=Avro/Protobuf/JSON) schema-3
(value=Avro/Protobuf/JSON)
Schema
Data
ID
Local Cache
for Schemas
+
Schema
Data
ID +
Local Cache
for Schemas
Send schema-1
(value=Avro/Protobuf/JSON) data
serialized per schema ID
Send (register)
schema (if not in
local cache)
Read schema-1
(value=Avro/Protobuf/JSON) data
deserialized per schema ID
Get schema by ID (if
not in local cache)
Producers Consumers

● Buffer
● Batch
● Route
● Filter
● Aggregate
● Enrich
● Replicate
● Dedupe
● Decouple
● Distribute

Messaging
Ideal for work queues that do not
require tasks to be performed in a
particular order—for example, sending
one email message to many recipients.
RabbitMQ and Amazon SQS are
examples of popular queue-based
message systems.
Pulsar: Uniﬁed Messaging + Data Streaming

Messaging
Ideal for work queues that do not
require tasks to be performed in a
particular order—for example, sending
one email message to many recipients.
RabbitMQ and Amazon SQS are
examples of popular queue-based
message systems.
Pulsar: Uniﬁed Messaging + Data Streaming
.. and Streaming
Works best in situations where the order
of messages is important—for example,
data ingestion.
Kafka and Amazon Kinesis are examples
of messaging systems that use streaming
semantics for consuming messages.

Pulsar Instance
Pulsar Cluster
Pulsar Instance
Pulsar Cluster

A Uniﬁed Messaging Platform
Message Queuing
Data Streaming

Topics
Tenants
(Compliance)
Tenants
(Data Services)
Namespace
(Microservices)
Topic-1
(Cust Auth)
Topic-1
(Location Resolution)
Topic-2
(Demographics)
Topic-1
(Budgeted Spend)
Topic-1
(Acct History)
Topic-1
(Risk Detection)
Namespace
(ETL)
Namespace
(Campaigns)
Namespace
(ETL)
Tenants
(Marketing)
Namespace
(Risk Assessment)
Pulsar Instance
Pulsar Cluster

Pulsar’s Publish-Subscribe model
Broker
Subscription
Consumer 1
Consumer 2
Consumer 3
Topic
Producer 1
Producer 2
● Producers send messages.
● Topics are an ordered, named channel that producers
use to transmit messages to subscribed consumers.
● Messages belong to a topic and contain an arbitrary
payload.
● Brokers handle connections and routes
messages between producers / consumers.
● Subscriptions are named conﬁguration rules
that determine how messages are delivered to
consumers.
● Consumers receive messages.

Producer-Consumer
Producer Consumer
Publisher sends data and
doesn't know about the
subscribers or their status.
All interactions go through
Pulsar and it handles all
communication.
Subscriber receives data
from publisher and never
directly interacts with it
Topic
Topic

streamnative.io
MQTT
On Pulsar
(MoP)

Pulsar Functions
● Lightweight
computation similar to
AWS Lambda.
● Speciﬁcally designed to
use Apache Pulsar as a
message bus.
● Function runtime can
be located within
Pulsar Broker.
A serverless event streaming
framework

streamnative.io
● Consume messages from one
or more Pulsar topics.
● Apply user-supplied
processing logic to each
message.
● Publish the results of the
computation to another topic.
● Support multiple
programming languages (Java,
Python, Go)
● Can leverage 3rd-party
libraries to support the
execution of ML models on
the edge.
Pulsar Functions

Pulsar SQL
Presto/Trino workers can read
segments directly from
bookies (or ofﬂoaded storage)
in parallel.
Bookie
1
Segment 1
Producer Consumer
Broker 1
Topic1-Part1
Broker 2
Topic1-Part2
Broker 3
Topic1-Part3
Segment 2 Segment 3 Segment 4 Segment X
Segment 1
Segment 1 Segment 1
Segment 3 Segment 3
Segment 3
Segment 2
Segment 2
Segment 2
Segment 4
Segment 4
Segment 4
Segment X
Segment X
Segment X
Bookie
2
Bookie
3
Query
Coordinator
...
...
SQL Worker SQL Worker SQL Worker
SQL Worker
Query
Topic
Metadata

Use Cases
Multi-Tenant Data
Infrastructure
AdTech
Fraud Detection
Connected Car
IoT Analytics
Data Lake Hydration

Apache NiFi Pulsar Connector
https://github.com/david-streamlio/pulsar-nifi-bundle

streamnative.io
Passionate and dedicated team.
Founded by the original developers of
Apache Pulsar.
StreamNative helps teams to capture,
manage, and leverage data using Pulsar’s
uniﬁed messaging and streaming
platform.

Founded By The
Creators Of Apache Pulsar
Sijie Guo
ASF Member
Pulsar/BookKeeper PMC
Founder and CEO
Jia Zhai
Co-Founder
Matteo Merli
ASF Member
CTO
Data veterans with extensive industry experience

REST Feed Non-Fungible Token
{"date":"Thu, 24 Feb 2022 22:26:41
GMT","short_description":"","featured":"false","image_thumbnail_url":"htt","asset_contract_created_date":"2022-02-17T15:4
8:44.822206","asset_contract_owner":"50299352","image_preview_url":"https://lh3.googleus","asset_contract_symbol":"TD
","twitter_username":"","description":"10,000metaverse-readyAvatars","asset_contract_address":"0xc7df86762ba83f2a619
7e1ff9bb40ae0f696b9e6","external_url":"https://www.sandbox.game/en/snoopdogg/","token_id":"492","asset_contract_na
me":"Theoggies","asset_contract_nft_version":"3.0","asset_contract_description":"metaverse.","asset_contract_external_lin
k":"https://www.sandbox.game/en/snoopdogg/","id":"307922619","featured_image_url":"https","slug":"snoop-dogg-doggie
s","token_metadata":"https://contracts.sandbox.game/unrevealed.json?tokenId=492","asset_contract_schema_name":"ER
C721","animation_url":"https","num_sales":"1","image_url":"https://lh","asset_contract_default_to_ﬁat":"false","external_link":
"","image_original_url":"https://contracts.sandbox.game/preview.png","asset_contract_payout_address":"0x4489590a1166
18b506f0efe885432f6a8ed998e9","animation_original_url":"https://con","background_color":"","asset_contract_asset_cont
ract_type":"non-fungible","name":"The Doggies","asset_contract_image_url":"https","asset_contract_total_supply":"0"}
https://docs.opensea.io/reference/retrieving-bundles

StreamNative Hub
StreamNative Cloud
Uniﬁed Batch and Stream STORAGE
Offload
(Queuing + Streaming)
Apache Pulsar - Apache NiFi <-> Events <-> Cloud Data Stores
Tiered Storage
Pulsar
---
KoP
---
MoP
---
Websocket
---
HTTP
Pulsar
Sink
Pulsar
Sink
Data Gateway
Protocols
Data to Cloud Data Lake
Micro
Service

StreamNative Cloud
Tiered Storage

Tiered Storage

Apache Pulsar and NiFi for Cloud Data Lakes

Apache Pulsar and NiFi for Cloud Data Lakes

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Semelhante a Apache Pulsar and NiFi for Cloud Data Lakes

Semelhante a Apache Pulsar and NiFi for Cloud Data Lakes (20)

Mais de Timothy Spann

Mais de Timothy Spann (20)

Último

Último (20)

Apache Pulsar and NiFi for Cloud Data Lakes