[March SN Meetup] Apache Pulsar + Apache NiFi for Cloud Data Lake

Developer Advocate at StreamNative
Mar 11, 2022

  1. Welcome to Apache Pulsar and Apache NiFi for Cloud Data Lakes In the meantime: ● (1) Use the chat to let us know where you’re calling in from ● (2) Take part in our poll, under the booth “poll” tab in the right panel of Hopin We’ll start in 5 minutes; we’re just waiting for people to sign in.
  2. Agenda ● 01 Intro to Apache Pulsar - Tim Spann ● 02 Intro to Apache NiFi - John Kuchmek ● 03 Demo ● 04 Key Takeaways + Resources ● 05 Additional Q&A
  3. Tim Spann, Developer Advocate, StreamNative ● FLiP(N) Stack = Flink, Pulsar and NiFi Combined ● Streaming Systems & Data Architecture Expert ● Pulsar, Flink, Spark, NiFi, Big Data, Cloud, MXNet, IoT, Java, Python, Sensors and more.
  4. John Kuchmek, Principal Solutions Engineer, Cloudera ● Integration of OT & IT Data ● Cloudera Streaming SME ● NiFi, Spark, Flink, Kafka, Storm, Druid, Kudu, Python, Sensors, PLCs, Private Cloud and Public Cloud
  5. Apache Pulsar is a Cloud-Native Messaging and Event-Streaming Platform.
  6. Why Apache Pulsar? ● Unified Messaging Platform ● Guaranteed Message Delivery ● Resiliency ● Infinite Scalability
  7. Messages - the basic unit of Pulsar
     Component: Description
     ● Value / data payload: The data carried by the message. All Pulsar messages contain raw bytes, although message data can also conform to data schemas.
     ● Key: Messages are optionally tagged with keys, which are used in partitioning and are also useful for features such as topic compaction.
     ● Properties: An optional key/value map of user-defined properties.
     ● Producer name: The name of the producer that produced the message. If you do not specify a producer name, a default name is used. Used for message de-duplication.
     ● Sequence ID: Each Pulsar message belongs to an ordered sequence on its topic. The sequence ID of the message is its order in that sequence. Used for message de-duplication.
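
The message components on this slide can be modeled in a few lines of Python. This is a minimal sketch, not Pulsar's implementation: `Message` and `Deduplicator` are hypothetical names, and the rule shown (drop any message whose sequence ID is not higher than the last one seen from the same producer) is a simplified assumption about how producer name and sequence ID support de-duplication.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Message:
    """Hypothetical model of the message components listed on the slide."""
    value: bytes                       # raw payload bytes
    producer_name: str                 # who produced the message
    sequence_id: int                   # position in the producer's sequence
    key: Optional[str] = None          # optional, used for partitioning/compaction
    properties: Dict[str, str] = field(default_factory=dict)


class Deduplicator:
    """Simplified de-duplication: track the highest sequence ID per producer."""

    def __init__(self) -> None:
        self._last_seen: Dict[str, int] = {}

    def accept(self, msg: Message) -> bool:
        last = self._last_seen.get(msg.producer_name, -1)
        if msg.sequence_id <= last:
            return False               # duplicate or replayed message, drop it
        self._last_seen[msg.producer_name] = msg.sequence_id
        return True


dedup = Deduplicator()
first = Message(b"temp=21.5", producer_name="sensor-a", sequence_id=0, key="room-1")
retry = Message(b"temp=21.5", producer_name="sensor-a", sequence_id=0, key="room-1")
results = [dedup.accept(first), dedup.accept(retry)]
print(results)
```

The retried message carries the same producer name and sequence ID as the first one, so the sketch rejects it.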
  8. Connectivity • Libraries - (Java, Python, Go, NodeJS, WebSockets, C++, C#, Scala, Rust,...) • Functions - Lightweight Stream Processing (Java, Python, Go) • Connectors - Sources & Sinks (Cassandra, Kafka, …) • Protocol Handlers - AoP (AMQP), KoP (Kafka), MoP (MQTT) • Processing Engines - Flink, Spark, Presto/Trino via Pulsar SQL, NiFi • Data Offloaders - Tiered Storage - (S3) hub.streamnative.io
  9. Apache NiFi Pulsar Connector https://github.com/streamnative/pulsar-nifi-bundle
  10. Apache NiFi is a GUI-based Data Flow tool that runs anywhere.
  11. Why NiFi • Enables easy ingestion, routing, management, and delivery of any data anywhere (edge, cloud, data center) to any downstream system, with built-in end-to-end security and provenance • Over 350 prebuilt processors • Easy to build your own • Parse, enrich & apply schema • Filter, split, merge & route • Throttle & backpressure • Guaranteed delivery • Full data provenance • Ecosystem integration • Advanced tooling to industrialize flow development (Flow Development Life Cycle) Diagram stages: ACQUIRE, PROCESS, DELIVER. Acquire and deliver labels: FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG. Process labels: HASH, MERGE, EXTRACT, DUPLICATE, SPLIT, ROUTE TEXT, ROUTE CONTENT, ROUTE CONTEXT, CONTROL RATE, DISTRIBUTE LOAD, GEOENRICH, SCAN, REPLACE, TRANSLATE, CONVERT, ENCRYPT, TAIL, EVALUATE, EXECUTE
  12. Apache NiFi Capabilities: Data Ingest, Data Transformation, Data Enrichment. Diagram labels: HTTP, Syslog, HL7, UDP, SFTP, MQTT, WS, Hash, Compress, Merge, Duplicate, Split, Encrypt, Syslog, REST, MapCache, Enrich IP, GeoIP, XML
  13. Flow Development Lifecycle
  14. FLiP Stack Weekly This week in Apache Flink, Apache Pulsar, Apache NiFi, Apache Spark and open source friends. https://bit.ly/32dAJft
  15. Demo
  16. Streaming NFTs NFTs NFT Thermal Aggregates Status APIs CRYPTO FEEDS Weather NFT++
  17. Streaming NFTs https://opensea.io/collection/tspannhw-collection
  18. Key Takeaways ● Real-time ingest and data manipulation ● Easy to use/configure processors and controller services ● Multiple ways to connect
  19. Resources ● Learn more about NiFi + Pulsar integration: https://streamnative.io/apache-nifi-connector/ ● GitHub: https://github.com/tspannhw/awesome-nifi-pulsar ● Blog post on Apache Pulsar + NiFi integration: https://hubs.ly/Q015PNMd0
  20. Additional Resources Doug Cohen, Head of Sales, StreamNative ● StreamNative: Pulsar-as-a-Service ● AWS Certified Associate Solutions Architect ● Reach me at doug@streamnative.io
  21. Let’s Keep in Touch! Tim Spann, Developer Advocate: @PaaSDev, linkedin.com/in/timothyspann, github.com/tspannhw John Kuchmek, Principal Solutions Engineer: @K_Physics, linkedin.com/in/jkuchmek, github.com/johnkuch
  22. Pulsar Subscription Modes Different subscription modes have different semantics: ● Exclusive/Failover: guaranteed order, single active consumer ● Shared: multiple active consumers, no ordering guarantee ● Key_Shared: multiple active consumers, order preserved for a given key Diagram: Producer 1 and Producer 2 publish to a Pulsar topic with four subscriptions. Subscription A (Exclusive): Consumer A. Subscription B (Failover): Consumer B-1 and Consumer B-2, with B-2 used in case of failure in Consumer B-1. Subscription C (Shared): Consumer C-1 and Consumer C-2, one receiving <K1,V10>, <K2,V21>, <K1,V12> and the other <K2,V20>, <K1,V11>, <K2,V22>. Subscription D (Key_Shared): Consumer D-1 and Consumer D-2, one receiving <K1,V10>, <K1,V11>, <K1,V12> and the other <K2,V20>, <K2,V21>, <K2,V22>.
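
The difference between Shared and Key_Shared delivery can be illustrated with a small simulation. This is a sketch under stated assumptions, not how the broker is implemented: `dispatch_shared` and `dispatch_key_shared` are hypothetical helpers, round-robin stands in for Shared dispatch, and a CRC32 hash modulo the consumer count stands in for whatever key-to-consumer mapping the broker actually uses.

```python
import zlib
from collections import defaultdict
from itertools import cycle
from typing import Dict, List, Tuple

Msg = Tuple[str, int]  # (key, value)


def dispatch_shared(messages: List[Msg], consumers: List[str]) -> Dict[str, List[Msg]]:
    """Shared: messages go round-robin, so one key may land on several consumers."""
    out: Dict[str, List[Msg]] = defaultdict(list)
    for consumer, msg in zip(cycle(consumers), messages):
        out[consumer].append(msg)
    return dict(out)


def dispatch_key_shared(messages: List[Msg], consumers: List[str]) -> Dict[str, List[Msg]]:
    """Key_Shared: a stable hash of the key picks the consumer (assumed mapping)."""
    out: Dict[str, List[Msg]] = defaultdict(list)
    for key, value in messages:
        index = zlib.crc32(key.encode("utf-8")) % len(consumers)
        out[consumers[index]].append((key, value))
    return dict(out)


messages = [("K1", 10), ("K1", 11), ("K2", 20), ("K1", 12), ("K2", 21), ("K2", 22)]
shared = dispatch_shared(messages, ["C-1", "C-2"])
key_shared = dispatch_key_shared(messages, ["D-1", "D-2"])
print("Shared:", shared)
print("Key_Shared:", key_shared)
```

In the Shared result the messages for K1 are split across both consumers, while in the Key_Shared result every message for a given key stays on one consumer in publish order.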
  23. Schema Registry ● The Schema Registry stores schema-1, schema-2 and schema-3 (value = Avro/Protobuf/JSON). ● Producers and consumers each keep a schema data ID and a local cache for schemas. ● Producers: send (register) the schema if it is not in the local cache, then send schema-1 (value = Avro/Protobuf/JSON) data serialized per schema ID. ● Consumers: get the schema by ID if it is not in the local cache, then read schema-1 (value = Avro/Protobuf/JSON) data deserialized per schema ID.
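
The register-once, cache-locally pattern on this slide can be sketched in plain Python. Everything here is hypothetical and in-memory: `SchemaRegistry`, `CachingConsumer`, and the JSON-based "schema" are illustrative stand-ins, not Pulsar client classes or a real Avro/Protobuf schema.

```python
import json
from typing import Dict


class SchemaRegistry:
    """Hypothetical in-memory registry that assigns an ID to each distinct schema."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self._by_id: Dict[int, str] = {}
        self.lookups = 0                     # counts remote "get schema by ID" calls

    def register(self, schema: str) -> int:
        if schema not in self._ids:
            schema_id = len(self._ids) + 1
            self._ids[schema] = schema_id
            self._by_id[schema_id] = schema
        return self._ids[schema]

    def get(self, schema_id: int) -> str:
        self.lookups += 1
        return self._by_id[schema_id]


class CachingConsumer:
    """Fetches a schema by ID only when it is missing from the local cache."""

    def __init__(self, registry: SchemaRegistry) -> None:
        self._registry = registry
        self._cache: Dict[int, str] = {}

    def read(self, schema_id: int, payload: bytes) -> dict:
        if schema_id not in self._cache:
            self._cache[schema_id] = self._registry.get(schema_id)
        fields = json.loads(self._cache[schema_id])["fields"]
        record = json.loads(payload)
        return {name: record[name] for name in fields}  # keep only schema fields


registry = SchemaRegistry()
schema = json.dumps({"name": "Reading", "fields": ["sensor", "temp"]})
schema_id = registry.register(schema)

consumer = CachingConsumer(registry)
payload = json.dumps({"sensor": "a", "temp": 21.5, "extra": 1}).encode("utf-8")
decoded = [consumer.read(schema_id, payload) for _ in range(3)]
print(decoded[0], "registry lookups:", registry.lookups)
```

Three reads trigger a single registry lookup, which is the point of the local cache.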
  24. ● Buffer ● Batch ● Route ● Filter ● Aggregate ● Enrich ● Replicate ● Dedupe ● Decouple ● Distribute
  25. Pulsar: Unified Messaging + Data Streaming ● Messaging: Ideal for work queues that do not require tasks to be performed in a particular order, for example, sending one email message to many recipients. RabbitMQ and Amazon SQS are examples of popular queue-based message systems.
  26. Pulsar: Unified Messaging + Data Streaming ● Messaging: Ideal for work queues that do not require tasks to be performed in a particular order, for example, sending one email message to many recipients. RabbitMQ and Amazon SQS are examples of popular queue-based message systems. ● ... and Streaming: Works best in situations where the order of messages is important, for example, data ingestion. Kafka and Amazon Kinesis are examples of messaging systems that use streaming semantics for consuming messages.
  27. Pulsar Instance Pulsar Cluster Pulsar Instance Pulsar Cluster
  28. A Unified Messaging Platform Message Queuing Data Streaming
  29. Topics Hierarchy: Pulsar Instance > Pulsar Cluster > Tenants > Namespaces > Topics. Tenants: Compliance, Data Services, Marketing. Namespaces: Microservices, ETL, Campaigns, ETL, Risk Assessment. Topics: Topic-1 (Cust Auth), Topic-1 (Location Resolution), Topic-2 (Demographics), Topic-1 (Budgeted Spend), Topic-1 (Acct History), Topic-1 (Risk Detection).
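
The tenant, namespace and topic hierarchy shows up directly in Pulsar topic names, which take the form `persistent://tenant/namespace/topic`. That naming form comes from Pulsar's documentation rather than this slide. The sketch below is a minimal parser under that assumption; `TopicName` and `parse_topic` are hypothetical helpers, and the example name pairs a tenant, namespace and topic from the slide purely for illustration, since the slide text does not say which topic belongs to which namespace.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    """Hypothetical holder for the parts of a fully qualified topic name."""
    domain: str      # "persistent" or "non-persistent"
    tenant: str
    namespace: str
    topic: str

    def __str__(self) -> str:
        return f"{self.domain}://{self.tenant}/{self.namespace}/{self.topic}"


def parse_topic(name: str) -> TopicName:
    """Split 'domain://tenant/namespace/topic' into its four parts."""
    domain, separator, rest = name.partition("://")
    if not separator or domain not in ("persistent", "non-persistent"):
        raise ValueError(f"unsupported topic domain in {name!r}")
    parts = rest.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in {name!r}")
    return TopicName(domain, *parts)


# Illustrative pairing of names from the slide, not a mapping the slide states.
parsed = parse_topic("persistent://marketing/campaigns/budgeted-spend")
print(parsed.tenant, parsed.namespace, parsed.topic)
print(str(parsed))
```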
  30. Pulsar’s Publish-Subscribe model ● Producers send messages. ● A topic is an ordered, named channel that producers use to transmit messages to subscribed consumers. ● Messages belong to a topic and contain an arbitrary payload. ● Brokers handle connections and route messages between producers and consumers. ● Subscriptions are named configuration rules that determine how messages are delivered to consumers. ● Consumers receive messages. Diagram labels: Producer 1, Producer 2, Topic, Broker, Subscription, Consumer 1, Consumer 2, Consumer 3.
  31. Producer-Consumer ● Producer: the publisher sends data and doesn't know about the subscribers or their status. ● Pulsar: all interactions go through Pulsar, which handles all communication. ● Consumer: the subscriber receives data from the publisher and never interacts with it directly. Diagram labels: Producer, Topic, Consumer.
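
The roles in the publish-subscribe model can be sketched with an in-memory topic where each subscription keeps its own read position. This is an illustrative toy, not broker code: `Topic` is a hypothetical class, new subscriptions are assumed to start at the end of the log, and receiving a message advances the cursor immediately, which collapses receive and acknowledge into one step.

```python
from typing import Dict, List, Optional


class Topic:
    """Hypothetical in-memory stand-in for a broker-managed topic."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._log: List[bytes] = []           # ordered messages
        self._cursors: Dict[str, int] = {}    # subscription name -> next index

    def subscribe(self, subscription: str) -> None:
        # Assumption: a new subscription starts at the current end of the log.
        self._cursors.setdefault(subscription, len(self._log))

    def publish(self, payload: bytes) -> None:
        self._log.append(payload)

    def receive(self, subscription: str) -> Optional[bytes]:
        position = self._cursors[subscription]
        if position >= len(self._log):
            return None                       # nothing new for this subscription
        self._cursors[subscription] = position + 1
        return self._log[position]


topic = Topic("cust-auth")
topic.subscribe("audit")
topic.subscribe("billing")
topic.publish(b"login:alice")
topic.publish(b"login:bob")

audit_view = [topic.receive("audit") for _ in range(3)]
billing_view = [topic.receive("billing")]
print(audit_view, billing_view)
```

Each subscription sees the full ordered stream at its own pace, and the producer never learns which subscriptions exist.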
  32. Kafka On Pulsar (KoP)
  33. MQTT On Pulsar (MoP)
  34. Pulsar Functions: a serverless event streaming framework ● Lightweight computation similar to AWS Lambda. ● Specifically designed to use Apache Pulsar as a message bus. ● Function runtime can be located within the Pulsar broker.
  35. Pulsar Functions ● Consume messages from one or more Pulsar topics. ● Apply user-supplied processing logic to each message. ● Publish the results of the computation to another topic. ● Support multiple programming languages (Java, Python, Go). ● Can leverage 3rd-party libraries to support the execution of ML models on the edge.
  36. Pulsar SQL Presto/Trino workers can read segments directly from bookies (or offloaded storage) in parallel. Diagram: a producer and a consumer connect to Broker 1 (Topic1-Part1), Broker 2 (Topic1-Part2) and Broker 3 (Topic1-Part3); Segments 1 through X are stored across Bookie 1, Bookie 2 and Bookie 3; a query coordinator reads topic metadata and distributes the query to SQL workers.
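
The consume, process, publish loop described here can be sketched as a Python function. Pulsar's Python runtime accepts a plain function named `process` that takes the input and returns the output; the rest of this example is an assumption for illustration: the sensor payload shape is invented, and `run_locally` is a hypothetical driver that stands in for the function runtime so the logic can be tried without a cluster.

```python
import json
from typing import Callable, List


def process(input):
    """Per-message logic: enrich a JSON sensor reading with a Fahrenheit value."""
    record = json.loads(input)
    record["temp_f"] = round(record["temp_c"] * 9 / 5 + 32, 1)
    return json.dumps(record, sort_keys=True)


def run_locally(function: Callable[[str], str], input_messages: List[str]) -> List[str]:
    """Hypothetical stand-in for the runtime: apply the function to each message."""
    return [function(message) for message in input_messages]


input_topic = ['{"sensor": "a", "temp_c": 21.5}', '{"sensor": "b", "temp_c": 0}']
output_topic = run_locally(process, input_topic)
for message in output_topic:
    print(message)
```

Keeping the logic in a plain function makes it easy to unit test before deploying it as a function on a cluster.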
  37. Use Cases Multi-Tenant Data Infrastructure AdTech Fraud Detection Connected Car IoT Analytics Data Lake Hydration
  38. Apache NiFi
  39. Apache NiFi Pulsar Connector https://github.com/streamnative/pulsar-nifi-bundle
  40. Apache NiFi Pulsar Connector https://github.com/david-streamlio/pulsar-nifi-bundle
  41. Apache NiFi Pulsar Connector
  42. Apache NiFi Pulsar Connector
  43. Apache NiFi Pulsar Connector
  44. StreamNative Cloud
  45. Passionate and dedicated team. Founded by the original developers of Apache Pulsar. StreamNative helps teams to capture, manage, and leverage data using Pulsar’s unified messaging and streaming platform.
  46. Founded By The Creators Of Apache Pulsar ● Sijie Guo: ASF Member, Pulsar/BookKeeper PMC, Founder and CEO ● Jia Zhai: Pulsar/BookKeeper PMC, Co-Founder ● Matteo Merli: ASF Member, Pulsar/BookKeeper PMC, CTO Data veterans with extensive industry experience
  47. REST Feed Non-Fungible Token {"date":"Thu, 24 Feb 2022 22:26:41 GMT","short_description":"","featured":"false","image_thumbnail_url":"htt","asset_contract_created_date":"2022-02-17T15:48:44.822206","asset_contract_owner":"50299352","image_preview_url":"https://lh3.googleus","asset_contract_symbol":"TD","twitter_username":"","description":"10,000metaverse-readyAvatars","asset_contract_address":"0xc7df86762ba83f2a6197e1ff9bb40ae0f696b9e6","external_url":"https://www.sandbox.game/en/snoopdogg/","token_id":"492","asset_contract_name":"Theoggies","asset_contract_nft_version":"3.0","asset_contract_description":"metaverse.","asset_contract_external_link":"https://www.sandbox.game/en/snoopdogg/","id":"307922619","featured_image_url":"https","slug":"snoop-dogg-doggies","token_metadata":"https://contracts.sandbox.game/unrevealed.json?tokenId=492","asset_contract_schema_name":"ERC721","animation_url":"https","num_sales":"1","image_url":"https://lh","asset_contract_default_to_fiat":"false","external_link":"","image_original_url":"https://contracts.sandbox.game/preview.png","asset_contract_payout_address":"0x4489590a116618b506f0efe885432f6a8ed998e9","animation_original_url":"https://con","background_color":"","asset_contract_asset_contract_type":"non-fungible","name":"The Doggies","asset_contract_image_url":"https","asset_contract_total_supply":"0"} https://docs.opensea.io/reference/retrieving-bundles
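
Every value in this feed record arrives as a string, including counts and booleans, so a flow usually normalizes types before writing to a data lake. The following is a minimal sketch of that step in Python, assuming a trimmed-down record that reuses a few field names and values from the slide; `normalize` and the choice of output fields are hypothetical and not part of the demo flow.

```python
import json
from email.utils import parsedate_to_datetime

# Trimmed sample using field names and values from the slide's record.
sample = (
    '{"date":"Thu, 24 Feb 2022 22:26:41 GMT","name":"The Doggies",'
    '"token_id":"492","num_sales":"1","featured":"false",'
    '"asset_contract_schema_name":"ERC721"}'
)


def normalize(raw: str) -> dict:
    """Cast string-typed feed fields to numbers, booleans and ISO timestamps."""
    record = json.loads(raw)
    return {
        "name": record["name"],
        "token_id": int(record["token_id"]),
        "num_sales": int(record["num_sales"]),
        "featured": record["featured"].lower() == "true",
        "schema": record["asset_contract_schema_name"],
        # The feed uses an HTTP-style date; convert it to ISO 8601.
        "date": parsedate_to_datetime(record["date"]).isoformat(),
    }


clean = normalize(sample)
print(clean)
```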
  48. Diagram: Apache Pulsar - Apache NiFi <-> Events <-> Cloud Data Stores. Labels: StreamNative Hub; StreamNative Cloud; Unified Batch and Stream Storage; Offload; Tiered Storage; Pulsar (Queuing + Streaming); Data Gateway Protocols: Pulsar, KoP, MoP, Websocket, HTTP; Pulsar Sink; Data to Cloud Data Lake; Micro Service.
  49. StreamNative Cloud Tiered Storage
  50. (Queuing + Streaming)
  51. (Queuing + Streaming) Tiered Storage (Queuing + Streaming)