Anúncio
Anúncio

Mais conteúdo relacionado

Similar a All Day DevOps - FLiP Stack for Cloud Data Lakes(20)

Mais de Timothy Spann(20)

Anúncio

All Day DevOps - FLiP Stack for Cloud Data Lakes

  1. TRACK: MODERN INFRASTRUCTURE NOVEMBER 10, 2022 Timothy Spann, StreamNative FLiP Stack for Cloud Data Lakes
  2. TRACK: MODERN INFRASTRUCTURE Tim Spann Developer Advocate Tim Spann, Developer Advocate at StreamNative ● FLiP(N) Stack = Flink, Pulsar and NiFi Stack ● Streaming Systems & Data Architecture Expert ● Experience: ○ 15+ years of experience with streaming technologies including Pulsar, Flink, Spark, NiFi, Big Data, Cloud, MXNet, IoT and more. ○ Today, he helps to grow the Pulsar community sharing rich technical knowledge and experience at both global conferences and through individual conversations.
  3. TRACK: MODERN INFRASTRUCTURE FLiP Stack Weekly This week in Apache Flink, Apache Pulsar, Apache NiFi, Apache Spark and open source friends. https://bit.ly/32dAJft
  4. TRACK: MODERN INFRASTRUCTURE
  5. TRACK: MODERN INFRASTRUCTURE Apache Pulsar Serverless computing framework. Unbounded storage, multi-tiered architecture, and tiered-storage. Streaming & Pub/Sub messaging semantics. Multi-protocol support. Open Source Cloud-Native
  6. TRACK: MODERN INFRASTRUCTURE Why Apache Pulsar? Unified Messaging Platform Guaranteed Message Delivery Resiliency Infinite Scalability
  7. TRACK: MODERN INFRASTRUCTURE Unified Messaging Model Simplify your data infrastructure and enable new use cases with queuing and streaming capabilities in one platform. Multi-tenancy Enable multiple user groups to share the same cluster, either via access control, or in entirely different namespaces. Scalability Decoupled data computing and storage enable horizontal scaling to handle data scale and management complexity. Geo-replication Support for multi-datacenter replication with both asynchronous and synchronous replication for built-in disaster recovery. Tiered storage Enable historical data to be offloaded to cloud-native storage and store event streams for indefinite periods of time. Pulsar Benefits
  8. TRACK: MODERN INFRASTRUCTURE Messaging Ideal for work queues that do not require tasks to be performed in a particular order—for example, sending one email message to many recipients. RabbitMQ and Amazon SQS are examples of popular queue-based message systems. Pulsar: Unified Messaging + Data Streaming
  9. TRACK: MODERN INFRASTRUCTURE Messaging Ideal for work queues that do not require tasks to be performed in a particular order—for example, sending one email message to many recipients. RabbitMQ and Amazon SQS are examples of popular queue-based message systems. .. and Streaming Works best in situations where the order of messages is important—for example, data ingestion. Kafka and Amazon Kinesis are examples of messaging systems that use streaming semantics for consuming messages. Pulsar: Unified Messaging + Data Streaming
  10. TRACK: MODERN INFRASTRUCTURE Tenants (Compliance) Tenants (Data Services) Namespace (Microservices) Topic-1 (Cust Auth) Topic-1 (Location Resolution) Topic-2 (Demographics) Topic-1 (Budgeted Spend) Topic-1 (Acct History) Topic-1 (Risk Detection) Namespace (ETL) Namespace (Campaigns) Namespace (ETL) Tenants (Marketing) Namespace (Risk Assessment) Pulsar Cluster
  11. TRACK: MODERN INFRASTRUCTURE Pulsar’s Publish-Subscribe model Broker Subscription Consumer 1 Consumer 2 Consumer 3 Topic Producer 1 Producer 2 ● Producers send messages. ● Topics are an ordered, named channel that producers use to transmit messages to subscribed consumers. ● Messages belong to a topic and contain an arbitrary payload. ● Brokers handle connections and routes messages between producers / consumers. ● Subscriptions are named configuration rules that determine how messages are delivered to consumers. ● Consumers receive messages.
  12. TRACK: MODERN INFRASTRUCTURE Pulsar Subscription Modes Different subscription modes have different semantics: Exclusive/Failover - guaranteed order, single active consumer Shared - multiple active consumers, no order Key_Shared - multiple active consumers, order for given key Producer 1 Producer 2 Pulsar Topic Subscription D Consumer D-1 Consumer D-2 Key-Shared < K 1 ,V 1 0 > < K 1 ,V 1 1 > < K 1 ,V 1 2 > < K 2 ,V 2 0 > < K 2 ,V 2 1 > < K 2 ,V 2 2 > Subscription C Consumer C-1 Consumer C-2 Shared < K 1 ,V 1 0 > < K 2 ,V 2 1 > < K 1 ,V 1 2 > < K 2 ,V 2 0 > < K 1 ,V 1 1 > < K 2 ,V 2 2 > Subscription A Consumer A Exclusive Subscription B Consumer B-1 Consumer B-2 In case of failure in Consumer B-1 Failover
  13. TRACK: MODERN INFRASTRUCTURE Producer-Consumer Producer Consumer Publisher sends data and doesn't know about the subscribers or their status. All interactions go through Pulsar and it handles all communication. Subscriber receives data from publisher and never directly interacts with it Topic Topic
  14. TRACK: MODERN INFRASTRUCTURE Schema Registry Schema Registry schema-1 (value=Avro/Protobuf/JSON) schema-2 (value=Avro/Protobuf/JSON) schema-3 (value=Avro/Protobuf/JSON) Schema Data ID Local Cache for Schemas + Schema Data ID + Local Cache for Schemas Send schema-1 (value=Avro/Protobuf/JSON) data serialized per schema ID Send (register) schema (if not in local cache) Read schema-1 (value=Avro/Protobuf/JSON) data deserialized per schema ID Get schema by ID (if not in local cache) Producers Consumers
  15. TRACK: MODERN INFRASTRUCTURE Use Pulsar to Stream to Lakehouses
  16. TRACK: MODERN INFRASTRUCTURE Use Pulsar to Stream from Lakehouses
  17. TRACK: MODERN INFRASTRUCTURE • Guaranteed delivery • Data buffering - Backpressure - Pressure release • Prioritized queuing • Flow specific QoS - Latency vs. throughput - Loss tolerance • Data provenance • Supports push and pull models • Hundreds of processors • Visual command and control • Over a 300 sources • Flow templates • Pluggable/multi-role security • Designed for extension • Clustering • Version Control Why Apache NiFi?
  18. TRACK: MODERN INFRASTRUCTURE Apache NiFi Pulsar Connector https://github.com/david-streamlio/pulsar-nifi-bundle
  19. TRACK: MODERN INFRASTRUCTURE Apache NiFi - Data Lineage / Provenance
  20. TRACK: MODERN INFRASTRUCTURE https://www.datainmotion.dev/2021/01/automating-starting-services-in-apache.html https://nipyapi.readthedocs.io/en/latest/ nifi-toolkit/bin/cli.sh nifi list-param-contexts -u http:/ /edge2ai-1.dim.local:8080 nifi-toolkit/bin/cli.sh nifi pg-list -u http:/ /edge2ai-1.dim.local:8080 nifi-toolkit/bin/cli.sh nifi pg-set-param-context -u http:/ /edge2ai-1.dim.local:8080 ... Apache NiFi DevOps Apache NiFi DevOps
  21. TRACK: MODERN INFRASTRUCTURE https://dev.to/tspannhw/automating-starting-services-in-apache-nifi-and-applying-parameters-5h4n https://github.com/tspannhw/ApacheConAtHome2020/blob/main/scripts/setupnifi.sh nifi pg-list nifi pg-status nifi pg-get-services nifi pg-enable-services -u http:/ /edge2ai-1.dim.local:8080 --processGroupId root nifi pg-start -u http:/ /edge2ai-1.dim.local:8080 -pgid LOOKTHISUP nifi list-param-contexts -u http:/ /edge2ai-1.dim.local:8080 -verbose nifi create-reporting-task -u http:/ /edge2ai-1.dim.local:8080 -verbose -i Apache NiFi DevOps
  22. TRACK: MODERN INFRASTRUCTURE
  23. TRACK: MODERN INFRASTRUCTURE Download NiFi Toolkit Copy keystore and truststore information from your NiFi conf/nifi.properties Create a nifi.properties file linked to the cli.sh baseUrl=https://nvidia-desktop:8443 keystore=/home/nvidia/nvme/nifi-1.15.3/conf/keystore.p12 keystoreType=PKCS12 keystorePasswd=5325343412efaab3123c6892d93 keyPasswd=53134eee99da9dbe9349123aa17c6892d93 truststore=/home/nvidia/nvme/nifi-1.15.3/conf/truststore.p12 truststoreType=PKCS12 truststorePasswd=93498Dfdjfhujdhure8d8hfd84j3n43jd Apache NiFi Toolkit Setup
  24. TRACK: MODERN INFRASTRUCTURE ● Unified computing engine ● Batch processing is a special case of stream processing ● Stateful processing ● Massive Scalability ● Flink SQL for queries, inserts against Pulsar Topics ● Streaming Analytics ● Continuous SQL ● Continuous ETL ● Complex Event Processing ● Standard SQL Powered by Apache Calcite Why Apache Flink?
  25. TRACK: MODERN INFRASTRUCTURE SQL select aqi, parameterName, dateObserved, hourObserved, latitude, longitude, localTimeZone, stateCode, reportingArea from airquality; select max(aqi) as MaxAQI, parameterName, reportingArea from airquality group by parameterName, reportingArea; select max(aqi) as MaxAQI, min(aqi) as MinAQI, avg(aqi) as AvgAQI, count(aqi) as RowCount, parameterName, reportingArea from airquality group by parameterName, reportingArea;
  26. TRACK: MODERN INFRASTRUCTURE StreamNative Hub StreamNative Cloud Unified Batch and Stream COMPUTING Batch (Batch + Stream) Unified Batch and Stream STORAGE Offload (Queuing + Streaming) Apache Flink - Apache Pulsar - Apache NiFi <-> Events <-> Cloud Data Stores Tiered Storage Pulsar --- KoP --- MoP --- Websocket --- HTTP Pulsar Sink Pulsar Sink Streaming Data Gateway Protocols Data to Cloud Data Lake Micro Service
  27. TRACK: MODERN INFRASTRUCTURE Monitoring and Metrics Check curl http://localhost:8080/admin/v2/persistent/conf/ete/first/stats | python3 -m json.tool bin/pulsar-admin topics stats-internal persistent://conf/ete/first curl http://pulsar1:8080/metrics/ bin/pulsar-admin topics stats-internal persistent://conf/ete/first bin/pulsar-admin topics peek-messages --count 5 --subscription ete-reader persistent://conf/ete/first bin/pulsar-admin topics subscriptions persistent://conf/ete/first
  28. TRACK: MODERN INFRASTRUCTURE Cleanup bin/pulsar-admin topics delete persistent://conf/ete/first bin/pulsar-admin namespaces delete conf/ete bin/pulsar-admin tenants delete conf
  29. TRACK: MODERN INFRASTRUCTURE Metrics: Broker Broker metrics are exposed under "/metrics" at port 8080. You can change the port by updating webServicePort to a different port in the broker.conf configuration file. All the metrics exposed by a broker are labeled with cluster=${pulsar_cluster}. The name of Pulsar cluster is the value of ${pulsar_cluster}, configured in the broker.conf file. These metrics are available for brokers: ● Namespace metrics ○ Replication metrics ● Topic metrics ○ Replication metrics ● ManagedLedgerCache metrics ● ManagedLedger metrics ● LoadBalancing metrics ○ BundleUnloading metrics ○ BundleSplit metrics ● Subscription metrics ● Consumer metrics ● ManagedLedger bookie client metrics For more information: https://pulsar.apache.org/docs/en/reference-metrics/#broker
  30. TRACK: MODERN INFRASTRUCTURE Let’s Keep in Touch! Tim Spann Developer Advocate @PaaSDev https://www.linkedin.com/in/timothyspann https://github.com/tspannhw
  31. TRACK: MODERN INFRASTRUCTURE
Anúncio