1. Apache Kafka Meetup Japan #7
Kafka on Azure
~ Let's try the managed Kafka services offered by Microsoft Azure ~
SATO Naoki / 佐藤 直生
Azure Technologist / Cloud Solution Architect, Microsoft
Twitter @satonaoki / https://satonaoki.wordpress.com/
3. © Microsoft Corporation
AI built-in | Most secure | Lowest TCO
Data warehouses
Data lakes
Operational databases
Industry leader 4 years in a row
#1 TPC-H performance
T-SQL query over any data
70 percent faster than Aurora
More global reach than any other
No Limits and 99.9 percent SLA
Easiest lift and shift
with no code changes
The Microsoft offering
SQL Server
Hybrid
Azure Data Services
Security and performance | Flexibility of choice | Reason over any data, anywhere
Social | LOB | Graph | IoT | Image | CRM
4. The Azure data landscape
Azure Data Factory, Azure Import/Export service, Azure SDK, Azure CLI
Cognitive Services, Bot service, Azure Search, Azure Data Catalog
Azure ExpressRoute, Azure network security groups, Azure Functions, Visual Studio, Operations Management Suite
Azure Active Directory, Azure key management service
Azure Blob Storage, Azure Data Lake Store
Azure IoT Hub, Azure Event Hubs, Kafka on Azure HDInsight
Azure SQL Data Warehouse, Azure SQL DB, Azure Cosmos DB, Azure Analysis Services, Power BI
Azure Data Lake Analytics, Azure HDInsight, Azure Databricks
Azure Stream Analytics
Azure ML, ML Server
5. The Azure Big Data landscape
[Diagram: same Azure service map as slide 4]
6. Solution scenarios: Big Data and advanced analytics
Modern data warehousing: “We want to integrate all our data, including Big Data, with our data warehouse”
Advanced analytics: “We’re trying to predict when our customers churn”
Real-time analytics: “We’re trying to get insights from our devices in real-time”
7. Real-time analytics
Real-time analytics, also called stream analytics, is the practice of processing data as soon as it is generated, so that analysis and insight arrive quickly enough for timely action.
10. Big Data streaming pattern with Azure
Data sources: Sensors and IoT (unstructured); Business/custom apps (structured); Logs, files, and media (unstructured)
Stream ingestion: Event Hubs, IoT Hub, Kafka on HDInsight
Stream analytics: Azure Stream Analytics, Storm on HDInsight, Azure Databricks (Spark Streaming)
Machine learning: Azure ML Studio, R Server, Azure Databricks (Spark ML)
Long-term storage: Data Lake Store, SQL DB, Cosmos DB, Azure Blob Storage
Real-time applications
Real-time dashboards
Power BI
12. Apache Kafka on HDInsight
An open-source, scalable stream ingestion platform offered as a managed service on Azure HDInsight
Azure is the only public cloud to offer Apache Kafka as a managed service
Can be provisioned directly from the Azure Portal
Apache Kafka is one of the HDInsight cluster types
Clusters can be scaled within minutes
99.9 percent SLA
No additional charge for running Kafka clusters
Out-of-box management using Azure Monitor Logs
13. Provisioning Apache Kafka on HDInsight
A typical HDInsight Kafka cluster consists of:
Three or more worker nodes (at least three for data high availability)
Two head nodes (for redundancy)
Three ZooKeeper nodes
Kafka is I/O heavy, so Azure Managed Disks are used for high throughput and more storage per node
Apache Kafka on HDInsight clusters with managed disks can be deployed straight from the Azure Portal
Disks or nodes can be configured during HDInsight cluster creation, with up to 16 TB per node
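The sizing figures on this slide lend themselves to a quick back-of-the-envelope check. The sketch below is illustrative only: the 16 TB per-node limit and the three-worker minimum come from the slide, while the replication factor of 3 is an assumption (a common Kafka setting, not stated in the deck), and both function names are hypothetical.

```python
import math

MAX_DISK_TB_PER_NODE = 16  # upper limit per worker node, from the slide
MIN_WORKER_NODES = 3       # minimum for data high availability, from the slide


def usable_capacity_tb(worker_nodes, disk_tb_per_node=MAX_DISK_TB_PER_NODE,
                       replication_factor=3):
    """Approximate amount of unique data the cluster can retain, in TB.

    replication_factor=3 is an assumption: every record is stored three times,
    so only a third of the raw disk space holds unique data.
    """
    if worker_nodes < MIN_WORKER_NODES:
        raise ValueError("at least three worker nodes are required")
    if not 0 < disk_tb_per_node <= MAX_DISK_TB_PER_NODE:
        raise ValueError("disk size must be between 0 and 16 TB per node")
    raw_tb = worker_nodes * disk_tb_per_node
    return raw_tb / replication_factor


def workers_needed(retained_tb, disk_tb_per_node=MAX_DISK_TB_PER_NODE,
                   replication_factor=3):
    """Smallest worker count whose raw disk space covers the replicated data."""
    raw_needed_tb = retained_tb * replication_factor
    return max(MIN_WORKER_NODES, math.ceil(raw_needed_tb / disk_tb_per_node))


print(usable_capacity_tb(3))   # three fully provisioned workers
print(workers_needed(100))     # workers needed to retain 100 TB of unique data
```

Under these assumptions, three workers with 16 TB each retain about 16 TB of unique data; real sizing would also need headroom for growth and for broker failures.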
14. Kafka for Azure HDInsight
• Managed Kafka clusters with a 99.9% SLA
• Native integration with Azure Managed Disks, which allows for exponentially lower costs and higher scale
• Scalable on-demand clusters: Kafka clusters with 16 TB/node and ZooKeeper up and running in 15 minutes
• Rack awareness for Kafka on the Azure cloud
• Alerting and predictive cluster maintenance through Azure Monitor Logs
• Extensibility via one-click deployment of leading ISVs such as StreamSets
• Disaster recovery support via MirrorMaker
• Deploy end-to-end streaming pipelines with Storm, Spark, and Storage via automated ARM templates in the same VNet
15. Data Ingestion using Kafka on HDInsight
Kafka is a distributed, horizontally-scalable, fault-tolerant pub-sub store
[Diagram labels: Producer 1, IoT Hub; Broker 1, Broker 2, Broker 3, each holding partitions (1, 2, 3) of Topic 1 and Topic 2; ZK 1, ZK 2, ZK 3; Storm, Spark Streaming]
16. Kafka: Producers and Consumers
Set up the broker configuration
Publish the message
The consumer reads the messages
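The publish and read steps above can be sketched with a toy in-memory model. This is not the Kafka client API: `ToyTopic` and `ToyConsumer` are hypothetical names, and CRC32 stands in for Kafka's real key hashing. It only illustrates the idea that a topic is a set of append-only partitions, that records with the same key go to the same partition, and that a consumer keeps its own read offset per partition.

```python
import zlib
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for a topic: a fixed set of append-only partitions."""

    def __init__(self, name, num_partitions=3):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        """Append a record and return (partition, offset).

        CRC32 is an illustrative hash, not the one Kafka uses; the point is
        that equal keys always map to the same partition.
        """
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1


class ToyConsumer:
    """Keeps one read offset per partition and never modifies the log."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = defaultdict(int)

    def poll(self):
        """Return every record not yet seen and advance the offsets."""
        records = []
        for partition, log in enumerate(self.topic.partitions):
            records.extend(log[self.offsets[partition]:])
            self.offsets[partition] = len(log)
        return records


topic = ToyTopic("telemetry")
topic.publish("car-1", "speed=60")
topic.publish("car-1", "speed=65")

consumer = ToyConsumer(topic)
print(consumer.poll())  # both records for car-1, in publish order
print(consumer.poll())  # nothing new since the last poll
```

Because reading only moves the consumer's offsets, a second `ToyConsumer` on the same topic would see the full log again, which is the property that lets several downstream systems read the same stream independently.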
17. Choosing Apache Kafka on HDInsight: when Apache Kafka can be a good option
When you want… | Description
A proven ingestion service | Apache Kafka is the de facto leader in the Big Data stream ingestion space. It is used by the who's who of modern internet companies. The "Powered By" page on the Apache Kafka site lists companies using Apache Kafka.
A hybrid, multi-cloud solution with choice of deployment models | You can run Apache Kafka in multiple ways: on-premises, as a managed service on Azure, as an IaaS solution on Azure VMs, or even on other public clouds, including AWS and Google Cloud Platform.
An open-source solution | Kafka is an open-source product licensed under the Apache License 2.0. It is implemented in Java and Scala.
A highly reliable, fault-tolerant, scalable service | Kafka is reported to handle ingestion rates of 1.1 trillion messages a day at LinkedIn. Kafka is a horizontally scalable service: you can scale Apache Kafka on HDInsight by dynamically adding more nodes to the cluster.
Extensibility, with support for a large number of data sources and sinks | Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka. Pre-built connectors to a number of data sources are available. You can extend this list by building custom connectors.
18. Connected Car Architecture Powered by HDInsight
[Diagram labels: IoT Hubs; Azure Gateway Services; Open source Stream Processing on Azure HDInsight; Azure VNet Boundary; Long term storage; Real-time applications; Real-time dashboards]
19. Siphon on HDInsight Kafka
8 million events per second peak ingress
800 TB ingress per day (10 GB per sec)
1,800 production Kafka brokers; 450 topics
15 sec 99th percentile latency
Key customer scenarios:
Ads Monetization (Fast BI)
O365 Customer Fabric NRT – Tenant & User insights
BingNRT Operational Intelligence
Presto (Fast SML) interactive analysis
Delve Analytics
[Chart: Siphon Data Volume (Ingress and Egress); y-axis: throughput in GBps]
[Chart: Siphon Events per second (Ingress and Egress); y-axis: throughput in millions of events per sec]
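As a rough consistency check on the Siphon figures, the daily ingress volume can be converted into an average rate. The sketch below uses only the numbers quoted on the slide (800 TB per day, 1,800 brokers); decimal units (1 TB = 1000 GB) and an even spread of load across brokers are assumptions made for the sake of the arithmetic.

```python
SECONDS_PER_DAY = 24 * 60 * 60

INGRESS_TB_PER_DAY = 800    # from the slide
PRODUCTION_BROKERS = 1800   # from the slide


def average_gb_per_sec(tb_per_day):
    """Average rate in GB/s for a daily volume in TB (decimal units assumed)."""
    return tb_per_day * 1000 / SECONDS_PER_DAY


def average_mb_per_sec_per_broker(tb_per_day, brokers):
    """Average per-broker rate in MB/s, assuming load is spread evenly."""
    return average_gb_per_sec(tb_per_day) * 1000 / brokers


print(round(average_gb_per_sec(INGRESS_TB_PER_DAY), 2))
print(round(average_mb_per_sec_per_broker(INGRESS_TB_PER_DAY,
                                          PRODUCTION_BROKERS), 2))
```

The average works out to a little over 9 GB/s, which is consistent with the rounded "10 GB per sec" on the slide. The peak figure of 8 million events per second is a separate measurement and is not derived here.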
20. Apache Spark 2.4 and Apache Kafka 2.1 support on Azure HDInsight
https://azure.microsoft.com/updates/apache-spark-2-4-and-apache-kafka-2-1-support-on-azure-hdinsight/
22. Big Data streaming pattern with Azure
[Diagram: same streaming pattern as slide 10]
23. Azure Event Hubs: Scale and performance
A highly scalable, fully-managed telemetry ingestion service
✓ Input Capacity: 1 MB/s per TU*
✓ Output Capacity: 2 MB/s per TU*
✓ Latency: 50 ms avg, 99% < 100 ms
✓ Events/second: 1,000
✓ Max message size: 256 KB
*In Azure Event Hubs, capacity is purchased in throughput units (TU). Add TUs to increase capacity.
[Diagram labels: Event publisher; Event Hubs with three Partitions; three Readers; Event consumer]
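The per-TU limits above can be turned into a simple capacity estimate. The sketch below is an assumption-laden illustration, not an official sizing formula: it treats the 1,000 events/second figure as a per-TU ingress limit, assumes 1 MB = 1024 KB, and assumes every consumer group reads the full stream. The function name and parameters are hypothetical.

```python
import math

INGRESS_MB_PER_SEC_PER_TU = 1.0   # from the slide
EGRESS_MB_PER_SEC_PER_TU = 2.0    # from the slide
EVENTS_PER_SEC_PER_TU = 1000      # from the slide, assumed to be per TU
MAX_MESSAGE_KB = 256              # from the slide


def throughput_units_needed(events_per_sec, avg_event_kb, consumer_groups=1):
    """Estimate the TUs needed so that no per-TU limit is exceeded."""
    if avg_event_kb > MAX_MESSAGE_KB:
        raise ValueError("event exceeds the 256 KB maximum message size")
    ingress_mb = events_per_sec * avg_event_kb / 1024
    # Assumption: each consumer group reads every event once.
    egress_mb = ingress_mb * consumer_groups
    return max(
        1,
        math.ceil(ingress_mb / INGRESS_MB_PER_SEC_PER_TU),
        math.ceil(egress_mb / EGRESS_MB_PER_SEC_PER_TU),
        math.ceil(events_per_sec / EVENTS_PER_SEC_PER_TU),
    )


print(throughput_units_needed(5000, 1))                     # many small events
print(throughput_units_needed(500, 4, consumer_groups=3))   # egress-bound case
```

Whichever limit is hit first determines the result: many small events are bounded by the event count, while several consumer groups reading larger events push the egress limit first.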
24. Azure Event Hubs capabilities overview
Based on the concept of event producers and consumers
Producers send data to an event hub via AMQP 1.0 or HTTPS
Consumers read event data from an event hub via AMQP 1.0
SAS tokens identify and authenticate the event publisher
Data can be captured automatically in either Azure Blob Storage or Azure Data Lake Store (in Avro format)
Data is stored for 24 hours by default
84 GB storage included per throughput unit
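To make the SAS token step concrete, the sketch below builds a token using only the Python standard library. It follows the commonly documented shared access signature layout (an HMAC-SHA256 over the URL-encoded resource URI and an expiry time), but it should be checked against the official Event Hubs documentation before use. The namespace, hub, policy name, and key are placeholders, and `make_sas_token` is a hypothetical helper name.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse


def make_sas_token(resource_uri, key_name, key, ttl_seconds=3600, now=None):
    """Build a 'SharedAccessSignature ...' token string for a resource URI.

    `now` exists only so the result can be made deterministic in tests.
    """
    issued_at = time.time() if now is None else now
    expiry = str(int(issued_at + ttl_seconds))
    encoded_uri = urllib.parse.quote_plus(resource_uri)
    # The signed string is the encoded URI, a newline, then the expiry time.
    string_to_sign = (encoded_uri + "\n" + expiry).encode("utf-8")
    digest = hmac.new(key.encode("utf-8"), string_to_sign,
                      hashlib.sha256).digest()
    signature = urllib.parse.quote_plus(
        base64.b64encode(digest).decode("ascii"))
    return "SharedAccessSignature sr={}&sig={}&se={}&skn={}".format(
        encoded_uri, signature, expiry, key_name)


# Placeholder values; a real key comes from a shared access policy.
token = make_sas_token(
    "https://example-ns.servicebus.windows.net/example-hub",
    key_name="send-policy",
    key="not-a-real-key",
)
print(token.split("&")[0])  # prints only the resource part, not the signature
```

A publisher sending over HTTPS would typically pass such a token in the `Authorization` header. The expiry keeps a leaked token from being usable indefinitely.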
27. Choosing Event Hubs: when Azure Event Hubs can be a good option
When you want… | Description
To automatically scale capacity | Auto-inflate enables you to start small with the minimum required throughput units. It then scales automatically up to the maximum limit of throughput units as traffic increases.
A serverless solution | Azure Event Hubs is a serverless service. Your ability to fine-tune the performance is limited.
To integrate easily with Azure Stream Analytics | You can configure Azure Event Hubs as a streaming data input to Azure Stream Analytics via the Azure Portal without any coding.
A low-latency ingestion service | Azure Event Hubs latency can be less than 50 ms on average, with latency under 100 ms 99 percent of the time.*
To store ingested data in Azure Blob Storage or Azure Data Lake Store | Azure Event Hubs has built-in integration with these two Azure storage services.
* Note that other services might have a similar latency, but there are no publicly available numbers.
28. Event Hubs in the real world: Halo 5
80 million requests per minute within 24 hours of release
All game telemetry and statistics run through Azure Event Hubs, are processed, and are sent back to the console
1 Dedicated Capacity cluster (3 CUs)
Zero administration by the Halo team
30. Additional Information
• Azure Free Account: https://azure.microsoft.com/free/
• Azure Marketplace (VM Images, VM Cluster Templates, Container Images, Helm Chart): https://azuremarketplace.microsoft.com/en-us/marketplace/apps?search=Kafka
• Third Party Managed Kafka Clusters
  • Confluent Cloud: https://confluent.jp/confluent-cloud/
  • Instaclustr: https://www.instaclustr.com/solutions/microsoft-azure/
• Azure HDInsight: https://docs.microsoft.com/azure/hdinsight/kafka/apache-kafka-introduction
• Azure Event Hubs: https://docs.microsoft.com/azure/event-hubs/event-hubs-for-kafka-ecosystem-overview
• Kafka Connect
  • Azure Blob Storage: https://docs.confluent.io/current/connect/kafka-connect-azure-blob-storage/
  • Azure SQL Database (SQL Server): https://docs.confluent.io/current/connect/kafka-connect-cdc-mssql/
  • Azure IoT Hub: https://docs.microsoft.com/en-us/azure/hdinsight/kafka/apache-kafka-connector-iot-hub