4. Workshop Notices
• Q&A
  • If you have any questions, please send them through the Q&A. The speakers will answer them directly after the presentations.
• Online survey
  • Please share your feedback on today's workshop. We will use it to prepare better content in the future.
  • The survey link is (1) shared in the Zoom chat, and (2) opened automatically in your web browser after the event ends.
7. [Diagram: a traditional architecture with point-to-point data flow: apps write to transactional databases (DB), exchange messages via MOM and EAI/ESB, and feed analytics databases (DWH, NoSQL DBs, Big Data Analytics) through ETL.]
8. A streaming platform gives every person and every system in the organization a single source of truth for data.
[Diagram: the same apps and databases as on the previous slide, now connected through a central Streaming Platform instead of point-to-point links.]
26. ksqlDB: connectors, stream processing, state stores, Push and Pull queries
[Diagram: DBs and apps feed ksqlDB through CONNECTORS; STREAM PROCESSING materializes results into STATE STORES, which serve apps via PULL (1) and PUSH (2) queries.]
27. Capture data, perform continuous transformations, create materialized views, and serve lookups against them

Capture data (Connector):
CREATE SOURCE CONNECTOR jdbcConnector WITH (
  'connector.class' = '...JdbcSourceConnector',
  'connection.url' = '...',
  …);

Perform continuous transformations (Stream):
CREATE STREAM purchases AS
  SELECT viewtime, userid, pageid, TIMESTAMPTOSTRING(viewtime, 'yyyy-MM-dd')
  FROM pageviews;

Create materialized views (Table):
CREATE TABLE orders_by_country AS
  SELECT country, COUNT(*) AS order_count, SUM(order_total) AS order_total
  FROM purchases
    LEFT JOIN user_profiles ON purchases.customer_id = user_profiles.customer_id
  WINDOW TUMBLING (SIZE 5 MINUTES)
  GROUP BY country
  EMIT CHANGES;

Serve lookups against materialized views (Query):
SELECT * FROM orders_by_country WHERE country='usa';
28. Filters: filter messages to a separate topic in real-time
[Diagram: STREAM PROCESSING reads topic 'Blue and Red Widgets' (partitions 0-2) and writes only the blue widgets to topic 'Blue Widgets Only' (partitions 0-2).]
28
29. Filters
CREATE STREAM high_readings AS
  SELECT sensor,
         reading
  FROM readings
  WHERE reading > 41
  EMIT CHANGES;
30. Joins: easily merge and join topics to one another
[Diagram: STREAM PROCESSING joins topic 'Blue and Red Widgets' with topic 'Green and Yellow Widgets' (three partitions each) into topic 'Blue and Yellow Widgets'.]
30
32. Aggregate: aggregate streams into tables and capture summary statistics
[Diagram: STREAM PROCESSING aggregates topic 'Blue and Red Widgets' (partitions 0-2) into table 'Widget Count': Blue = 15, Red = 9.]
32
33. Aggregate
CREATE TABLE avg_readings AS
  SELECT sensor,
         AVG(reading) AS avg_reading
  FROM readings
  GROUP BY sensor
  EMIT CHANGES;
35. How the training works
• You will work in Zoom and a browser (the instructions, the ksqlDB console, and Confluent Control Center).
• If you have questions, you can post them via the Zoom chat.
• If you get stuck, don't worry: use the "Raise hand" button in Zoom and a Confluent engineer will help you.
• Avoid jumping ahead by copy-pasting. Most people learn better when they actually type the code into the console, and you can learn from your mistakes.
38. Use Case
38
• Customers leave ratings/reviews through the app and web; each arrives as an event like the one below.
• 9/12/19 12:55:05 GMT, 5313, {
"rating_id": 5313,
"user_id": 3,
"stars": 1,
"route_id": 6975,
"rating_time": 1519304105213,
"channel": "web",
"message": "why is it so difficult to keep the bathrooms clean?"
}
39. Use Case - Approach 1
39
Move the reviews into a data warehouse.
At the end of each month, process the reviews and forward them to the departments that received a significant number of comments.
This approach tells you what has already happened.
40. Use Case - Approach 2
40
Process the reviews in real time and provide the airport management team with a dashboard.
The dashboard sorts reviews by topic, so it can quickly surface problems related to cleanliness.
This approach tells you what is happening right now.
41. Use Case - Approach 3
41
Process the reviews in real time.
Set up an alert for 3 bad reviews related to restroom cleanliness within the last 10 minutes.
Automatically dispatch cleaning staff to handle the problem.
This approach does something based on what is happening.
52. The key to mutability is … the event.key!
52

                                                  Stream    Table
Has unique key constraint?                        No        Yes
First event with key 'alice' arrives              INSERT    INSERT
Another event with key 'alice' arrives            INSERT    UPDATE
Event with key 'alice' and value == null arrives  INSERT    DELETE
Event with key == null arrives                    INSERT    <ignored>

RDBMS analogy: a Stream is ~ a Table that has no unique key and is append-only.
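The INSERT/UPDATE/DELETE semantics in the table above can be sketched in plain Python (illustrative only, not Kafka code; `None` values model tombstones):

```python
def apply_to_stream(events):
    # A stream has no unique-key constraint: every event is an INSERT.
    return list(events)

def apply_to_table(events):
    # A table upserts by key: a later event overwrites an earlier one,
    # a None value deletes the key, and events without a key are ignored.
    table = {}
    for key, value in events:
        if key is None:
            continue              # <ignored>
        if value is None:
            table.pop(key, None)  # DELETE
        else:
            table[key] = value    # INSERT or UPDATE
    return table

events = [("alice", "Paris"), ("alice", "Rome"), ("bob", "Sydney"),
          ("alice", None), (None, "Lima")]
print(apply_to_stream(events))  # all five events kept, in order
print(apply_to_table(events))   # {'bob': 'Sydney'}
```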
56. KSQL for Data Exploration
An easy way to inspect your data in Kafka
SHOW TOPICS;
SELECT page, user_id, status, bytes
FROM clickstream
WHERE user_agent LIKE 'Mozilla/5.0%';
PRINT 'my-topic' FROM BEGINNING;
56
57. KSQL for Data Transformation
Quickly make derivations of existing data in Kafka
CREATE STREAM clicks_by_user_id
  WITH (PARTITIONS=6,        -- 1: change number of partitions
        TIMESTAMP='view_time',
        VALUE_FORMAT='JSON') -- 2: convert data to JSON
  AS
SELECT * FROM clickstream
PARTITION BY user_id;        -- 3: repartition the data
57
60. KSQL for Real-Time, Streaming ETL
Filter, cleanse, process data while it is in motion
CREATE STREAM clicks_from_vip_users AS
  SELECT user_id, u.country, page, action
  FROM clickstream c
    LEFT JOIN users u ON c.user_id = u.user_id
  WHERE u.level = 'Platinum'; -- 1: pick only VIP users
60
61. CDC — only after state
61
The JSON data shows the information that Debezium CDC pulls from MySQL.
Notice that there is no "BEFORE" data here (it is null).
That means the record was just created, without any update. A typical example is a new customer being added for the first time.
62. CDC — before and after
62
This time there is some "BEFORE" data, because the customer record has been updated.
63. KSQL for Anomaly Detection
Aggregate data to identify patterns and anomalies in real-time
CREATE TABLE possible_fraud AS
  SELECT card_number, COUNT(*)
  FROM authorization_attempts
  WINDOW TUMBLING (SIZE 30 SECONDS) -- 2: … per 30-sec windows
  GROUP BY card_number              -- 1: aggregate data
  HAVING COUNT(*) > 3;
63
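The windowed count with a HAVING filter can be sketched in plain Python (a hypothetical `possible_fraud` helper; timestamps are in seconds):

```python
from collections import Counter

def possible_fraud(events, window_size, threshold):
    # Bucket each (card_number, timestamp) event into a tumbling window,
    # count attempts per (card, window), and keep only the groups whose
    # count exceeds the threshold, i.e. the HAVING clause.
    counts = Counter()
    for card, ts in events:
        window_start = (ts // window_size) * window_size
        counts[(card, window_start)] += 1
    return {group: c for group, c in counts.items() if c > threshold}

events = [("4567", t) for t in (1, 5, 12, 20)] + [("9999", 3)]
print(possible_fraud(events, window_size=30, threshold=3))
# {('4567', 0): 4}: four attempts in one 30-second window
```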
64. KSQL for Real-Time Monitoring
Derive insights from events (IoT, sensors, etc.) and turn them into actions
CREATE TABLE failing_vehicles AS
  SELECT vehicle, COUNT(*)
  FROM vehicle_monitoring_stream
  WINDOW TUMBLING (SIZE 1 MINUTE)
  WHERE event_type = 'ERROR'
  GROUP BY vehicle
  HAVING COUNT(*) >= 5; -- 1: now we know to alert, and whom
64
72. Partitions play a central role in Kafka
72
Topics are partitioned. Partitions enable scalability, elasticity, fault-tolerance.
In both the storage layer (brokers) and the processing layer (ksqlDB, KStreams, etc.), data is stored in, read from and written to, replicated based on, ordered based on, processed based on, and joined based on partitions.
73. Topics vs. Streams and Tables
[Diagram: in the storage layer (brokers), a Topic holds raw bytes (00100 11101 …). In the processing layer (KSQL, KStreams), a Stream is the topic plus a schema (serdes), e.g. (alice, Paris), (bob, Sydney), (alice, Rome); a Table is the stream plus aggregation, e.g. (alice, 2), (bob, 1).]
73
74. Kafka Processing
Data is processed per-partition
[Diagram: topic 'payments' (partitions P1-P4) in the storage layer is read via the network by application 'my-app', a consumer group with App Instance 1 and App Instance 2 in the processing layer.]
74
79. Windowing
79
"3 bad reviews within the last 10 minutes"
Windowed queries let you express this kind of logic in ksqlDB.
Tumbling:
WINDOW TUMBLING (SIZE 5 MINUTES)
GROUP BY key
Hopping:
WINDOW HOPPING (SIZE 5 MINUTES, ADVANCE BY 1 MINUTE)
GROUP BY key
Session:
WINDOW SESSION (60 SECONDS)
GROUP BY key
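The difference between tumbling and hopping windows can be sketched in plain Python (illustrative helpers, timestamps in seconds; these are not ksqlDB internals):

```python
def tumbling_windows(ts, size):
    # A timestamp belongs to exactly one tumbling window of length `size`.
    start = (ts // size) * size
    return [(start, start + size)]

def hopping_windows(ts, size, advance):
    # Hopping windows overlap: a new window of length `size` starts every
    # `advance` seconds, so one timestamp can fall into several windows.
    windows = []
    start = max(0, ((ts - size) // advance + 1) * advance)
    while start <= ts:
        windows.append((start, start + size))
        start += advance
    return windows

print(tumbling_windows(130, size=60))             # [(120, 180)]
print(hopping_windows(130, size=60, advance=30))  # [(90, 150), (120, 180)]
```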
80. UDF and machine learning
80
ksqlDB provides many built-in functions that simplify stream processing. Examples include:
• GEODISTANCE: compute the distance between two lat/long coordinates
• MASK: convert a string into a masked or obfuscated version
• JSON_ARRAY_CONTAINS: check whether an array contains a search value
You can extend ksqlDB's capabilities by developing user-defined functions. A common use case is implementing machine-learning algorithms through ksqlDB so that those models can contribute to real-time data transformations.
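What functions like GEODISTANCE and MASK compute can be sketched in plain Python (illustrative approximations, not ksqlDB's exact implementations):

```python
import math

def geo_distance(lat1, lon1, lat2, lon2):
    # Great-circle distance in kilometers via the haversine formula,
    # i.e. the kind of computation a GEODISTANCE-style function performs.
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def mask(s):
    # A simple MASK-style obfuscation: uppercase -> 'X', lowercase -> 'x',
    # digits -> 'n'; other characters are left as-is in this sketch.
    out = []
    for ch in s:
        if ch.isdigit():
            out.append("n")
        elif ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        else:
            out.append(ch)
    return "".join(out)

print(round(geo_distance(37.4, -122.0, 37.7, -122.4), 1))
print(mask("Card 1234"))  # Xxxx nnnn
```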
81. What can you do with ksqlDB?
81
Streaming ETL · Anomaly detection · Real-time monitoring and analytics · Sensor data and IoT · Customer 360-view
https://docs.ksqldb.io/en/latest/#what-can-i-do-with-ksqldb
82. Example: Streaming ETL pipeline
82
* Full example here
• Apache Kafka is a popular choice for powering data pipelines
• ksqlDB makes it simple to transform data within the pipeline,
preparing the messages for consumption by another system.
83. Example: Anomaly detection
83
• Identify patterns and spot anomalies in real-time data with
millisecond latency, enabling you to properly surface out-of-the-
ordinary events and to handle fraudulent activities separately.
* Full example here
91. Free eBooks
Kafka: The Definitive Guide, by Neha Narkhede, Gwen Shapira, and Todd Palino
Making Sense of Stream Processing, by Martin Kleppmann
I ❤ Logs, by Jay Kreps
Designing Event-Driven Systems, by Ben Stopford
http://cnfl.io/book-bundle
94. Max processing parallelism = #input partitions
[Diagram: a topic with partitions P1-P4 consumed by Application Instances 1-4; Application Instances 5 and 6 are *** idle ***.]
→ Need higher parallelism? Increase the original topic's partition count.
→ Higher parallelism for just one use case? Derive a new topic from the original with a higher partition count. Lower its retention to save storage.
94
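The partition-to-instance assignment can be sketched in plain Python (a simplified round-robin stand-in for Kafka's actual consumer-group assignors; instance names are illustrative):

```python
def assign_partitions(num_partitions, instances):
    # Each partition goes to exactly one instance in the consumer group,
    # so instances beyond the partition count receive nothing and sit idle.
    assignment = {inst: [] for inst in instances}
    for p in range(num_partitions):
        assignment[instances[p % len(instances)]].append(p)
    return assignment

a = assign_partitions(4, ["app-1", "app-2", "app-3", "app-4", "app-5", "app-6"])
for inst, parts in a.items():
    print(inst, parts or "*** idle ***")
```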
95. How to increase # of partitions when needed
KSQL example: the statement below creates a new stream with the desired number of partitions.
CREATE STREAM products_repartitioned
  WITH (PARTITIONS=30) AS
  SELECT * FROM products;
95
96. ‘Hot’ partitions is a problem, often caused by
1. Events not evenly distributed across partitions
2. Events evenly distributed, but certain events take longer to process
Strategies to address hot partitions include:
1a. Ingress: find a better partitioning function ƒ(event.key) for producers
1b. Storage: re-partition data into a new topic if you can’t change the original
2. Scale processing vertically, e.g. more powerful CPU instances
[Diagram: partitions P1-P4 with uneven load.]
96
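Cause 1 can be sketched in plain Python: even a good hash cannot fix a skewed key distribution, but a better key ƒ(event.key) can (crc32 is only a deterministic stand-in for Kafka's murmur2-based default partitioner; device names are illustrative):

```python
import zlib
from collections import Counter

def partition_for(key, num_partitions):
    # Deterministic stand-in for the producer's default partitioner.
    return zlib.crc32(key.encode()) % num_partitions

# 90% of events share one key, so one partition takes at least 90% of the load.
events = ["device-1"] * 90 + ["device-2"] * 5 + ["device-3"] * 5
load = Counter(partition_for(k, 4) for k in events)
print("skewed keys:", dict(load))

# Strategy 1a: a better key, e.g. sub-keying the hot device by an event
# attribute, spreads its events across partitions.
events2 = [f"device-1/{i % 10}" for i in range(90)] + ["device-2"] * 5 + ["device-3"] * 5
load2 = Counter(partition_for(k, 4) for k in events2)
print("better key: ", dict(load2))
```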
97. Joining Streams and Tables
Data must be ‘co-partitioned’
[Diagram: a Stream joined with a Table produces a Join Output (Stream).]
97
98. Joining Streams and Tables
Data must be ‘co-partitioned’
[Diagram: a table with partitions P1-P3 holding user records (bob: male, alice: female, alex: male, zoie: female, andrew: male, mina: female, natalie: female, blake: male) and a stream event (alice, Paris) in the stream's P2. The event from the stream's P2 has a matching entry for alice in the table's P2, so the join produces 'female'.]
98
99. Joining Streams and Tables (Scenario 2)
Data is looked up in same partition number
[Diagram: key 'alice' exists in multiple table partitions: (alice, male) in P1 and (alice, female) in P2. The entry in P2 (female) is used, because the stream-side event (alice, Paris) comes from the stream's partition P2.]
99
100. Joining Streams and Tables (Scenario 3)
Data is looked up in same partition number
[Diagram: key 'alice' exists only in the table's P1 != P2, so the stream event (alice, Paris) from the stream's P2 finds no match and the join yields null.]
100
101. Data co-partitioning requirements in detail
1. Same keying scheme for both input sides
2. Same number of partitions
3. Same partitioning function ƒ(event.key)
Further reading on joining streams and tables:
https://www.confluent.io/kafka-summit-sf18/zen-and-the-art-of-streaming-joins
https://docs.confluent.io/current/ksql/docs/developer-guide/partition-data.html
101
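The three requirements can be sketched in plain Python (illustrative only; crc32 stands in for Kafka's murmur2-based partitioner):

```python
import zlib

NUM_PARTITIONS = 3

def partition_for(key, num_partitions):
    # Requirement 3: the SAME partitioning function on both input sides.
    return zlib.crc32(key.encode()) % num_partitions

# Build the table side with the same keying scheme and the same partition
# count as the stream side (requirements 1 and 2).
table_partitions = [dict() for _ in range(NUM_PARTITIONS)]
for k, v in {"alice": "female", "bob": "male"}.items():
    table_partitions[partition_for(k, NUM_PARTITIONS)][k] = v

def co_partitioned_lookup(stream_event, table_partitions):
    # A join task only consults the table partition with the SAME number
    # as the stream partition the event was read from.
    key, _value, stream_partition = stream_event
    return table_partitions[stream_partition].get(key)  # None means no match

# Because both sides agree on all three requirements, the stream event for
# 'alice' lands in the same partition number as her table entry.
event = ("alice", "Paris", partition_for("alice", NUM_PARTITIONS))
print(co_partitioned_lookup(event, table_partitions))  # female
```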
102. Why is that so?
Because of how input data is mapped to stream tasks
[Diagram: Stream Task 2 reads, via the network, P2 of the stream's topic and P2 of the table's topic from the storage layer, and maintains its own processing state.]
102
103. How to re-partition your data when needed
KSQL example: the statement below creates a new stream with a changed number of partitions and a new field as event.key (so that its data is now correctly co-partitioned for joining).
CREATE STREAM products_repartitioned
  WITH (PARTITIONS=42) AS
  SELECT * FROM products
  PARTITION BY product_id;
103