Speaker: Robin Moffatt, Developer Advocate, Confluent
In this talk, we'll build a streaming data pipeline using nothing but our bare hands, the Kafka Connect API, and KSQL. We'll stream data in from MySQL, transform it with KSQL, and stream it out to Elasticsearch. Options for integrating databases with Kafka using CDC and Kafka Connect will be covered as well.
This is part 2 of 3 in Streaming ETL - The New Data Integration series.
Watch the recording: https://videos.confluent.io/watch/4cVXUQ2jCLgJNmg4kjCRqo
2. @rmoff / Steps to Building a Streaming ETL Pipeline with Apache Kafka® and KSQL
$ whoami
• Developer Advocate @ Confluent
• Working in data & analytics since 2001
• Oracle ACE Director & Dev Champion
• Blogging: https://rmoff.net & http://cnfl.io/rmoff
• Twitter: @rmoff
3.
Housekeeping Items
● This session will last about an hour.
● This session will be recorded.
● You can submit your questions by entering them into the GoToWebinar panel.
● The last 10-15 minutes will consist of Q&A.
● The slides and recording will be available after the talk.
4.
Streaming ETL with Apache Kafka and KSQL
5.
Database Offload: Hadoop / Object Storage / Cloud DW for Analytics
[Diagram: RDBMS → Kafka → HDFS / S3 / BigQuery etc.]
6.
Streaming ETL with Apache Kafka and KSQL
[Diagram: RDBMS → order items and customer topics → stream processing → customer orders]
7.
Real-time Event Stream Enrichment with Apache Kafka and KSQL
[Diagram: order events stream joined with customer data from an RDBMS → stream processing → customer orders, consumed by an application]
8.
Transform Once, Use Many
[Diagram: order events + customer data → stream processing → customer orders, consumed by multiple applications, including a new app]
9.
Transform Once, Use Many
[Diagram: order events + customer data → stream processing → customer orders, consumed by multiple applications and also streamed to HDFS / S3 etc.]
12.
The Connect API of Apache Kafka®
Reliable and scalable integration of Kafka with other systems – no coding required.
✓ Fault tolerant and automatically load balanced
✓ Extensible API
✓ Single Message Transforms
✓ Part of Apache Kafka, included in Confluent Open Source
✓ Centralized management and configuration
✓ Support for hundreds of technologies including RDBMS, Elasticsearch, HDFS, S3, syslog
✓ Supports CDC ingest of events from RDBMS
✓ Preserves data schema

{
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo",
  "table.whitelist": "sales,orders,customers"
}

https://docs.confluent.io/current/connect/
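As a sketch of how a configuration like this is deployed (the connector name and Connect host are assumptions, not from the slides), Kafka Connect accepts a JSON payload over its REST interface, typically POSTed to http://localhost:8083/connectors:

{
  "name": "jdbc-source-demo",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo",
    "table.whitelist": "sales,orders,customers"
  }
}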
13.
Kafka Connect
[Diagram: sources (syslog, flat file, CSV, JSON, MQTT) → Kafka Connect workers and tasks → Kafka brokers → Kafka Connect → sinks (Amazon S3, MQTT, etc.)]
14.
Considerations for Integration into Apache Kafka
Photo by Matthew Smith on Unsplash
• Chucking data over the fence into a Kafka topic is not enough
• We need standard ways of building data pipelines in Kafka
• Schema handling
• Serialisation formats
15.
Considerations for Integration into Apache Kafka
Photo by Matthew Smith on Unsplash
• Confluent Schema Registry & Avro is a great way to do this
• Downstream consumers can then easily use the data:
• KSQL
• Kafka Connect
• Kafka Streams
• Custom apps
16.
The Confluent Schema Registry
[Diagram: MySQL → Kafka Connect → Avro message to Kafka, Avro schema to Schema Registry; Kafka Connect reads both to write to Elasticsearch]
17.
The Confluent Schema Registry
Source (MySQL) schema is preserved
Target (Elasticsearch) schema mapping is automagically built
18.
Integrating Databases with Kafka
• CDC is a generic term referring to capturing changing data, typically from an RDBMS.
• Two general approaches:
• Query-based CDC
• Log-based CDC
There are other options, including hacks with triggers, Flashback etc., but these are system- and/or technology-specific.
Read more: http://cnfl.io/kafka-cdc
19.
Query-based CDC
• Use a database query to try and identify new & changed rows
• Implemented with the open source Kafka Connect JDBC connector
• Can import based on table names, schema, or a bespoke SQL query
• Incremental ingest driven through an incrementing ID column and/or timestamp column

SELECT * FROM my_table
WHERE col > <value of col last time we polled>

Read more: http://cnfl.io/kafka-cdc
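A sketch of what incremental query-based ingest might look like with the JDBC source connector (the table, column names, and topic prefix here are assumptions for illustration):

{
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo",
  "table.whitelist": "orders",
  "mode": "timestamp+incrementing",
  "incrementing.column.name": "id",
  "timestamp.column.name": "updated_at",
  "topic.prefix": "mysql-"
}

With mode set to timestamp+incrementing, the connector polls for rows whose timestamp or ID exceeds the last value seen, which is why the source schema needs those columns.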
20.
Log-based CDC
• Use the database's transaction log to identify every single change event
• Various CDC tools available that integrate with Apache Kafka (more on this later…)
Read more: http://cnfl.io/kafka-cdc
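As a sketch of a log-based source (hostnames, credentials, and table names here are assumptions), a Debezium MySQL connector reads the binlog rather than polling the tables:

{
  "connector.class": "io.debezium.connector.mysql.MySqlConnector",
  "database.hostname": "localhost",
  "database.port": "3306",
  "database.user": "debezium",
  "database.password": "dbz",
  "database.server.id": "42",
  "database.server.name": "demo",
  "table.whitelist": "demo.orders,demo.customers",
  "database.history.kafka.bootstrap.servers": "kafka:9092",
  "database.history.kafka.topic": "dbhistory.demo"
}

Because it tails the transaction log, every insert, update, and delete is captured as an event, with no query load on the source.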
21.
Query-based vs Log-based CDC
Photo by Matese Fields on Unsplash
• Query-based
+ Usually easier to set up, and requires fewer permissions
- Needs specific columns in source schema
- Impact of polling the DB (or a higher-latency tradeoff)
- Can't track deletes
Read more: http://cnfl.io/kafka-cdc
22.
Query-based vs Log-based CDC
Photo by Sebastian Pociecha on Unsplash
• Log-based
+ Greater data fidelity
+ Lower latency
+ Lower impact on source
- More setup steps
- Higher system privileges required
- For proprietary databases, usually $$$
Read more: http://cnfl.io/kafka-cdc
23.
Which Log-Based CDC Tool?
(For query-based CDC, use the Confluent Kafka Connect JDBC connector.)
• Open source RDBMS, e.g. MySQL, PostgreSQL:
• Debezium
• (+ paid options)
• Mainframe, e.g. VSAM, IMS:
• Attunity
• SQData
• Proprietary RDBMS, e.g. Oracle, MS SQL:
• Attunity
• IBM InfoSphere Data Replication
• Oracle GoldenGate
• SQData
• HVR
All these options integrate with Apache Kafka and Confluent Platform, including support for the Schema Registry.
Read more: http://cnfl.io/kafka-cdc
24.
"But I need to join… aggregate… filter…"
25.
KSQL is a Declarative Stream Processing Language
26.
KSQL is the Streaming SQL Engine for Apache Kafka
27.
KSQL in Development and Production
• Interactive KSQL (via REST) for development and testing: "Hmm, let me try out this idea..."
• Headless KSQL for production, once the desired KSQL queries have been identified
28.
KSQL for Streaming ETL
Joining, filtering, and aggregating streams of event data:

CREATE STREAM vip_actions AS
  SELECT userid, page, action
  FROM clickstream c
  LEFT JOIN users u
    ON c.userid = u.user_id
  WHERE u.level = 'Platinum';
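The join above presupposes that clickstream and users have already been registered over their Kafka topics. A sketch of how that might look (topic names, column list, and key column are assumptions; with Avro and the Schema Registry, the columns can typically be inferred):

-- Register the clickstream topic as a stream
CREATE STREAM clickstream (userid VARCHAR, page VARCHAR, action VARCHAR)
  WITH (KAFKA_TOPIC='clickstream', VALUE_FORMAT='AVRO');

-- Register the users topic as a table, keyed on user_id
CREATE TABLE users (user_id VARCHAR, level VARCHAR)
  WITH (KAFKA_TOPIC='users', VALUE_FORMAT='AVRO', KEY='user_id');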
29.
KSQL for Anomaly Detection
Identifying patterns or anomalies in real-time data, surfaced in milliseconds:

CREATE TABLE possible_fraud AS
  SELECT card_number, count(*)
  FROM authorization_attempts
  WINDOW TUMBLING (SIZE 5 SECONDS)
  GROUP BY card_number
  HAVING count(*) > 3;
30.
KSQL for Real-Time Monitoring
• Log data monitoring, tracking and alerting
• syslog data
• Sensor / IoT data

CREATE STREAM SYSLOG_INVALID_USERS AS
  SELECT HOST, MESSAGE
  FROM SYSLOG
  WHERE MESSAGE LIKE '%Invalid user%';

http://cnfl.io/syslogs-filtering / http://cnfl.io/syslog-alerting
31.
KSQL for Data Transformation
Make simple derivations of existing topics from the command line:

CREATE STREAM views_by_userid
  WITH (PARTITIONS=6, REPLICAS=5,
        VALUE_FORMAT='AVRO',
        TIMESTAMP='view_time') AS
  SELECT * FROM clickstream
  PARTITION BY user_id;
32.
DEMO!
33.
[Demo pipeline diagram: MySQL → Debezium (Kafka Connect) → Kafka, alongside events from the Producer API; Kafka → Kafka Connect → Elasticsearch]
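A sketch of the final hop in the demo pipeline, an Elasticsearch sink connector (the topic name and connection details here are assumptions for illustration; type.name is required by the Elasticsearch versions of this era):

{
  "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
  "connection.url": "http://localhost:9200",
  "topics": "customer-orders",
  "type.name": "kafka-connect",
  "key.ignore": "true"
}

With the Schema Registry in play, the connector uses the Avro schema to build the Elasticsearch index mapping automatically, as shown on slide 17.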