Streaming ETL - from RDBMS to Dashboard with KSQL
2. Things I am good at
•Oracle (and relational) databases
•Performance
•High-Availability
•PL/SQL and ETL
•Replication
•Exadata
•Automation/DevOps
•Linux and Solaris
•VMs and Solaris containers
© 2016 Pythian 11
3. Things I am getting good at
•Kafka and streaming
•Cloud and cloud-native data processing
•Dataflow, BigQuery
•Machine learning
•Docker
4. Things I am not good at
And have limited interest in
•“real” programming
•Especially Java
•GUIs
•Coming up with meaningful demos
5. ABOUT PYTHIAN
Pythian’s 400+ IT professionals help
companies adopt
and manage disruptive technologies
to better compete
6. TECHNICAL EXPERTISE
Infrastructure: Transforming and
managing the IT infrastructure
that supports the business
DevOps: Providing critical velocity
in software deployment by adopting
DevOps practices
Cloud: Using the disruptive
nature of cloud for accelerated,
cost-effective growth
Databases: Ensuring databases
are reliable, secure, available and
continuously optimized
Big Data: Harnessing the transformative
power of data on a massive scale
Advanced Analytics: Mining data for
insights & business transformation
using data science
7. Assumptions
•You know more about Kafka than I do
•Today you do not want to hear much
about how great Oracle is
8. AGENDA
• Motivation / what are we going to build here?
• Getting RDBMS data into Kafka
• Streaming ETL and KSQL
• Feeding Kafka into Grafana
• Demo time! (or Q&A)
12. The 3 Vs of Big Data
Volume
Velocity
Variety
13. RDBMS vs. streaming
RDBMS, the "king of state":
• Takes transactions and stores consistent state
• Tells you what *is* or *was*
• One central "system of record"
• Sucks for large volumes of logs
• Great at updates, deletes and rollbacks
• Every DB speaks SQL
Streaming:
• Stores and distributes events
• Tells you what *happened*
• Has a concept of order
• Connects many different systems
• Sucks at accounting and inventories
• Append-only
• Processing = programming*
14. State and event examples
State examples:
• I have $42 in my bank account
• The address of user xx is yyy
• Inventory
• Invoice and order data
• Spatial objects (maps)
Event examples:
• A transferred $42 to B
• Address change
• Add or remove an item
• Clickstreams and logs
• IoT messages
• Location movements (GPS)
• Gaming actions
15. Demo setup in MySQL
mysql>select * from orders order by id desc limit 5;
+-------+---------+-------+---------+
| id | product | price | user_id |
+-------+---------+-------+---------+
| 10337 | wine | 10 | 3 |
| 10336 | olives | 1 | 14 |
| 10335 | olives | 3 | 7 |
| 10334 | olives | 8 | 32 |
| 10333 | salt | 3 | 27 |
+-------+---------+-------+---------+
5 rows in set (0.00 sec)
17. Kafka-connect-jdbc
• Open source connector
• Runs a query every n seconds
• Remembers offset
• Really only captures inserts
• Broken data type mapping (Oracle)
• Issues with timezones (Oracle)
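The incrementing mode above can be pictured as a simple polling loop: the connector remembers the highest id it has seen and only fetches newer rows, which is exactly why updates and deletes go unnoticed. A minimal sketch of that logic, using an in-memory sqlite3 table as a stand-in for MySQL and a plain list as a stand-in for the Kafka topic (not the connector's actual implementation):

```python
import sqlite3

# Stand-in for MySQL: a tiny diary-like table.
db = sqlite3.connect(":memory:")
db.execute("create table diary (id integer primary key, event text)")
db.executemany("insert into diary (event) values (?)",
               [("i hate the snow",), ("i am very cold and alone",)])

captured = []      # stand-in for the Kafka topic
last_offset = 0    # the connector persists this between polls

def poll():
    """One poll cycle in incrementing mode: fetch only rows past the stored offset."""
    global last_offset
    rows = db.execute("select id, event from diary where id > ? order by id",
                      (last_offset,)).fetchall()
    for row_id, event in rows:
        captured.append({"id": row_id, "event": event})  # produce to Kafka in reality
        last_offset = row_id

poll()  # first poll captures both existing rows
db.execute("update diary set event = 'edited' where id = 1")
db.execute("insert into diary (event) values ('new entry')")
poll()  # second poll sees only the insert; the update to id 1 is silently missed
print(captured)
```

Note how the second poll picks up the new row but never re-reads row 1, illustrating the "really only captures inserts" limitation.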
19. Simple diary example
mysql>describe diary;
+-------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+----------------+
| id | smallint(6) | NO | PRI | NULL | auto_increment |
| event | varchar(42) | YES | | NULL | |
+-------+-------------+------+-----+---------+----------------+
2 rows in set (0.00 sec)
mysql>select * from diary order by id desc limit 5;
+----+---------------------------------------------+
| id | event |
+----+---------------------------------------------+
| 18 | i hate the snow |
| 17 | still jealous i did not get to go to israel |
| 16 | i am jealous i did not get to go to Israel |
| 15 | i am jealous i did not get to go to india |
| 13 | i am very cold and alone |
+----+---------------------------------------------+
5 rows in set (0.00 sec)
20. Diary example
mysql>insert into diary (event) values ('I would love to meet the meetup
guys');
Query OK, 1 row affected (0.00 sec)
mysql>select * from diary order by id desc limit 2;
+----+--------------------------------------+
| id | event |
+----+--------------------------------------+
| 19 | I would love to meet the meetup guys |
| 18 | i hate the snow |
+----+--------------------------------------+
2 rows in set (0.00 sec)
21. Connect-jdbc-diary.properties
name=mysql-diary-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
connection.url=jdbc:mysql://localhost:3306/code_demo?user=lumpy&password=lumpy
table.whitelist=diary
mode=incrementing
incrementing.column.name=id
topic.prefix=mysql-
22. Still simple but not as easy: inventory
SQL>describe inventory;
Name Null? Type
----------------------------------------- -------- ------------------------
ID NOT NULL NUMBER(8)
NAME VARCHAR2(42)
COUNT NUMBER(8)
SQL>select * from inventory;
ID NAME COUNT
---------- ------------ ----------
1 nametag 1
4 friends 294
5 selfies 1005
23. Still simple but not as easy: inventory
SQL>update inventory set count=count+2 where name='friends';
1 row updated.
SQL>select * from inventory;
ID NAME COUNT
---------- ------------ ----------
1 nametag 1
4 friends 296
5 selfies 1005
24. How about one extra column to catch updates?
alter table inventory add (last_modified timestamp);
25. How about two extra columns to catch deletes?
alter table inventory add (valid_from timestamp,
valid_to timestamp);
26. Poor man's CDC: Oracle flashback query
• SELECT … VERSIONS BETWEEN …
• This adds pseudocolumns:
• versions_starttime in TS format
• versions_operation
• The data is gathered from UNDO by default
• 11.2.0.4 and later allow basic flashback data archives without extra licenses
• Specify a retention period for as long as you want
27. flashback query output
ID NAME COUNT O VERSIONS_STARTTIME
---- ------------ ------- - --------------------------------
4 friends 42 I 27-JUN-17 05.10.17.000000000 AM
3 shrimp 1 I 27-JUN-17 05.10.17.000000000 AM
6 mouse ears 2 D 27-JUN-17 03.51.50.000000000 PM
4 friends 42 U 27-JUN-17 05.10.41.000000000 AM
6 mouse ears 2 I 27-JUN-17 05.23.11.000000000 AM
5 selfies 1001 U 27-JUN-17 03.56.12.000000000 PM
5 selfies 1000 U 27-JUN-17 05.10.41.000000000 AM
4 friends 42 U 27-JUN-17 03.51.22.000000000 PM
4 friends 92 U 27-JUN-17 10.14.14.000000000 PM
4 friends 117 U 27-JUN-17 10.23.17.000000000 PM
4 friends 142 U 27-JUN-17 10.28.21.000000000 PM
5 selfies 1002 U 27-JUN-17 03.56.22.000000000 PM
select id, name, count, versions_operation, versions_starttime from
inventory versions between scn minvalue and maxvalue order by
versions_starttime;
28. Flashback data archives
•aka Total Recall
•A background job mines UNDO
•Saves data to special tables
•Create a flashback archive per table
•Define retention
•Extends flashback query
29. flashback query config for connect-jdbc
connection.url=jdbc:oracle:thin:lumpy/lumpy@//localhost:1521/BRORCL
query=select id, name, count, versions_operation, versions_starttime
from inventory versions between scn minvalue and maxvalue
mode=timestamp+incrementing
timestamp.column.name=VERSIONS_STARTTIME
incrementing.column.name=ID
topic.prefix=connect-inventory
31. Databases already have "event" logs
•DBs typically separate data (random and async) from logs (sync and sequential)
•This increases performance and recoverability
•Bonus: a log of all changes
•Different names, same concept:
•Oracle: redo and archive logs
•MySQL: binlogs
•Postgres: Write-Ahead Logs (WAL)
•SQL Server: transaction logs
34. Maxwell for MySQL
•Reads binlogs directly
•Has its own JSON format (read: not Kafka Connect)
•Open, easy, awesome
35. Maxwell setup
maxwell --user='maxwell' --password='maxwell'
--host='127.0.0.1' --producer=kafka
--kafka.bootstrap.servers=localhost:9092
--kafka_topic=maxwell_%{database}_%{table}
36. Maxwell output
{"database":"code",
"table":"orders",
"type":"insert",
"ts":1516802610,
"xid":42025,
"commit":true,
"data":{"id":12734,
"product":"salt",
"price":7,
"user_id":24
}
}
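Because the Maxwell envelope is plain JSON, flattening it (as KSQL will do later in this deck) is straightforward in any language. A small illustrative sketch; the `flatten` helper and the `_op`/`_ts` field names are this example's own, not part of Maxwell:

```python
import json

# Example Maxwell message, as produced for an insert on the orders table.
raw = '''{"database":"code","table":"orders","type":"insert",
          "ts":1516802610,"xid":42025,"commit":true,
          "data":{"id":12734,"product":"salt","price":7,"user_id":24}}'''

def flatten(message: str) -> dict:
    """Pull the row image out of a Maxwell envelope, keeping a little metadata."""
    msg = json.loads(message)
    record = dict(msg["data"])    # the actual column values
    record["_op"] = msg["type"]   # insert / update / delete
    record["_ts"] = msg["ts"]     # commit timestamp (epoch seconds)
    return record

print(flatten(raw))
```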
38. ETL for traditional analytics
•Transform raw data from transactional systems
•Store it again, optimized for analytics and reports:
•Star schema
•Aggregates and roll-ups
•Runs in batches, typically nightly
39. Hot topics in analytics
•In-memory
•Column stores
•Report in real time
•Decision support
•Machine learning and AI
•New data sources:
•Clickstream
•IoT
•Big Data
40. KSQL and Event Stream Processing
•Kafka already has kafka streams for
processing
•But you need to actually write code
▪Same problem with Apache Spark, Dataflow (Apache Beam), etc.
•KSQL allows stream processing with the
language you probably already know
•Currently in "developer-preview"
41. What’s the deal with streaming data processing?
Bounded: finite, complete, consistent
Unbounded: infinite, incomplete, inconsistent, from different sources
43. Creating stream from topic and transforms
create stream orders_raw (data map(varchar, varchar))
with (kafka_topic = 'maxwell_code_orders', value_format = 'JSON');
ksql>describe orders_raw;
Field | Type
------------------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
DATA | MAP[VARCHAR(STRING),VARCHAR(STRING)]
------------------------------------------------
44. Creating stream from topic and transforms
ksql>select * from orders_raw limit 5;
1516805044165 | {"database":"code","table":"orders","pk.id":546} |
{product=wine, user_id=31, price=1, id=546}
1516805044304 | {"database":"code","table":"orders","pk.id":547} |
{product=salt, user_id=17, price=2, id=547}
1516805044423 | {"database":"code","table":"orders","pk.id":548} |
{product=salt, user_id=16, price=6, id=548}
1516805044550 | {"database":"code","table":"orders","pk.id":549} |
{product=olives, user_id=11, price=8, id=549}
1516805044683 | {"database":"code","table":"orders","pk.id":550} |
{product=salt, user_id=36, price=3, id=550}
LIMIT reached for the partition.
Query terminated
45. Creating stream from topic and transforms
create stream orders_flat as select data['id'] as id,
data['product'] as product,
data['price'] as price,
data['user_id'] as user_id
from orders_raw;
ksql>describe orders_flat;
Field | Type
-------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
ID | VARCHAR(STRING)
PRODUCT | VARCHAR(STRING)
PRICE | VARCHAR(STRING)
USER_ID | VARCHAR(STRING)
-------------------------------------
46. Creating stream from topic and transforms
create stream orders as select cast(id as integer) as id,
product,
cast(price as bigint) as price,
cast(user_id as integer) as user_id
from orders_flat;
ksql>describe orders;
Field | Type
-------------------------------------
ROWTIME | BIGINT (system)
ROWKEY | VARCHAR(STRING) (system)
ID | INTEGER
PRODUCT | VARCHAR(STRING)
PRICE | BIGINT
USER_ID | INTEGER
-------------------------------------
47. Creating stream from topic and transforms
ksql>select * from orders limit 5;
1516805228829 | {"database":"code","table":"orders","pk.id":2031} | 2031 |
olives | 1 | 21
1516805228964 | {"database":"code","table":"orders","pk.id":2032} | 2032 |
salt | 2 | 28
1516805229114 | {"database":"code","table":"orders","pk.id":2033} | 2033 |
wine | 1 | 26
1516805229254 | {"database":"code","table":"orders","pk.id":2034} | 2034 |
wine | 5 | 2
1516805229377 | {"database":"code","table":"orders","pk.id":2035} | 2035 |
salt | 5 | 1
LIMIT reached for the partition.
Query terminated
51. Slicing a stream into windows
08:00 08:05 08:10 08:15 08:20
52. Late arrivals make this more complicated…
08:00 08:05 08:10 08:15 08:20
event_ts=8:02
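With a hopping window, each event falls into size/advance overlapping windows, and a late arrival is simply assigned to the (possibly already closed) windows its event timestamp belongs to. A rough sketch of that assignment, assuming timestamps in seconds and windows aligned to multiples of the advance interval (this is an illustration of the concept, not KSQL's internal code):

```python
def hopping_windows(event_ts: int, size: int, advance: int) -> list:
    """Return start times of all windows [start, start + size) containing event_ts."""
    # The latest window that can contain the event starts at the aligned slot <= event_ts.
    last_start = (event_ts // advance) * advance
    # Walk backwards in steps of `advance` while the window still covers event_ts.
    starts = range(last_start, event_ts - size, -advance)
    return sorted(s for s in starts if s >= 0)

# The 8:02 late event from the slide, with a 60s window advancing every 15s,
# lands in four overlapping windows.
print(hopping_windows(8 * 3600 + 2 * 60, size=60, advance=15))
```

Each event belongs to size/advance windows (here 60/15 = 4), which is why the windowed aggregate below emits several rows per product.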
56. Create a windowed aggregate in ksql
create table orders_per_min as select product,
sum(price) amount
from orders
window hopping (size 60 seconds,
advance by 15 seconds)
group by product;
CREATE TABLE orders_per_min_ts as select rowTime as event_ts, *
from orders_per_min;
57. Create a windowed aggregate in ksql
ksql>select event_ts, product, amount from
orders_per_min_ts limit 20;
1516805280000 | olives | 444
1516805295000 | olives | 436
1516805310000 | olives | 307
1516805325000 | olives | 125
1516805280000 | salt | 921
1516805295000 | salt | 906
1516805310000 | salt | 528
1516805325000 | salt | 229
1516805280000 | wine | 470
1516805295000 | wine | 470
1516805310000 | wine | 305
1516805325000 | wine | 103
58. Aggregate functions
Function     | Example                | Description
-------------|------------------------|-----------------------------------------------------------------
COUNT        | COUNT(col1)            | Count the number of rows
MAX          | MAX(col1)              | Return the maximum value for a given column and window
MIN          | MIN(col1)              | Return the minimum value for a given column and window
SUM          | SUM(col1)              | Sum the column values
TOPK         | TOPK(col1, k)          | Return the top k values for the given column and window
TOPKDISTINCT | TOPKDISTINCT(col1, k)  | Return the distinct top k values for the given column and window
59. Demo time!
Pipeline: MySQL → Maxwell → Kafka → KSQL → Elastic
Huge credit to the GitHub clickstream demo
61. Summary
•RDBMS also want to speak "stream"
•Stream processing is coming fast and is here to stay
•KSQL is something to be excited about
https://github.com/bjoernrost/mysql-ksql-etl-demo