
KSQL: Streaming SQL for Kafka

This is an introduction to KSQL. KSQL is an open source, Apache 2.0 licensed streaming SQL engine that enables stream processing against Apache Kafka.
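To give a flavor of the idea, a KSQL query is ordinary SQL that runs continuously against a Kafka topic. A minimal sketch (the stream and column names here are illustrative, not from the talk):

  -- Illustrative continuous query: emits a row for every new matching event
  -- arriving on the underlying Kafka topic.
  SELECT userid, page FROM pageviews WHERE page LIKE '%checkout%';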

  1. KSQL: Streaming SQL for Kafka. An Introduction. Neil Avery, @avery_neil, September 2017
  2. August 2017, Kafka Summit SF announcement: a Developer Preview of KSQL, a streaming SQL engine for Apache Kafka™, from Confluent
  3. Agenda
     ● What is KSQL for?
     ● Why KSQL?
     ● KSQL concepts
     ● Demo: working with KSQL to process and visualize data
     ● Core concepts: Stream and Table
     ● Understand the KSQL ecosystem
     ● Roadmap
  4. What is it for?
     ● Streaming ETL
       ○ Kafka is popular for data pipelines.
       ○ KSQL enables easy transformations of data within the pipe:
         CREATE STREAM vip_actions AS
           SELECT userid, page, action FROM clickstream c
           LEFT JOIN users u ON c.userid = u.user_id
           WHERE u.level = 'Platinum';
     ● Anomaly detection
       ○ Identifying patterns or anomalies in real-time data, surfaced in milliseconds:
         CREATE TABLE possible_fraud AS
           SELECT card_number, count(*)
           FROM auth_attempts
           WINDOW TUMBLING (SIZE 5 SECONDS)
           GROUP BY card_number
           HAVING count(*) > 3;
     ● Monitoring
       ○ Log data monitoring, tracking, and alerting
       ○ Sensor / IoT data:
         CREATE TABLE error_counts AS
           SELECT error_code, count(*)
           FROM monitoring_strm
           WINDOW TUMBLING (SIZE 1 MINUTE)
           WHERE type = 'ERROR'
           GROUP BY error_code;
  5. Why KSQL? Stream processing development is hard: it requires developer skills, which is a barrier if you are a data scientist, analyst, or other non-developer.
     ● SQL based; simple and intuitive
     ● SQL simplifies deployment: no jars, no artifacts or binaries; just run SQL
     ● Interact with and access your data via the CLI: SELECT * from XXX where A,B,C (a concrete sketch follows below)
     ● Easily get data in and out of Kafka (and process it)
     ● Use SQL to process your data by leveraging Kafka Streams
     ● Built on Kafka and its Streams API: distributed, scalable, reliable, and real-time
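     To make that interactive style concrete, here is a minimal sketch of an ad-hoc CLI query, assuming the CLICKSTREAM stream defined later in the demo (the column choice is illustrative):

      -- Ad-hoc query from the CLI: filter error responses out of the live stream.
      -- CLICKSTREAM and its columns are the ones created later in this deck (slide 11).
      ksql> SELECT ip, request, status FROM clickstream WHERE status >= 400 LIMIT 5;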
  6. KSQL Concepts
     ● STREAM and TABLE as first-class citizens
     ● Interpretations of topic content
     ● STREAM: data in motion
     ● TABLE: collected state of a stream (aggregations)
       ○ One record per key (per window)
       ○ Current values (compacted topic) ← not yet in KSQL
       ○ Changelog
     (A sketch contrasting the two follows below.)
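     As an illustration of the stream/table pairing (the pageviews names are hypothetical, not part of the demo): the stream holds every immutable fact, while a table derived from it holds one evolving row per key:

      -- Hypothetical stream: one immutable fact per page-view event.
      CREATE STREAM pageviews (userid varchar, page varchar)
        WITH (kafka_topic = 'pageviews', value_format = 'json');

      -- Hypothetical table: the evolving count per user, i.e. the collected
      -- state (changelog) of the stream above.
      CREATE TABLE views_per_user AS
        SELECT userid, count(*) AS views FROM pageviews GROUP BY userid;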
  7. Let’s try it out... > KSQL
  8. We can build this…
  9. Start our Docker environment and generate data. Launch the clickstream Docker image and run Kafka:
      export KAFKA_HEAP_OPTS="-Xmx256M -Xms256M"
      $ docker run -p 33000:3000 -it confluentinc/ksql-clickstream-demo bash
      root@bf73923012ab:/# confluent start
      Starting zookeeper
      zookeeper is [UP]
      Starting kafka
      <<snip>>
     Run the data generator to simulate web traffic:
      root@bf73923012ab:/# ksql-datagen -daemon quickstart=clickstream format=json topic=clickstream maxInterval=100 iterations=500000
      Writing console output to /tmp/ksql-logs/ksql.out
      root@bf73923012ab:/# tail -f /tmp/ksql-logs/ksql.out
      111.203.236.146 --> ([ '111.203.236.146' | 36 | '-' | '07/Sep/2017:11:07:20 +0000' | 1504782440181 | 'GET /site/login.html HTTP/1.1' | '407' | '4196' | '-' | 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36' ])
  10. Start the KSQL CLI in client-server mode. Start the KSQL server:
      $ ksql-server-start /etc/ksql/ksqlserver.properties > /tmp/ksql-logs/ksql-server.log 2>&1 &
     Start the CLI against the server on port 8080:
      $ ksql-cli remote http://localhost:8080
      [KSQL ASCII-art banner: Streaming SQL Engine for Kafka]
      <<snip>>
  11. Building a Stream. STREAM: a stream is an unbounded sequence of structured data (“facts”). For example, a stream of financial transactions such as “Alice sent $100 to Bob, then Charlie sent $50 to Bob”. Facts in a stream are immutable: new facts can be inserted into a stream, but existing facts can never be updated or deleted.
      CREATE STREAM clickstream (
        _time bigint, time varchar, ip varchar, request varchar,
        status int, userid varchar, bytes bigint, agent varchar)
      WITH (kafka_topic = 'clickstream', value_format = 'json');
  12. KSQL> Working with Streams (a persistent variant of query 6 is sketched below)
      1. ksql> list TOPICS;
      2. ksql> CREATE STREAM clickstream (_time bigint, time varchar, ip varchar, request varchar, status int, userid varchar, bytes bigint, agent varchar) with (kafka_topic = 'clickstream', value_format = 'json');
      3. ksql> list STREAMS;
      4. ksql> DESCRIBE CLICKSTREAM;
      5. ksql> SELECT * from CLICKSTREAM limit 10;
      6. ksql> SELECT * from CLICKSTREAM WHERE request like '%html%';
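     The transient query in step 6 can also be made persistent, so that matching rows are continuously written to a new Kafka topic. A minimal sketch; the HTML_PAGES name is hypothetical, not part of the demo:

      -- Hypothetical derived stream: a continuous query that keeps writing
      -- every matching row from CLICKSTREAM to a new backing topic.
      ksql> CREATE STREAM html_pages AS SELECT ip, request, status FROM clickstream WHERE request like '%html%';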
  13. Create and interact with a Table. TABLE: a table is a view of a STREAM and represents a collection of evolving facts. We could have a table that contains the latest financial information, such as “Bob’s current account balance is $150”. It is similar to a traditional database table, but enriched by streaming semantics such as windowing.
      ● Facts in a table are mutable: new facts can be inserted into the table, and existing facts can be updated or deleted.
      ● Tables can be created from a Kafka topic or derived from streams and tables.
      CREATE TABLE IP_SUM as
        SELECT ip, sum(bytes)/1024 as kbytes
        FROM CLICKSTREAM WINDOW SESSION (300 second)
        GROUP BY ip;
  14. KSQL CLI> Build a TABLE using SELECT
      ksql> SELECT ip, sum(bytes)/1024 as kbytes FROM CLICKSTREAM WINDOW SESSION (300 second) GROUP BY ip;
      111.145.8.144 | 4
      222.245.174.248 | 5
      233.90.225.227 | 39
      <<snip>>
      ksql> CREATE TABLE IP_SUM as SELECT ip, sum(bytes)/1024 as kbytes FROM CLICKSTREAM window SESSION (300 second) GROUP BY ip;
      ksql> SELECT * from IP_SUM limit 10;
      1504788602258 | 233.173.215.103 : Window{start=1504788556778 end=-} | 233.173.215.103 | 374
      <<snip>>
  15. KSQL CLI> Inspect the TABLE
      ksql> LIST TABLES;
       Table Name | Kafka Topic | Format | Windowed
      ----------------------------------------------
       IP_SUM     | IP_SUM      | JSON   | true
      ksql> DESCRIBE IP_SUM;
       Field   | Type
      ---------------------------
       ROWTIME | BIGINT
       ROWKEY  | VARCHAR(STRING)
       IP      | VARCHAR(STRING)
       KBYTES  | BIGINT
      ksql> SELECT * from IP_SUM where IP like '%33%' limit 10;
      1505314606146 | 233.203.236.146 : Window{start=1505314602405 end=-} | 233.203.236.146 | 4
  16. Visualize the Table in Grafana
      1. Build a timestamped TABLE from the table (Elasticsearch needs timestamped data):
         ksql> CREATE TABLE IP_SUM_TS as SELECT rowTime as event_ts, * FROM IP_SUM;
      2. Start Elasticsearch:
         $ /etc/init.d/elasticsearch start
         [....] Starting Elasticsearch Server
      3. Start Grafana:
         $ /etc/init.d/grafana-server start
      4. Connect the table IP_SUM_TS to Elasticsearch and add the datasource to Grafana:
         # cd /usr/share/doc/ksql-clickstream-demo/
         # ./ksql-connect-es-grafana.sh ip_sum_ts
  17. Viewing the data in Grafana
  18. Running the Clickstream demo. From https://github.com/confluentinc/ksql/tree/0.1.x/ksql-clickstream-demo:
      1. # ksql-datagen quickstart=clickstream_users format=json topic=clickstream_users maxInterval=10 iterations=50
      2. # ksql-datagen quickstart=clickstream_codes format=json topic=clickstream_codes maxInterval=20 iterations=100
      3. ksql> run script '/usr/share/doc/ksql-clickstream-demo/clickstream-schema.sql';
      4. # cd /usr/share/doc/ksql-clickstream-demo
         # ./ksql-tables-to-grafana.sh
         Loading Clickstream-Demo TABLES to Confluent-Connect => Elastic => Grafana datasource
         Logging to: /tmp/ksql-connect.log
         Charting CLICK_USER_SESSIONS_TS
         <<snip>>
      5. # ./clickstream-analysis-dashboard.sh
  19. View the dashboard
  20. Do you think that’s a table you are querying?
  21. The Stream-Table duality
  22. Recap: Stream-Table duality
      ● STREAM and TABLE as first-class citizens
      ● Interpretations of topic content
      ● STREAM: data in motion
      ● TABLE: collected state of a stream (aggregations)
        ○ One record per key (per window)
        ○ Current values (compacted topic) ← not yet in KSQL
        ○ Changelog
  23. Window Aggregations. Three types are supported (the same as in Kafka Streams):
      ● TUMBLING: fixed-size, non-overlapping, gap-less windows
        ○ SELECT ip, count(*) AS hits FROM clickstream WINDOW TUMBLING (size 1 minute) GROUP BY ip;
      ● HOPPING: fixed-size, overlapping windows
        ○ SELECT ip, SUM(bytes) AS bytes_per_ip_and_bucket FROM clickstream WINDOW HOPPING (size 20 second, advance by 5 second) GROUP BY ip;
      ● SESSION: dynamically-sized, non-overlapping, data-driven windows
        ○ SELECT ip, SUM(bytes) AS bytes_per_ip FROM clickstream WINDOW SESSION (20 second) GROUP BY ip;
      More: http://docs.confluent.io/current/streams/developer-guide.html#windowing
  24. Resources and Admin (a sketch of this workflow follows below)
      ● LIST TOPICS;
      ● LIST STREAMS;
      ● LIST TABLES;
      ● SHOW PROPERTIES;
      ● LIST QUERIES;
      ● If you need to stop one:
        ○ TERMINATE <query-id>;
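     As a sketch of that admin workflow; the query id shown is hypothetical, so use whatever id LIST QUERIES actually reports:

      ksql> LIST QUERIES;      -- note the id of the continuous query to stop
      ksql> TERMINATE 2;       -- '2' is a hypothetical id from the list above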
  25. Functions (a usage sketch follows below)
      ● Scalar functions:
        ○ CONCAT, IFNULL, LCASE, LEN, SUBSTRING, TRIM, UCASE
        ○ ABS, CEIL, FLOOR, RANDOM, ROUND
        ○ StringToTimestamp, TimestampToString
        ○ EXTRACTJSONFIELD
        ○ CAST
      ● Aggregate functions:
        ○ SUM, COUNT, MIN, MAX
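     A minimal usage sketch combining a few of these over the demo’s CLICKSTREAM stream; the column choices are illustrative, and the TimestampToString format string is an assumption:

      -- Illustrative only: uppercase the user agent, render the epoch-millis
      -- _time column as text, and cast bytes before dividing.
      SELECT UCASE(agent),
             TIMESTAMPTOSTRING(_time, 'yyyy-MM-dd HH:mm:ss'),
             CAST(bytes AS double) / 1024
      FROM clickstream;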
  26. Developing in KSQL
      ● Interactive development using the CLI
      ● Capture SQL commands in a stream-application.sql
      ● Automate setup into your CI
      ● <more tools coming>
      set 'commit.interval.ms'='2000';
      set 'cache.max.bytes.buffering'='10000000';
      set 'auto.offset.reset'='earliest';
      DROP STREAM clickstream;
      CREATE STREAM clickstream (_time bigint, time varchar, ip varchar, request varchar, status int, userid int, bytes bigint, agent varchar) with (kafka_topic = 'clickstream', value_format = 'json');
      DROP TABLE events_per_min;
      create table events_per_min as select userid, count(*) as events from clickstream window TUMBLING (size 10 second) group by userid;
      -- VIEW - Enrich with rowTime
      DROP TABLE events_per_min_ts;
      CREATE TABLE events_per_min_ts as select rowTime as event_ts, * from events_per_min;
  27. The KSQL Architecture & Ecosystem. [Diagram: a Stream and a Table, each partitioned (p1, p2, p3), sit on top of a Kafka Topic (p1, p2, p3), with insert() and select() operations flowing between them and the topic.]
  28. Mode #1: Stand-alone, aka ‘local mode’
      ● Starts a CLI, an Engine, and a REST server all in the same JVM
      ● Ideal for laptop development, etc.
        ○ Use with default settings: > bin/ksql-cli local
        ○ Or with customized settings: > bin/ksql-cli local --properties-file foo/bar/ksql.properties
      ● Careful with service and command topic naming! (more on this in a moment...)
  29. Mode #2: Client-Server
      ● Start any number of server nodes
        ○ > bin/ksql-server-start
        ○ > bin/ksql-server-start --properties-file foo.properties
      ● Start any number of CLIs, specifying a server address as the ‘remote’ endpoint
        ○ > bin/ksql-cli remote http://server:8090
      ● All engines share the work
        ○ Instances of the same KStreams apps
        ○ Scale up/down without restarting
  30. KSQL Session Variables
      ● Just as in MySQL, Oracle, etc., there are settings that control how your CLI behaves
      ● Defaults can be set in the ksql.properties file
      ● To see a list of currently set or default variable values:
        ○ ksql> show properties;
      ● Useful examples:
        ○ num.stream.threads=4
        ○ commit.interval.ms=1000
        ○ cache.max.bytes.buffering=2000000
      ● TIP! Your new best friend for testing or building a demo is:
        ○ ksql> set 'auto.offset.reset' = 'earliest';
  31. Roadmap, 2018 [subject to change]
      ● GA of the current feature set: improved quality, stability, and operations
      ● Complete our view of what a SQL streaming platform should provide for streams and tables
      ● Additional aggregate functions: we will continue to expand the set of analytics functions
      ● Testing tools: many data platforms suffer from an inherent inability to test. With KSQL, testing capability is a primary focus, and we will provide frameworks to support continuous integration and unit testing
  32. Kafka Summit is coming to London! April 23-24, 2018. Subscribe for updates on CFP, sponsorships, and more at www.kafka-summit.org
  33. Thank you. Neil Avery, neil@confluent.io, @avery_neil
