KSQL is the streaming SQL engine for Apache Kafka. It provides an easy, fully interactive SQL interface for stream processing on Kafka. Users express their processing logic in SQL-like statements, and KSQL compiles and executes them as Kafka Streams applications. Although KSQL provides a rich set of features and built-in functions, many use cases require domain-specific processing logic that cannot be expressed in pure SQL. To support such scenarios, KSQL provides a framework for defining complex processing logic as User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs). In this talk, we provide a deep dive into the UDF/UDAF framework in KSQL. We explain how users can define custom UDFs/UDAFs and use them in their queries, and we describe how KSQL utilizes the provided UDFs/UDAFs under the hood to process streams and tables, including how UDFs process data and how UDAFs keep track of their state. Armed with this knowledge, KSQL users will be able to define and utilize complex data processing logic in their KSQL queries, and to diagnose and fix issues in defining and deploying their UDFs/UDAFs more efficiently.
Goal
● If you haven’t used KSQL yet...
○ Try it today!
● Extend KSQL with your custom computation through
○ UDF
○ UDAF
KSQL: the Streaming SQL Engine for Apache Kafka® from Confluent
● Enables stream processing with zero coding required
● The simplest way to process streams of data in real time
● Powered by Kafka: scalable, distributed, battle-tested
● All you need is Kafka; no complex deployments of bespoke systems for stream processing
What is it for?
Streaming ETL
● Kafka is popular for data pipelines.
● KSQL enables easy transformations of data within the pipeline:
CREATE STREAM vip_actions AS
SELECT userid, page, action
FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
What is it for?
Anomaly Detection
● Identifying patterns or anomalies in real-time data, surfaced in milliseconds
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number
HAVING count(*) > 3;
What is it for?
Real Time Monitoring
● Log data monitoring, tracking and alerting
● Sensor / IoT data
CREATE TABLE error_counts AS
SELECT error_code, count(*)
FROM monitoring_stream
WINDOW TUMBLING (SIZE 1 MINUTE)
WHERE type = 'ERROR'
GROUP BY error_code;
How does it work?
● Compiles streaming SQL statements into Kafka Streams apps
● Continuously reads from the source topic(s), processes, and writes the results into the sink topic
[Diagram: a streaming SQL statement compiled into a Kafka Streams app that reads from source topic(s) and writes to a sink topic]
Example
● Stream of shipments events!
CREATE STREAM shipments (
ID VARCHAR,
ORDER_ID VARCHAR,
STREET VARCHAR,
CITY VARCHAR,
STATE VARCHAR,
ZIPCODE VARCHAR,
EMAIL VARCHAR,
PHONE VARCHAR
) WITH (KAFKA_TOPIC='ShipmentsTopic', VALUE_FORMAT='JSON');
Example
● Sample continuous queries:
○ All shipments to CA:
CREATE STREAM ca_shipments AS
SELECT *
FROM shipments
WHERE STATE = 'CA';
○ Daily shipments count for each zipcode:
CREATE TABLE zip_daily_shipment_count AS
SELECT ZIPCODE, COUNT(*)
FROM shipments
WINDOW tumbling (SIZE 1 DAY)
GROUP BY ZIPCODE;
Functions
● Scalar Functions (Stateless)
○ Substring
○ Trim
○ Concat
○ Abs
○ Floor
○ ...
● Aggregate Functions (Stateful)
○ Count
○ Sum
○ Min
○ Max
○ ...
What if I need a function that is not one of the KSQL built-in functions?
Functions
● User Defined Functions (UDFs)
○ Stateless
● User Defined Aggregate Functions (UDAFs)
○ Stateful
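The stateless/stateful distinction above can be sketched in plain Java; the class and method names here are illustrative, not part of the KSQL API. A scalar function maps each input record to an output independently, while an aggregate folds records into state:

```java
import java.util.List;

// Illustrative sketch (not the KSQL API): a scalar function is a pure
// per-record mapping, while an aggregate carries state across records.
public class FunctionKinds {
    // Stateless: the output depends only on the current input record.
    public static int abs(int x) {
        return x < 0 ? -x : x;
    }

    // Stateful: the running count is carried from record to record.
    public static long count(List<String> records) {
        long state = 0;        // initialized once per key/window
        for (String r : records) {
            state += 1;        // each record updates the aggregate
        }
        return state;
    }

    public static void main(String[] args) {
        System.out.println(abs(-7));                        // per-record result
        System.out.println(count(List.of("a", "b", "c")));  // folded over all records
    }
}
```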
UDFs/UDAFs
● How?
a. Write your UDF or UDAF class in Java.
b. Deploy the JAR file to the KSQL extensions directory.
c. Use your function like any other KSQL function in your queries.
Write a UDF for KSQL
1. Create a project with a dependency on the ksql-udf module.
Gradle:
compile 'io.confluent.ksql:ksql-udf:5.3.1'
Maven POM:
<repositories>
  <repository>
    <id>confluent</id>
    <url>http://packages.confluent.io/maven/</url>
  </repository>
</repositories>
<dependencies>
  <dependency>
    <groupId>io.confluent.ksql</groupId>
    <artifactId>ksql-udf</artifactId>
    <version>5.3.1</version>
  </dependency>
</dependencies>
Write a UDF for KSQL
1. Create a project with a dependency on the ksql-udf module.
2. Create a class that is annotated with @UdfDescription.
Example: a UDF to validate email address format:

import io.confluent.ksql.function.udf.Udf;
import io.confluent.ksql.function.udf.UdfDescription;

@UdfDescription(
    name = "validateEmail",
    description = "Validates email address format")
public class MyUDFs {
}

3. Implement UDFs as public methods with the @Udf annotation.
a. Use the @UdfParameter annotation to provide more info on UDF parameters (optional).
Write a UDF for KSQL
● Email validator UDF:

import java.util.regex.Pattern;

import io.confluent.ksql.function.udf.Udf;
import io.confluent.ksql.function.udf.UdfDescription;

@UdfDescription(
    name = "validateEmail",
    description = "Validates email address format")
public class MyUDFs {

  @Udf(description = "Validates email format.")
  public boolean validateEmail(String email) {
    final String EMAIL_REGEX = "^[\\w-\\+]+(\\.[\\w]+)*@[\\w-]+(\\.[\\w]+)*(\\.[a-z]{2,})$";
    return Pattern.compile(EMAIL_REGEX, Pattern.CASE_INSENSITIVE).matcher(email).matches();
  }
}
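The same validation logic can be tried outside KSQL in plain Java. This sketch only assumes the regex above; the EmailCheck class and method name are illustrative, not part of the KSQL API:

```java
import java.util.regex.Pattern;

// Plain-Java sketch of the UDF's validation logic, so the regex can be
// exercised without a KSQL server. Illustrative names, not the KSQL API.
public class EmailCheck {
    // Same pattern as the UDF: local part, @, domain, and a TLD of 2+ letters.
    private static final Pattern EMAIL_PATTERN = Pattern.compile(
        "^[\\w-\\+]+(\\.[\\w]+)*@[\\w-]+(\\.[\\w]+)*(\\.[a-z]{2,})$",
        Pattern.CASE_INSENSITIVE);

    public static boolean isValidEmail(String email) {
        return EMAIL_PATTERN.matcher(email).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidEmail("jane.doe@example.com")); // well-formed address
        System.out.println(isValidEmail("not-an-email"));         // no @ or domain
    }
}
```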
Deploy UDFs in KSQL
● Build an uber-jar with all the dependencies.
● Copy the uber-jar into the extension directory on each KSQL server.
○ Default is $KSQL_HOME/ext
○ Can be configured via the ksql.extension.dir property for the KSQL server
● Restart every KSQL server.
Use UDFs in KSQL Queries
● All shipments with invalid email address
CREATE STREAM invalid_shipments AS
SELECT *
FROM shipments
WHERE validateEmail(email) = false;
UDFs in KSQL
● You can build much more complex UDFs, e.g. a deep learning UDF for streaming anomaly detection of MQTT IoT sensor data:
https://github.com/kaiwaehner/ksql-udf-deep-learning-mqtt-iot
Write a UDAF for KSQL
1. Create a project with a dependency on the ksql-udf module.
2. Create a class that is annotated with @UdafDescription.
Example: a UDAF to collect all order ids in a set for a shipment.
Write a UDAF for KSQL

package testudaf;

import com.google.common.collect.Lists;
import io.confluent.ksql.function.udaf.Udaf;
import io.confluent.ksql.function.udaf.UdafDescription;
import io.confluent.ksql.function.udaf.UdafFactory;
import java.util.List;

@UdafDescription(
    name = "collectOrderSet",
    description = "Collects all the orders for a shipment.")
public final class CollectOrdersSet {
  ...
}
Write a UDAF for KSQL
1. Create a project with a dependency on the ksql-udf module.
2. Create a class that is annotated with @UdafDescription.
3. Implement UDAF factories as public static methods with the @UdafFactory annotation.
a. The factory methods should return Udaf or TableUdaf.
b. Implement the UDAF logic in the returned Udaf or TableUdaf.
Write a UDAF for KSQL

@UdafDescription(
    name = "collectOrderSet",
    description = "Collects all the orders for a shipment.")
public final class CollectOrdersSet {

  private static final int LIMIT = 1000;

  @UdafFactory(description = "Collects all the orders for a shipment.")
  public static Udaf<String, List<String>> orderSetCollector() {
    return new Udaf<String, List<String>>() {
      // Implement Udaf methods
      ...
    };
  }
}
Write a UDAF for KSQL
return new Udaf<String, List<String>>() {
@Override
public List<String> initialize() {...}
@Override
public List<String> aggregate(final String thisValue, final List<String> aggregate) { ... }
@Override
public List<String> merge(final List<String> aggOne, final List<String> aggTwo) {...}
};
Write a UDAF for KSQL

// The initializer for the aggregation
@Override
public List<String> initialize() {
  return Lists.newArrayList();
}
Write a UDAF for KSQL

// Aggregates the current value into the existing aggregate
@Override
public List<String> aggregate(final String thisValue, final List<String> aggregate) {
  if (aggregate.size() < LIMIT && !aggregate.contains(thisValue)) {
    aggregate.add(thisValue);
  }
  return aggregate;
}
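Stripped of the Udaf interface, the add-if-absent-and-under-limit step can be exercised as plain Java; the class name below is illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java version of the aggregate step, without the ksql-udf
// dependency, so the set-with-limit behavior can be tested directly.
public class OrderSetAggregate {
    static final int LIMIT = 1000;

    // Adds the value only if it is new and the size cap is not reached.
    public static List<String> aggregate(String thisValue, List<String> aggregate) {
        if (aggregate.size() < LIMIT && !aggregate.contains(thisValue)) {
            aggregate.add(thisValue);
        }
        return aggregate;
    }

    public static void main(String[] args) {
        List<String> agg = new ArrayList<>();   // what initialize() would return
        aggregate("foo", agg);
        aggregate("hi", agg);
        aggregate("foo", agg);                  // duplicate, ignored
        System.out.println(agg);                // [foo, hi]
    }
}
```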
Write a UDAF for KSQL
Example: as records with key K1 and values foo, hi, and bar arrive, collectOrderSet grows the aggregate for K1 step by step: {foo} → {foo, hi} → {foo, hi, bar}.
Write a UDAF for KSQL

// Merges two aggregates when merging session windows
@Override
public List<String> merge(final List<String> aggOne, final List<String> aggTwo) {
  for (final String thisEntry : aggTwo) {
    if (aggOne.size() == LIMIT) { break; }
    if (!aggOne.contains(thisEntry)) {
      aggOne.add(thisEntry);
    }
  }
  return aggOne;
}
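The merge step, too, can be tried as plain Java, combining two per-window sets while keeping the de-duplication and the size cap (illustrative class name):

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java version of the merge step: fold the second window's
// aggregate into the first, skipping duplicates and honoring LIMIT.
public class OrderSetMerge {
    static final int LIMIT = 1000;

    public static List<String> merge(List<String> aggOne, List<String> aggTwo) {
        for (String entry : aggTwo) {
            if (aggOne.size() == LIMIT) { break; }  // respect the cap
            if (!aggOne.contains(entry)) {          // skip duplicates
                aggOne.add(entry);
            }
        }
        return aggOne;
    }

    public static void main(String[] args) {
        List<String> w1 = new ArrayList<>(List.of("foo", "hi", "bar"));
        List<String> w2 = List.of("bar", "tab");
        System.out.println(merge(w1, w2));  // [foo, hi, bar, tab]
    }
}
```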
Write a UDAF for KSQL
Example: with session windows, merge is called when sessions are joined. Suppose key K1 has two session windows separated by an inactivity gap: W1 = {foo, hi, bar} and W2 = {bar, tab}. If a record hello arrives inside the gap, the two sessions merge into a single window W3, and merge combines the aggregates: K1(W3) = {foo, hi, bar, tab, hello}.
Use UDAFs in KSQL Queries
● Set of orders for each shipment per day:
CREATE TABLE shipment_orders AS
SELECT id, collectOrderSet(order_id)
FROM shipments
WINDOW TUMBLING (SIZE 24 HOURS)
GROUP BY id;
Miscellaneous Considerations
● Security
○ Blacklisting classes
■ Optionally blacklist classes and packages so that they can't be used from a UD(A)F.
■ resource-blacklist.txt in the extension directory
○ SecurityManager
■ Blocks attempts by any UD(A)Fs to fork processes from the KSQL server.
■ Prevents them from calling System.exit(..)
● Metric Collection
○ Set the config ksql.udf.collect.metrics to true.
○ Collected metrics:
■ Average/max time for an invocation
■ Total number of invocations
■ Average number of invocations per second
● Configurable UDFs
○ Only configs whose name is prefixed with ksql.functions.<lowercase-udfname>. or ksql.functions._global_. are accessible
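The prefix rule for UDF configs can be illustrated in plain Java. This is a sketch of the filtering behavior, not KSQL's actual implementation; the class and method names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Illustration of the prefix rule (not KSQL internals): a UDF only sees
// configs under ksql.functions.<lowercase-udfname>. or ksql.functions._global_.
public class UdfConfigFilter {
    public static Map<String, Object> accessibleConfigs(
            String udfName, Map<String, Object> serverConfigs) {
        String udfPrefix = "ksql.functions." + udfName.toLowerCase() + ".";
        String globalPrefix = "ksql.functions._global_.";
        Map<String, Object> result = new HashMap<>();
        for (Map.Entry<String, Object> e : serverConfigs.entrySet()) {
            if (e.getKey().startsWith(udfPrefix) || e.getKey().startsWith(globalPrefix)) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> configs = Map.of(
            "ksql.functions.validateemail.strict", true,   // visible to validateEmail
            "ksql.functions._global_.timeout.ms", 100,     // visible to all UDFs
            "ksql.service.id", "prod");                    // not visible to any UDF
        System.out.println(accessibleConfigs("validateEmail", configs).size()); // 2
    }
}
```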
Shout-out to Mitch Seymour
● Luna: a place for developers to publish their own UDFs/UDAFs that may not otherwise be a good fit for contributing to the KSQL codebase itself
https://magicalpipelines.com/luna/