The document discusses stream processing using SQL with Norikra. It begins with an overview of stream-processing basics and of using SQL to query streaming data. It then details Norikra, open-source software for schema-less stream processing with SQL. Key features of Norikra include running on JRuby, supporting complex event types, and accepting SQL queries on streaming data without restarts for schema changes. Examples of Norikra queries are also shown.
8. Data Flow And Latency
[diagram: batch processing runs query execution over a stored data window; stream processing runs incremental query execution as data arrives]
9. Query For Stored Data
[diagram: a table of stored rows (v1,v2,v3,v4,v5,v6), queried repeatedly]
At first, all data MUST be stored.

SELECT v1, v2, COUNT(*)
FROM table
WHERE v3='x' GROUP BY v1, v2

SELECT v4, COUNT(*)
FROM table
WHERE v1 AND v2 GROUP BY v4

"All data" means "data that will not be used".
13. Query For Stream Data
[diagram: events (v1,v2,v3,v4,v5,v6) arrive on a stream and pass through each query's data window]

SELECT v1, v2, COUNT(*)
FROM table.win:xxx
WHERE v3='x' GROUP BY v1, v2

SELECT v4, COUNT(*)
FROM table.win:xxx
WHERE v1 AND v2 GROUP BY v4

All data will be discarded right after insertion.
(Bye-bye, storage system maintenance!)
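The discard-after-insertion behavior can be sketched as a fixed-size data window. This is a minimal Python illustration of the idea, not Norikra's implementation; the `LengthWindow` name and the `v3` field are taken from the slides' examples:

```python
from collections import deque

class LengthWindow:
    """A toy length-based data window: keeps only the last `size` events.

    Older events are discarded automatically as new ones arrive,
    so nothing needs to be stored long-term.
    """
    def __init__(self, size):
        self.events = deque(maxlen=size)

    def insert(self, event):
        self.events.append(event)

    def count_where(self, pred):
        return sum(1 for e in self.events if pred(e))

win = LengthWindow(size=3)
for v3 in ["x", "y", "x", "x", "x"]:
    win.insert({"v3": v3})

# Only the last 3 events remain; earlier ones were discarded.
print(win.count_where(lambda e: e["v3"] == "x"))  # -> 3
```

Time-based windows work the same way, except events expire by timestamp instead of by count.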
17. Incremental Calculation

SELECT v1, v2, COUNT(*)
FROM table.win:xxx
WHERE v3='x' GROUP BY v1, v2

[diagram: each event arriving on the stream updates a small in-memory table of per-group counts]

internal data (memory):

v1      v2      COUNT
TRUE    TRUE    1
TRUE    FALSE   2
FALSE   TRUE    37
FALSE   FALSE   3

This internal data is small enough to keep in memory.
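The incremental calculation above can be sketched in a few lines: each incoming event updates only a small per-group counter in memory. This is a simplification of what the engine does internally; the field names follow the slides:

```python
from collections import defaultdict

# internal data (memory): per-(v1, v2) counts for events matching v3 == "x"
counts = defaultdict(int)

def on_event(event):
    """Incrementally update the in-memory table; nothing else is stored."""
    if event.get("v3") == "x":
        counts[(event["v1"], event["v2"])] += 1

stream = [
    {"v1": True,  "v2": True,  "v3": "x"},
    {"v1": False, "v2": True,  "v3": "x"},
    {"v1": True,  "v2": True,  "v3": "y"},  # filtered out by WHERE v3='x'
    {"v1": False, "v2": True,  "v3": "x"},
]
for ev in stream:
    on_event(ev)

print(dict(counts))  # -> {(True, True): 1, (False, True): 2}
```

The memory needed is proportional to the number of distinct groups, not to the number of events.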
21. Data Window
Target time (or size) range of queries

Batch: a FROM-TO range in the WHERE clause:
WHERE dt >= '2014-09-13 13:30:00'
AND dt < '2014-09-13 14:20:00'

Stream: "calculate this query every 50 minutes"
Extended SQL is required:
SELECT v1, v2, COUNT(*)
FROM table.win:xxx
WHERE v3='x' GROUP BY v1, v2
22. Stream Processing With SQL
Esper: a Java library for stream processing
you must implement your own Java daemon code
schemas are required for data and queries
OSS under GPLv2
http://esper.codehaus.org/
23. Esper EPL
Select height and weight
for all events with age greater than 30
SELECT height, weight
FROM tbl
WHERE age > 30
24. Esper EPL
Count records grouped by height
for events with age greater than 30
SELECT height, COUNT(*) AS c
FROM tbl
WHERE age > 30
GROUP BY height
Without a data window, this query
never produces results
25. Esper EPL
Count records grouped by height
for events with age greater than 30,
every 1 hour
SELECT height, COUNT(*) AS c
FROM tbl.win:time_batch(1 hour)
WHERE age > 30
GROUP BY height
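A time_batch window collects events for a fixed interval, emits the aggregate once, then starts over. Here is a minimal Python sketch of that behavior, with simulated timestamps rather than the engine's own clock (the function name and event tuples are made up for illustration):

```python
from collections import Counter

def time_batch(events, interval):
    """Group (timestamp, height) events into fixed `interval` batches and
    emit a per-height count at the end of each batch, roughly as
    tbl.win:time_batch(...) with GROUP BY height would."""
    batch, batch_end = Counter(), interval
    for ts, height in events:
        while ts >= batch_end:          # batch boundary passed: emit and reset
            yield batch_end, dict(batch)
            batch, batch_end = Counter(), batch_end + interval
        batch[height] += 1
    yield batch_end, dict(batch)        # final (possibly partial) batch

events = [(5, 170), (20, 170), (40, 165), (70, 170)]  # (seconds, height)
for end, counts in time_batch(events, interval=60):
    print(end, counts)
# -> 60 {170: 2, 165: 1}
#    120 {170: 1}
```

Note the contrast with the windowless query on the previous slide: the window boundary is what triggers output.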
26. With/without Schema
Schema-full data:
strict schema: predefined fields w/ types (or reject)
schema-on-read: try to read known fields (or ignore)
Schema-less data:
any field (or ignore), any type (implicit/explicit conversion)
a good fit for services under development:
i.e. all internet services, including ours!
27. Stream Processing & Schema
Queries come first, data second, in all stream processing.
Queries automatically know which fields to query.
[diagram ("TO BE"): a schema-less (mixed) data stream carries events from an API endpoint, events from a billing service, and events of service X; query A and query B each read only their own subset of fields]
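The "queries first, data second" idea can be illustrated by letting each query declare the fields it needs and pulling only those from a mixed, schema-less stream. This is a conceptual sketch, not Norikra's internals; the field lists and helper name are made up:

```python
# Each query names only the fields it cares about; events in the mixed
# stream may carry any fields at all.
QUERY_A_FIELDS = ("name", "age")          # e.g. SELECT name, age ...
QUERY_B_FIELDS = ("corp", "current")      # e.g. SELECT corp, current ...

def project(event, fields):
    """Take the subset of fields a query needs; skip events missing any."""
    if all(f in event for f in fields):
        return {f: event[f] for f in fields}
    return None  # this event does not feed this query

stream = [
    {"name": "tagomoris", "age": 34, "corp": "LINE", "current": "Taipei"},
    {"name": "hadoop", "corp": "ASF"},   # no "age": ignored by query A
]

rows_a = [r for e in stream if (r := project(e, QUERY_A_FIELDS))]
rows_b = [r for e in stream if (r := project(e, QUERY_B_FIELDS))]
print(rows_a)  # -> [{'name': 'tagomoris', 'age': 34}]
print(rows_b)  # -> [{'corp': 'LINE', 'current': 'Taipei'}]
```

No schema is declared anywhere; the queries themselves define which fields matter.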
31. Norikra:
Schema-less Stream Processing with SQL
Server software, runs on JVM
Open source software (GPLv2)
http://norikra.github.io/
https://github.com/norikra/norikra
32. Norikra:
Schema-less event stream:
Add/Remove data fields whenever you want
SQL:
No more restarts to add/remove queries
w/ JOINs, w/ SubQueries
w/ UDF (in Java/Ruby from rubygem)
Truly Complex events:
Nested Hash/Array, accessible directly from SQL
HTTP RPC w/ JSON or MessagePack (fluentd plugin available!)
33. How To Setup Norikra:
Install JRuby:
download jruby.tar.gz, extract it, and add it to $PATH
or use rbenv:
rbenv install jruby-1.7.xx
rbenv shell jruby-..
Install Norikra:
gem install norikra
Start the Norikra server:
norikra start
34. Norikra Interface:
Command line: norikra-client
norikra-client target open ...
norikra-client query add ...
tail -f ... | norikra-client event send ...
WebUI
show status
show/add/remove queries
HTTP API
JSON, MessagePack
36. Norikra Queries: (1)
{"name":"tagomoris",
"age":34, "address":"Tokyo",
"corp":"LINE", "current":"Taipei"}
SELECT name, age
FROM events
{"name":"tagomoris","age":34}
37. Norikra Queries: (1)
{"name":"tagomoris",
"address":"Tokyo",
"corp":"LINE", "current":"Taipei"}
without "age"
SELECT name, age
FROM events
nothing
38. Norikra Queries: (2)
{"name":"tagomoris",
"age":34, "address":"Tokyo",
"corp":"LINE", "current":"Taipei"}
SELECT name, age
FROM events
WHERE current="Taipei"
{"name":"tagomoris","age":34}
39. Norikra Queries: (2)
{"name":"hadoop",
"age":99, "address":"Somewhere",
"corp":"ASF", "current":"Elsewhere"}
SELECT name, age
FROM events
WHERE current="Taipei"
nothing
40. Norikra Queries: (3)
SELECT age, COUNT(*) as cnt
FROM events.win:time_batch(5 mins)
GROUP BY age
41. Norikra Queries: (3)
{"name":"tagomoris",
"age":34, "address":"Tokyo",
"corp":"LINE", "current":"Taipei"}
SELECT age, COUNT(*) as cnt
FROM events.win:time_batch(5 mins)
GROUP BY age
every 5 mins
{"age":34,"cnt":3}, {"age":33,"cnt":1}, ...
42. Norikra Queries: (4)
{"name":"tagomoris",
"age":34, "address":"Tokyo",
"corp":"LINE", "current":"Taipei"}
SELECT age, COUNT(*) as cnt
FROM
events.win:time_batch(5 mins)
GROUP BY age
SELECT max(age) as max
FROM
events.win:time_batch(5 mins)
{"age":34,"cnt":3}, {"age":33,"cnt":1}, ...
{"max":51}
every 5 mins
43. Norikra Queries: (5)
{"name":"tagomoris",
"user":{"age":34, "corp":"LINE",
"address":"Tokyo"},
"current":"Taipei",
"speaker":true,
"attend":[true,true,false, ...]
}
SELECT age, COUNT(*) as cnt
FROM events.win:time_batch(5 mins)
GROUP BY age
44. Norikra Queries: (5)
{"name":"tagomoris",
"user":{"age":34, "corp":"LINE",
"address":"Tokyo"},
"current":"Taipei",
"speaker":true,
"attend":[true,true,false, ...]
}
SELECT user.age, COUNT(*) as cnt
FROM events.win:time_batch(5 mins)
GROUP BY user.age
45. Norikra Queries: (5)
{"name":"tagomoris",
"user":{"age":34, "corp":"LINE",
"address":"Tokyo"},
"current":"Taipei",
"speaker":true,
"attend":[true,true,false, ...]
}
SELECT user.age, COUNT(*) as cnt
FROM events.win:time_batch(5 mins)
WHERE current="Taipei"
AND attend.$0 AND attend.$1
GROUP BY user.age
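Nested-field references such as user.age and attend.$0 can be read as path lookups into the event's hashes and arrays. A sketch of that resolution in Python (the $N array syntax follows the slides; the `resolve` helper is made up for illustration, not Norikra's API):

```python
def resolve(event, path):
    """Resolve a Norikra-style field path: dots descend into hashes,
    $N indexes into arrays (e.g. "user.age", "attend.$0")."""
    node = event
    for part in path.split("."):
        if part.startswith("$"):
            node = node[int(part[1:])]   # array index, e.g. $0 -> [0]
        else:
            node = node[part]            # hash key
    return node

event = {
    "name": "tagomoris",
    "user": {"age": 34, "corp": "LINE", "address": "Tokyo"},
    "current": "Taipei",
    "speaker": True,
    "attend": [True, True, False],
}
print(resolve(event, "user.age"))    # -> 34
print(resolve(event, "attend.$0"))   # -> True
```

This is why truly complex (nested) events remain directly queryable from SQL without flattening them first.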
47. Use case 1:
External API call reports for partners (LINE)
External API call for LINE Business Connect
LINE backend sends requests to partner’s API
endpoint using users’ messages
http://developers.linecorp.com/blog/?p=3386
48. Use case 1:
External API call reports for partners (LINE)
API error response summaries
http://developers.linecorp.com/blog/?p=3386
49. Use case 1:
External API call reports for partners (LINE)
[diagram: the channel gateway handles requests to the partner's server; its logs feed a Norikra query, and the query results go to MySQL and mail]
SELECT channelId AS channel_id, reason, detail,
       count(*) AS error_count,
       min(timestamp) AS first_timestamp,
       max(timestamp) AS last_timestamp
FROM api_error_log.win:time_batch(60 sec)
GROUP BY channelId, reason, detail
HAVING count(*) > 0
http://developers.linecorp.com/blog/?p=3386
50. Use case 2:
Prompt reports for Ad service console
Prompt reports with Norikra + Fixed reports with Hive
[diagram: app servers send impression logs through Fluentd into HDFS; the console service executes a Hive query (daily) and fetches query results (frequently)]
51. Use case 2:
Prompt reports for Ad service console
Hive query for fixed reports
SELECT yyyymmdd, hh, campaign_id, region, lang,
       COUNT(*) AS click,
       COUNT(DISTINCT member_id) AS uu
FROM (
  SELECT yyyymmdd, hh,
         get_json_object(log, '$.campaign.id') AS campaign_id,
         get_json_object(log, '$.member.region') AS region,
         get_json_object(log, '$.member.lang') AS lang,
         get_json_object(log, '$.member.id') AS member_id
  FROM applog
  WHERE service='myservice'
    AND yyyymmdd='20140913'
    AND get_json_object(log, '$.type')='click'
) x
GROUP BY yyyymmdd, hh, campaign_id, region, lang
52. Use case 2:
Prompt reports for Ad service console
Norikra query for prompt reports
SELECT campaign.id AS campaign_id,
       member.region AS region,
       member.lang AS lang,
       COUNT(*) AS click,
       COUNT(DISTINCT member.id) AS uu
FROM myservice.win:time_batch(1 hours)
WHERE type="click"
GROUP BY campaign.id, member.region, member.lang
53. Use case 3:
Realtime access dashboard on Google Platform
Access log visualization
Counting with Norikra (2-step), storage on Google BigQuery
Dashboard on Google Spreadsheet + Apps Script
http://qiita.com/kazunori279/items/6329df57635799405547
https://www.youtube.com/watch?v=EZkw5TDcCGw
54. Use case 3:
Realtime access dashboard on Google Platform
[diagram: nginx writes access logs, collected by Fluentd on each server; access logs go to BigQuery, while Norikra query results go to an aggregate node, where a second Norikra query aggregates them locally]
55. Use case 3:
Realtime access dashboard on Google Platform
[diagram: ~70 nginx servers (120,000 requests/sec, or more!) run Fluentd to store logs; per-host counts and a total count are sent to Google BigQuery and to a Google Spreadsheet + Apps Script dashboard]
56. More queries, more simplicity
and less latency.
Thanks!
photo: by my co-workers
57. See also:
http://norikra.github.io/
“Stream processing and Norikra”
http://www.slideshare.net/tagomoris/stream-processing-and-norikra
“Batch processing and Stream processing by SQL”
http://www.slideshare.net/tagomoris/hcj2014-sql
“Log analysis systems and its designs in LINE Corp 2014 Early”
http://www.slideshare.net/tagomoris/log-analysis-system-and-its-designs-in-line-corp-2014-early
“Norikra in Action”
http://www.slideshare.net/tagomoris/norikra-in-action-ver-2014-spring
58. HA? Distributed?
NO!
I have some ideas, but no time to implement them
There's no real need for HA / distributed processing so far