The document discusses stream processing using SQL with Norikra. It begins with an overview of stream-processing basics and of using SQL to query streaming data. It then details Norikra, open-source software for schema-less stream processing with SQL. Key features of Norikra include running on JRuby, supporting complex event types, and accepting SQL queries on streaming data without restarts for schema changes. Examples of Norikra queries are also shown.
8. Data Flow And Latency
[diagram: batch processing runs query execution over a stored data window; stream processing runs incremental query execution as data arrives]
9. Query For Stored Data
[diagram: a table of stored rows (v1,v2,v3,v4,v5,v6), queried repeatedly]
At first, all data MUST be stored.

SELECT v1, v2, COUNT(*)
FROM table
WHERE v3='x' GROUP BY v1, v2

SELECT v4, COUNT(*)
FROM table
WHERE v1 AND v2 GROUP BY v4

"All data" means "data that will not be used".
13. Query For Stream Data
[diagram: events (v1,v2,v3,v4,v5,v6) arrive on a stream and pass through each query's data window]

SELECT v1, v2, COUNT(*)
FROM table.win:xxx
WHERE v3='x' GROUP BY v1, v2

SELECT v4, COUNT(*)
FROM table.win:xxx
WHERE v1 AND v2 GROUP BY v4

All data will be discarded right after insertion.
(Bye-bye, storage system maintenance!)
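The discard-after-insertion behavior can be sketched as a fixed-size data window. This is a minimal Python illustration of the idea, not Norikra's implementation; the `LengthWindow` name and the `v3` field are taken from the slides' examples:

```python
from collections import deque

class LengthWindow:
    """A toy length-based data window: keeps only the last `size` events.

    Older events are discarded automatically as new ones arrive,
    so nothing needs to be stored long-term.
    """
    def __init__(self, size):
        self.events = deque(maxlen=size)

    def insert(self, event):
        self.events.append(event)

    def count_where(self, pred):
        return sum(1 for e in self.events if pred(e))

win = LengthWindow(size=3)
for v3 in ["x", "y", "x", "x", "x"]:
    win.insert({"v3": v3})

# Only the last 3 events remain; earlier ones were discarded.
print(win.count_where(lambda e: e["v3"] == "x"))  # -> 3
```

Time-based windows work the same way, except events expire by timestamp instead of by count.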
17. Incremental Calculation

SELECT v1, v2, COUNT(*)
FROM table.win:xxx
WHERE v3='x' GROUP BY v1, v2

[diagram: each event arriving on the stream updates a small in-memory table of per-group counts]

internal data (memory):

v1      v2      COUNT
TRUE    TRUE    1
TRUE    FALSE   2
FALSE   TRUE    37
FALSE   FALSE   3

This internal data is small enough to keep in memory.
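The incremental calculation above can be sketched in a few lines: each incoming event updates only a small per-group counter in memory. This is a simplification of what the engine does internally; the field names follow the slides:

```python
from collections import defaultdict

# internal data (memory): per-(v1, v2) counts for events matching v3 == "x"
counts = defaultdict(int)

def on_event(event):
    """Incrementally update the in-memory table; nothing else is stored."""
    if event.get("v3") == "x":
        counts[(event["v1"], event["v2"])] += 1

stream = [
    {"v1": True,  "v2": True,  "v3": "x"},
    {"v1": False, "v2": True,  "v3": "x"},
    {"v1": True,  "v2": True,  "v3": "y"},  # filtered out by WHERE v3='x'
    {"v1": False, "v2": True,  "v3": "x"},
]
for ev in stream:
    on_event(ev)

print(dict(counts))  # -> {(True, True): 1, (False, True): 2}
```

The memory needed is proportional to the number of distinct groups, not to the number of events.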
21. Data Window
Target time (or size) range of queries

Batch: a FROM-TO range in the WHERE clause:
WHERE dt >= '2014-09-13 13:30:00'
AND dt < '2014-09-13 14:20:00'

Stream: "calculate this query every 50 minutes"
Extended SQL is required:
SELECT v1, v2, COUNT(*)
FROM table.win:xxx
WHERE v3='x' GROUP BY v1, v2
22. Stream Processing With SQL
Esper: a Java library for stream processing
you must implement your own Java daemon code
schemas are required for data and queries
OSS under GPLv2
http://esper.codehaus.org/
23. Esper EPL
Select height and weight
for all events with age greater than 30
SELECT height, weight
FROM tbl
WHERE age > 30
24. Esper EPL
Count records grouped by height
for events with age greater than 30
SELECT height, COUNT(*) AS c
FROM tbl
WHERE age > 30
GROUP BY height
Without a data window, this query
never produces results
25. Esper EPL
Count records grouped by height
for events with age greater than 30,
every 1 hour
SELECT height, COUNT(*) AS c
FROM tbl.win:time_batch(1 hour)
WHERE age > 30
GROUP BY height
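A time_batch window collects events for a fixed interval, emits the aggregate once, then starts over. Here is a minimal Python sketch of that behavior, with simulated timestamps rather than the engine's own clock (the function name and event tuples are made up for illustration):

```python
from collections import Counter

def time_batch(events, interval):
    """Group (timestamp, height) events into fixed `interval` batches and
    emit a per-height count at the end of each batch, roughly as
    tbl.win:time_batch(...) with GROUP BY height would."""
    batch, batch_end = Counter(), interval
    for ts, height in events:
        while ts >= batch_end:          # batch boundary passed: emit and reset
            yield batch_end, dict(batch)
            batch, batch_end = Counter(), batch_end + interval
        batch[height] += 1
    yield batch_end, dict(batch)        # final (possibly partial) batch

events = [(5, 170), (20, 170), (40, 165), (70, 170)]  # (seconds, height)
for end, counts in time_batch(events, interval=60):
    print(end, counts)
# -> 60 {170: 2, 165: 1}
#    120 {170: 1}
```

Note the contrast with the windowless query on the previous slide: the window boundary is what triggers output.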
26. With/without Schema
Schema-full data:
strict schema: predefined fields w/ types (or reject)
schema-on-read: try to read known fields (or ignore)
Schema-less data:
any field (or ignore), any type (implicit/explicit conversion)
a good fit for services under development:
i.e. all internet services, including ours!
27. Stream Processing & Schema
Queries come first, data second, in all stream processing.
Queries automatically know which fields to query.
[diagram ("TO BE"): a schema-less (mixed) data stream carries events from an API endpoint, events from a billing service, and events of service X; query A and query B each read only their own subset of fields]
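The "queries first, data second" idea can be illustrated by letting each query declare the fields it needs and pulling only those from a mixed, schema-less stream. This is a conceptual sketch, not Norikra's internals; the field lists and helper name are made up:

```python
# Each query names only the fields it cares about; events in the mixed
# stream may carry any fields at all.
QUERY_A_FIELDS = ("name", "age")          # e.g. SELECT name, age ...
QUERY_B_FIELDS = ("corp", "current")      # e.g. SELECT corp, current ...

def project(event, fields):
    """Take the subset of fields a query needs; skip events missing any."""
    if all(f in event for f in fields):
        return {f: event[f] for f in fields}
    return None  # this event does not feed this query

stream = [
    {"name": "tagomoris", "age": 34, "corp": "LINE", "current": "Taipei"},
    {"name": "hadoop", "corp": "ASF"},   # no "age": ignored by query A
]

rows_a = [r for e in stream if (r := project(e, QUERY_A_FIELDS))]
rows_b = [r for e in stream if (r := project(e, QUERY_B_FIELDS))]
print(rows_a)  # -> [{'name': 'tagomoris', 'age': 34}]
print(rows_b)  # -> [{'corp': 'LINE', 'current': 'Taipei'}]
```

No schema is declared anywhere; the queries themselves define which fields matter.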
31. Norikra:
Schema-less Stream Processing with SQL
Server software, runs on JVM
Open source software (GPLv2)
http://norikra.github.io/
https://github.com/norikra/norikra
32. Norikra:
Schema-less event stream:
Add/Remove data fields whenever you want
SQL:
No more restarts to add/remove queries
w/ JOINs, w/ SubQueries
w/ UDF (in Java/Ruby from rubygem)
Truly Complex events:
Nested Hash/Array, accessible directly from SQL
HTTP RPC w/ JSON or MessagePack (fluentd plugin available!)
33. How To Setup Norikra:
Install JRuby:
download jruby.tar.gz, extract it, and add it to $PATH
or use rbenv:
rbenv install jruby-1.7.xx
rbenv shell jruby-..
Install Norikra:
gem install norikra
Start the Norikra server:
norikra start
34. Norikra Interface:
Command line: norikra-client
norikra-client target open ...
norikra-client query add ...
tail -f ... | norikra-client event send ...
WebUI
show status
show/add/remove queries
HTTP API
JSON, MessagePack
36. Norikra Queries: (1)
{"name":"tagomoris",
"age":34, "address":"Tokyo",
"corp":"LINE", "current":"Taipei"}
SELECT name, age
FROM events
{"name":"tagomoris","age":34}
37. Norikra Queries: (1)
{"name":"tagomoris",
"address":"Tokyo",
"corp":"LINE", "current":"Taipei"}
without "age"
SELECT name, age
FROM events
nothing
38. Norikra Queries: (2)
{"name":"tagomoris",
"age":34, "address":"Tokyo",
"corp":"LINE", "current":"Taipei"}
SELECT name, age
FROM events
WHERE current="Taipei"
{"name":"tagomoris","age":34}
39. Norikra Queries: (2)
{"name":"hadoop",
"age":99, "address":"Somewhere",
"corp":"ASF", "current":"Elsewhere"}
SELECT name, age
FROM events
WHERE current="Taipei"
nothing
40. Norikra Queries: (3)
SELECT age, COUNT(*) as cnt
FROM events.win:time_batch(5 mins)
GROUP BY age
41. Norikra Queries: (3)
{"name":"tagomoris",
"age":34, "address":"Tokyo",
"corp":"LINE", "current":"Taipei"}
SELECT age, COUNT(*) as cnt
FROM events.win:time_batch(5 mins)
GROUP BY age
every 5 mins
{"age":34,"cnt":3}, {"age":33,"cnt":1}, ...
42. Norikra Queries: (4)
{"name":"tagomoris",
"age":34, "address":"Tokyo",
"corp":"LINE", "current":"Taipei"}
SELECT age, COUNT(*) as cnt
FROM
events.win:time_batch(5 mins)
GROUP BY age
SELECT max(age) as max
FROM
events.win:time_batch(5 mins)
{"age":34,"cnt":3}, {"age":33,"cnt":1}, ...
{"max":51}
every 5 mins
43. Norikra Queries: (5)
{"name":"tagomoris",
"user":{"age":34, "corp":"LINE",
"address":"Tokyo"},
"current":"Taipei",
"speaker":true,
"attend":[true,true,false, ...]
}
SELECT age, COUNT(*) as cnt
FROM events.win:time_batch(5 mins)
GROUP BY age
44. Norikra Queries: (5)
{"name":"tagomoris",
"user":{"age":34, "corp":"LINE",
"address":"Tokyo"},
"current":"Taipei",
"speaker":true,
"attend":[true,true,false, ...]
}
SELECT user.age, COUNT(*) as cnt
FROM events.win:time_batch(5 mins)
GROUP BY user.age
45. Norikra Queries: (5)
{"name":"tagomoris",
"user":{"age":34, "corp":"LINE",
"address":"Tokyo"},
"current":"Taipei",
"speaker":true,
"attend":[true,true,false, ...]
}
SELECT user.age, COUNT(*) as cnt
FROM events.win:time_batch(5 mins)
WHERE current="Taipei"
AND attend.$0 AND attend.$1
GROUP BY user.age
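Nested-field references such as user.age and attend.$0 can be read as path lookups into the event's hashes and arrays. A sketch of that resolution in Python (the $N array syntax follows the slides; the `resolve` helper is made up for illustration, not Norikra's API):

```python
def resolve(event, path):
    """Resolve a Norikra-style field path: dots descend into hashes,
    $N indexes into arrays (e.g. "user.age", "attend.$0")."""
    node = event
    for part in path.split("."):
        if part.startswith("$"):
            node = node[int(part[1:])]   # array index, e.g. $0 -> [0]
        else:
            node = node[part]            # hash key
    return node

event = {
    "name": "tagomoris",
    "user": {"age": 34, "corp": "LINE", "address": "Tokyo"},
    "current": "Taipei",
    "speaker": True,
    "attend": [True, True, False],
}
print(resolve(event, "user.age"))    # -> 34
print(resolve(event, "attend.$0"))   # -> True
```

This is why truly complex (nested) events remain directly queryable from SQL without flattening them first.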
47. Use case 1:
External API call reports for partners (LINE)
External API call for LINE Business Connect
LINE backend sends requests to partner’s API
endpoint using users’ messages
http://developers.linecorp.com/blog/?p=3386
48. Use case 1:
External API call reports for partners (LINE)
API error response summaries
http://developers.linecorp.com/blog/?p=3386
49. Use case 1:
External API call reports for partners (LINE)
[diagram: the channel gateway handles requests to the partner's server; its logs feed a Norikra query, and the query results go to MySQL and mail]
SELECT channelId AS channel_id, reason, detail,
       count(*) AS error_count,
       min(timestamp) AS first_timestamp,
       max(timestamp) AS last_timestamp
FROM api_error_log.win:time_batch(60 sec)
GROUP BY channelId, reason, detail
HAVING count(*) > 0
http://developers.linecorp.com/blog/?p=3386
50. Use case 2:
Prompt reports for Ad service console
Prompt reports with Norikra + Fixed reports with Hive
[diagram: app servers send impression logs through Fluentd into HDFS; the console service executes a Hive query (daily) and fetches query results (frequently)]
51. Use case 2:
Prompt reports for Ad service console
Hive query for fixed reports
SELECT yyyymmdd, hh, campaign_id, region, lang,
       COUNT(*) AS click,
       COUNT(DISTINCT member_id) AS uu
FROM (
  SELECT yyyymmdd, hh,
         get_json_object(log, '$.campaign.id') AS campaign_id,
         get_json_object(log, '$.member.region') AS region,
         get_json_object(log, '$.member.lang') AS lang,
         get_json_object(log, '$.member.id') AS member_id
  FROM applog
  WHERE service='myservice'
    AND yyyymmdd='20140913'
    AND get_json_object(log, '$.type')='click'
) x
GROUP BY yyyymmdd, hh, campaign_id, region, lang
52. Use case 2:
Prompt reports for Ad service console
Norikra query for prompt reports
SELECT campaign.id AS campaign_id,
       member.region AS region,
       member.lang AS lang,
       COUNT(*) AS click,
       COUNT(DISTINCT member.id) AS uu
FROM myservice.win:time_batch(1 hours)
WHERE type="click"
GROUP BY campaign.id, member.region, member.lang
53. Use case 3:
Realtime access dashboard on Google Platform
Access log visualization
Counting with Norikra (2-step), storage on Google BigQuery
Dashboard on Google Spreadsheet + Apps Script
http://qiita.com/kazunori279/items/6329df57635799405547
https://www.youtube.com/watch?v=EZkw5TDcCGw
54. Use case 3:
Realtime access dashboard on Google Platform
[diagram: nginx writes access logs, collected by Fluentd on each server; access logs go to BigQuery, while Norikra query results go to an aggregate node, where a second Norikra query aggregates them locally]
55. Use case 3:
Realtime access dashboard on Google Platform
[diagram: ~70 nginx servers (120,000 requests/sec, or more!) run Fluentd to store logs; per-host counts and a total count are sent to Google BigQuery and to a Google Spreadsheet + Apps Script dashboard]
56. More queries, more simplicity
and less latency.
Thanks!
photo: by my co-workers
57. See also:
http://norikra.github.io/
“Stream processing and Norikra”
http://www.slideshare.net/tagomoris/stream-processing-and-norikra
“Batch processing and Stream processing by SQL”
http://www.slideshare.net/tagomoris/hcj2014-sql
“Log analysis systems and its designs in LINE Corp 2014 Early”
http://www.slideshare.net/tagomoris/log-analysis-system-and-its-designs-in-line-corp-2014-early
“Norikra in Action”
http://www.slideshare.net/tagomoris/norikra-in-action-ver-2014-spring
58. HA? Distributed?
NO!
I have some ideas, but no time to implement them
There's no real need for HA / distributed processing so far