1. © Rocana, Inc. All Rights Reserved. | 1
Joey Echeverria, Platform Technical Lead
Strata+Hadoop World, March 31st 2016
San Jose, CA
Embeddable data transformation for
real-time streams
2.
Slides
http://j.mp/rocana-transform-slides
3.
Questions
http://j.mp/hw-questions
4.
Context
5.
Joey
• Where I work: Rocana, Platform Technical Lead
• Where I used to work: Cloudera ('11-'15), NSA
• Distributed systems, security, data processing, big data
6.
Signing today at 1pm at the Cloudera booth
7.
History
8.
"Legacy" data architecture
[Diagram; components shown: Data Producers, Flume/Sqoop, HDFS (Avro/Parquet files), MapReduce, Spark, Impala, Visualization/Query]
9.
Stream data architecture
[Diagram; components shown: Data Producers, Kafka (Avro serialized records), Kafka Consumers, Storm, Flink, Spark Streaming, Real-time Visualization, HDFS (Avro/Parquet files)]
11.
Stream processing
A primer
12.
Stream processing
• Filter
• Extract
• Project
• Aggregate
• Join
• Model
14.
Stream processing
• Filter
• Extract
• Project
• Aggregate
• Join
• Model
• Data transformation
15.
Apache Storm
• "Distributed real-time computation system"
• Applications packaged into topologies (think MapReduce job)
• Topologies operate over streams of tuples
• Spout: source of a stream
• Bolt: arbitrary operation such as filtering, aggregating, joining, or executing arbitrary functions
16.
Apache Spark
• Supports batch and stream processing
• Continuous stream of records discretized into a DStream
• DStream: a sequence of RDDs (batches of records)
• Micro-batch
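To make the micro-batch idea concrete, here is a minimal plain-Python sketch. It is not Spark code, and `micro_batches` is a hypothetical helper. Spark Streaming forms batches by time interval; the sketch batches by record count only to stay deterministic, with each yielded list standing in for one RDD of the DStream.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(records: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive lists of up to batch_size records from a stream."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # stream exhausted
        yield batch


stream = ["r1", "r2", "r3", "r4", "r5"]
batches = list(micro_batches(stream, 2))
```

Each batch is then processed as a unit, which is the main contrast with record-at-a-time systems such as Storm.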
17.
Apache Flink
• Supports batch and stream processing
• DataStream: unbounded collection of records
• Operations can apply to individual records or windows of records
• Supports record-at-a-time processing (like Storm)
18.
Apache Kafka
• Pub-sub messaging system implemented as a distributed commit log
• Popular as a source and sink for data streams
• Scalability, durability, and easy-to-understand delivery guarantees
• Can do stream processing directly in Kafka consumers
19.
Data transformation
20.
Filter
filter
21.
Extract
Raw log line:
127.0.0.1 Mozilla/5.0 laura [31/Mar/2016] "GET /index.html HTTP/1.0" 200 2326
Event before extract:
ts: 1436576671000
body: <binary blob>
event_type_id: 100
...
Event after extract:
ts: 1436576671000
body: <binary blob>
event_type_id: 100
attributes: {
  ip: "127.0.0.1"
  user_agent: "Mozilla/5.0"
  user_id: "laura"
  date: "[31/Mar/2016]"
  request: "GET /index.html HTTP/1.0"
  status_code: "200"
  size: "2326"
}
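The extract step can be sketched in plain Python as a regular expression over the event body. This is only an illustration, not Rocana Transform code: the pattern, the `extract` function, and the dict-based event are assumptions, and the field names are taken from the slide. All extracted values stay strings, as on the slide.

```python
import re

# Hypothetical pattern for the sample line; one named group per attribute.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) (?P<user_agent>\S+) (?P<user_id>\S+) '
    r'(?P<date>\[[^\]]+\]) "(?P<request>[^"]*)" '
    r'(?P<status_code>\d{3}) (?P<size>\d+)$'
)


def extract(event: dict) -> dict:
    """Return a copy of the event with attributes parsed from the raw body."""
    line = event["body"].decode("utf-8")
    match = LOG_PATTERN.match(line)
    attributes = match.groupdict() if match else {}
    return {**event, "attributes": attributes}


event = {
    "ts": 1436576671000,
    "body": b'127.0.0.1 Mozilla/5.0 laura [31/Mar/2016] '
            b'"GET /index.html HTTP/1.0" 200 2326',
    "event_type_id": 100,
}
extracted = extract(event)
```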
22.
Project
Event before project:
ts: 1436576671000
body: <binary blob>
event_type_id: 100
attributes: {
  ip: "127.0.0.1"
  user_agent: "Mozilla/5.0"
  user_id: "laura"
  date: "[31/Mar/2016]"
  request: "GET /index.html HTTP/1.0"
  status_code: "200"
  size: "2326"
}
Record after project:
ts: 1459444413000
ip: "127.0.0.1"
user_agent: "Mozilla/5.0"
user_id: "laura"
request: "GET /index.html HTTP/1.0"
status_code: 200
size: 2326
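A projection like the one on this slide can be sketched as follows: keep a subset of fields, lift them to the top level, and convert numeric strings to numbers. The `project` function is hypothetical. The slide does not say how the new `ts` value is derived, so the sketch simply carries the event's `ts` through unchanged.

```python
def project(event: dict) -> dict:
    """Build a flat record from selected attributes, with typed numeric fields."""
    attrs = event["attributes"]
    return {
        "ts": event["ts"],  # carried through; the slide's ts derivation is not specified
        "ip": attrs["ip"],
        "user_agent": attrs["user_agent"],
        "user_id": attrs["user_id"],
        "request": attrs["request"],
        "status_code": int(attrs["status_code"]),  # "200" -> 200
        "size": int(attrs["size"]),                # "2326" -> 2326
    }


event = {
    "ts": 1436576671000,
    "body": b"<binary blob>",
    "event_type_id": 100,
    "attributes": {
        "ip": "127.0.0.1",
        "user_agent": "Mozilla/5.0",
        "user_id": "laura",
        "date": "[31/Mar/2016]",
        "request": "GET /index.html HTTP/1.0",
        "status_code": "200",
        "size": "2326",
    },
}
record = project(event)
```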
23.
Problem
24.
Who
• Developers
• Data engineers
• Sysadmins
• Analysts
25.
Tools
26.
The dark art of data science
• Feature engineering
• "Getting a mess of raw data that can be used as input to a machine learning algorithm" - @josh_wills
• Video from Midwest.io 2014
27.
Data transformation for all
28.
Rocana Transform
• Library
• Java
• Rocana configuration
  • JSON + comments + specific numeric types - excess quoting
29.
Data model
• Event schema
  • id: A globally unique identifier for this event
  • ts: Epoch timestamp in milliseconds
  • event_type_id: ID indicating the type of the event
  • location: Location from which the event was generated
  • host: Hostname, IP, or other device identifier from which the event was generated
  • service: Service or process from which the event was generated
  • body: Raw event content in bytes
  • attributes: Event type-specific key/value pairs
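As a rough illustration, the event schema could be mirrored by a Python dataclass. The field names come from the slide; the Python types and defaults are assumptions (the real schema is Avro, and the examples in this deck show attribute values as strings).

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    id: str             # globally unique identifier
    ts: int             # epoch timestamp in milliseconds
    event_type_id: int  # type of the event
    location: str       # where the event was generated
    host: str           # hostname, IP, or other device identifier
    service: str        # service or process that generated the event
    body: bytes = b""   # raw event content
    attributes: Dict[str, str] = field(default_factory=dict)  # type-specific pairs


event = Event(
    id="example-id",  # placeholder value, not a real event id
    ts=1436576671000,
    event_type_id=100,
    location="aws/us-west-2a",
    host="example01.rocana.com",
    service="dhclient",
)
event.attributes["syslog_pid"] = "865"
```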
30.
Example event
{
  "id": "JRHAIDMLCKLEAPMIQDHFLO3MXYXV7NVBEJNDKZGS2XVSEINGGBHA====",
  "event_type_id": 100,
  "ts": 1436576671000,
  "location": "aws/us-west-2a",
  "host": "example01.rocana.com",
  "service": "dhclient",
  "body": "<36>Jul 10 18:04:31 gs09.example.com dhclient[865] DHCPACK from …",
  "attributes": {
    "syslog_timestamp": "1436576671000",
    "syslog_process": "dhclient",
    "syslog_pid": "865",
    "syslog_facility": "3",
    "syslog_severity": "6",
    "syslog_hostname": "example01",
    "syslog_message": "DHCPACK from 10.10.1.1 (xid=0x5c64bdb0)"
  }
}
31.
Filter, extract, and flatten
32.
Filter, extract, and flatten
• Filter out events without type id 100
• Filter out events without hostname prefix "ex"
• Extract a numeric prefix from the syslog message
• Flatten syslog attributes to top-level fields in a different Avro schema
33.
Filter, extract, and flatten
{
  load-event: {},
  // Filter by event_type_id
  filter: { expression: "${event_type_id == 100}" },
  // Extract hostname prefix
  regex: { ... },
  filter: { expression: "${host_prefix.match.group.1 == 'ex'}" },
  // Extract a numeric prefix from the syslog message
  regex: { ... },
  // Build flattened record
  build-avro-record: { ... },
  // Accumulate output record
  accumulate-output: {
    value: "${output_record}"
  }
}
34.
Extract hostname prefix
{
  load-event: {},
  filter: { expression: "${event_type_id == 100}" },
  regex: {
    pattern: "^(.{2}).*$",
    value: "${attr.syslog_hostname}",
    destination: "host_prefix"
  },
  filter: { expression: "${host_prefix.match.group.1 == 'ex'}" },
  ...
}
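To see what the pattern on this slide captures, the same regular expression can be tried in Python. The helper names `host_prefix` and `keep_event` are hypothetical; only the pattern and the `'ex'` comparison come from the slide.

```python
import re
from typing import Optional

HOST_PREFIX = re.compile(r"^(.{2}).*$")  # group 1 = first two characters


def host_prefix(hostname: str) -> Optional[str]:
    """Return the two-character prefix, or None if the hostname is too short."""
    match = HOST_PREFIX.match(hostname)
    return match.group(1) if match else None


def keep_event(hostname: str) -> bool:
    """Mirror the filter step: keep only hostnames whose prefix is 'ex'."""
    return host_prefix(hostname) == "ex"


prefix = host_prefix("example01")
```

A hostname shorter than two characters does not match at all, so such events would also be filtered out.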
35.
Extract numeric prefix
...
filter: { expression: "${host_prefix.match.group.1 == 'ex'}" },
regex: {
  pattern: "^([0-9]*)",
  value: "${attributes['syslog_message']}",
  destination: "msg",
  match-actions: {
    set-values: { extracted_field: "${msg.match.group.1}" }
  },
  no-match-actions: {
    set-values: { extracted_field: "" }
  }
},
...
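The numeric-prefix pattern can be explored the same way in Python. This is a sketch under an assumption: the slides do not say whether the regex action matches a prefix of the value or the whole value, and the sketch uses Python's prefix matching (`re.match`). The function name is hypothetical.

```python
import re

NUMERIC_PREFIX = re.compile(r"^([0-9]*)")


def extract_numeric_prefix(message: str) -> str:
    """Return the leading run of digits, or an empty string if there is none."""
    match = NUMERIC_PREFIX.match(message)
    # With prefix matching, [0-9]* can match zero digits, so the pattern
    # always matches and the fallback below is never reached.
    return match.group(1) if match else ""


with_digits = extract_numeric_prefix("404 not found")
without_digits = extract_numeric_prefix("DHCPACK from 10.10.1.1 (xid=0x5c64bdb0)")
```

Under these semantics the match and no-match branches produce the same result for a message without leading digits: an empty `extracted_field`.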
36.
Build flattened record
...
build-avro-record: {
  schema-uri: "resource:avro-schemas/flattened-syslog.avsc",
  destination: "output_record",
  field-mapping: {
    ts: "${ts}",
    event_type_id: "${event_type_id}",
    source: "${source}",
    syslog_facility: "${convert:toInt(attributes['syslog_facility'])}",
    syslog_severity: "${convert:toInt(attributes['syslog_severity'])}",
    ...
    syslog_message: "${attributes['syslog_message']}",
    syslog_pid: "${convert:toInt(attributes['syslog_pid'])}",
    extracted_field: "${extracted_field}"
  },
},
...
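The `convert:toInt` calls matter because attribute values are strings while the target schema has typed fields. The sketch below illustrates that point with a plain dict standing in for the Avro schema. The schema contents, `build_record`, and the field types are all assumptions, since the actual flattened-syslog.avsc is not shown in the slides.

```python
# Hypothetical stand-in for the Avro schema: field name -> required Python type.
FLATTENED_SYSLOG_SCHEMA = {
    "ts": int,
    "event_type_id": int,
    "syslog_facility": int,
    "syslog_severity": int,
    "syslog_pid": int,
    "syslog_message": str,
    "extracted_field": str,
}


def build_record(field_mapping: dict, schema: dict) -> dict:
    """Copy mapped values into a new record, rejecting missing or mistyped fields."""
    record = {}
    for name, expected_type in schema.items():
        if name not in field_mapping:
            raise KeyError(f"missing field: {name}")
        value = field_mapping[name]
        if not isinstance(value, expected_type):
            raise TypeError(f"{name}: expected {expected_type.__name__}")
        record[name] = value
    return record


attributes = {
    "syslog_facility": "3",
    "syslog_severity": "6",
    "syslog_pid": "865",
    "syslog_message": "DHCPACK from 10.10.1.1 (xid=0x5c64bdb0)",
}
record = build_record(
    {
        "ts": 1436576671000,
        "event_type_id": 100,
        "syslog_facility": int(attributes["syslog_facility"]),
        "syslog_severity": int(attributes["syslog_severity"]),
        "syslog_pid": int(attributes["syslog_pid"]),
        "syslog_message": attributes["syslog_message"],
        "extracted_field": "",
    },
    FLATTENED_SYSLOG_SCHEMA,
)
```

Passing `attributes["syslog_pid"]` without the conversion would raise a `TypeError` here, which is the kind of mismatch the explicit conversions avoid.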
37.
Extract metrics from log data
38.
Extract metrics
• Input: HTTP status logs
• Extract request latency
• Extract counts by HTTP status code
• Metric types
  • Gauge: A value that varies over time (think latency, CPU %, etc.)
  • Counter: A value that accumulates over time (think event volume, status codes, etc.)
39.
Example metric event
{
  "id": "JRHAIDMLCKLEAPMIQDHFLO3MXBBQ7NVBEJNDKZGS2XVSEINGGBHA====",
  "event_type_id": 107,
  "ts": 1436576671000,
  "location": "aws/us-west-2a",
  "host": "web01.rocana.com",
  "service": "httpd",
  "attributes": {
    "m.http.request.latency": "4.2000000000E1|g",
    "m.http.status.401.count": "1.0000000000E0|c"
  }
}
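The metric attributes in this example appear to follow a `m.<name>` key with a `<number>|<type>` value, where `g` and `c` presumably stand for gauge and counter. That reading is inferred from the example, not from a documented format. Under that assumption, a decoder could look like this (`parse_metrics` is a hypothetical helper):

```python
from typing import Dict, Tuple

# Assumed meaning of the type suffixes, inferred from the example event.
METRIC_KINDS = {"g": "gauge", "c": "counter"}


def parse_metrics(attributes: Dict[str, str]) -> Dict[str, Tuple[float, str]]:
    """Decode 'm.'-prefixed attributes into {metric name: (value, kind)}."""
    metrics = {}
    for key, raw in attributes.items():
        if not key.startswith("m."):
            continue  # not a metric attribute
        number, _, kind = raw.rpartition("|")
        metrics[key[2:]] = (float(number), METRIC_KINDS.get(kind, kind))
    return metrics


metrics = parse_metrics({
    "m.http.request.latency": "4.2000000000E1|g",
    "m.http.status.401.count": "1.0000000000E0|c",
})
```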
40.
Extract metrics
{
  load-event: {},
  build-metric: {
    gauge-mapping: {
      http.request.latency: "${convert:toDouble(attributes['latency'])}"
    },
    destination: "latency_metric"
  },
  accumulate-output: { value: "${latency_metric}" },
  build-metric: {
    dynamic-counter-mapping: [
      "${string:format('http.status.%s.count', attributes['sc_status'])}", 1D
    ],
    destination: "status_metric"
  },
  accumulate-output: { value: "${status_metric}" }
}
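In plain Python, the two `build-metric` steps amount to roughly the following. This is a sketch only: `build_metrics` and the dict-based metric representation are assumptions, and `1D` is read here as a double literal with value 1.

```python
from typing import Dict, List, Tuple

Metric = Dict[str, Tuple[str, float]]  # metric name -> (kind, value)


def build_metrics(attributes: Dict[str, str]) -> List[Metric]:
    """Build one gauge and one dynamically named counter from log attributes."""
    # Gauge: the request latency, converted from string to float.
    latency_metric = {"http.request.latency": ("gauge", float(attributes["latency"]))}
    # Counter: the name is formatted from the HTTP status code; each event counts as 1.
    counter_name = "http.status.%s.count" % attributes["sc_status"]
    status_metric = {counter_name: ("counter", 1.0)}
    # Both metrics are accumulated as separate outputs, as in the configuration.
    return [latency_metric, status_metric]


outputs = build_metrics({"latency": "42", "sc_status": "401"})
```

The dynamic name means no status code has to be listed in advance; a new code simply produces a new counter.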
41.
Architecture
42.
Architecture
[Diagram; components shown: Configuration file, Driver, Java action objects, Context (Variables)]
1. Parse config
2. Initialize context
3. Execute actions
4. Read/write variables
5. Copy output
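The five steps can be sketched as a small driver loop. This is an illustration in Python rather than the library's Java implementation. The action keywords, their parameters, and the convention that an action returns False to drop the event are all assumptions made for the sketch.

```python
from typing import Any, Callable, Dict, List, Tuple

Context = Dict[str, Any]
Action = Callable[[Context], bool]  # returns False to stop processing the event


def make_filter(params: dict) -> Action:
    field, expected = params["field"], params["equals"]
    return lambda context: context.get(field) == expected


def make_set_value(params: dict) -> Action:
    name, value = params["name"], params["value"]

    def action(context: Context) -> bool:
        context[name] = value
        return True

    return action


def make_accumulate_output(params: dict) -> Action:
    source = params["value"]

    def action(context: Context) -> bool:
        context["outputs"].append(context[source])
        return True

    return action


FACTORIES = {
    "filter": make_filter,
    "set-value": make_set_value,
    "accumulate-output": make_accumulate_output,
}


def parse_config(config: List[Tuple[str, dict]]) -> List[Action]:
    # 1. Parse config: turn each (keyword, params) pair into an action object.
    return [FACTORIES[keyword](params) for keyword, params in config]


def run(actions: List[Action], event: dict) -> list:
    context: Context = dict(event)    # 2. Initialize context with the event's fields.
    context["outputs"] = []
    for action in actions:            # 3. Execute actions in order.
        if not action(context):       # 4. Actions read/write context variables.
            break
    return list(context["outputs"])   # 5. Copy output back to the caller.


actions = parse_config([
    ("filter", {"field": "event_type_id", "equals": 100}),
    ("set-value", {"name": "output_record", "value": "flattened"}),
    ("accumulate-output", {"value": "output_record"}),
])
kept = run(actions, {"event_type_id": 100})
dropped = run(actions, {"event_type_id": 7})
```

Because the driver only needs a parsed action list and one event at a time, the same loop could be called from a Storm bolt, a Spark function, or a Kafka consumer.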
43.
Custom actions
• Actions loaded at runtime using Java services framework
• Add your jar to the classpath
• Custom actions appear as top-level keywords just like regular actions
• Implement the execute() method of the Action interface
• Implement the build() method of the ActionBuilder interface
44.
Custom actions
• Parse custom log formats
  • Cisco ACS
  • Citrix
  • Juniper
  • Customer-specific formats
• Lookup IP addresses in the MaxMind GeoIP2 database
• Reference dataset lookups
  • Device id to device name
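The builder/action split can be illustrated with a device-id lookup. The sketch is in Python and uses an explicit registry in place of the Java services framework (presumably `java.util.ServiceLoader`). The real `Action` and `ActionBuilder` signatures are not shown in the slides, so the method parameters, the `keyword` attribute, and the `device-name-lookup` action are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict


class Action(ABC):
    @abstractmethod
    def execute(self, context: dict) -> None:
        """Read and write variables in the shared context."""


class ActionBuilder(ABC):
    keyword: str  # top-level configuration keyword for this action

    @abstractmethod
    def build(self, params: dict) -> Action:
        """Create a configured action from its configuration block."""


REGISTRY: Dict[str, ActionBuilder] = {}


def register(builder_cls):
    """Stand-in for runtime discovery: index a builder by its keyword."""
    REGISTRY[builder_cls.keyword] = builder_cls()
    return builder_cls


class DeviceNameLookup(Action):
    def __init__(self, table: Dict[str, str], source: str, destination: str):
        self.table, self.source, self.destination = table, source, destination

    def execute(self, context: dict) -> None:
        device_id = context.get(self.source)
        context[self.destination] = self.table.get(device_id, "unknown")


@register
class DeviceNameLookupBuilder(ActionBuilder):
    keyword = "device-name-lookup"

    def build(self, params: dict) -> Action:
        return DeviceNameLookup(params["table"], params["source"], params["destination"])


action = REGISTRY["device-name-lookup"].build({
    "table": {"d-17": "edge-router"},
    "source": "device_id",
    "destination": "device_name",
})
context = {"device_id": "d-17"}
action.execute(context)
```

The driver only ever looks builders up by keyword, which is why a custom action can sit in a configuration file next to the built-in ones.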
45.
Putting it all together
• Stream processing is causing us to re-think how we analyze data
• Limiting access to the data transformation side increases costs and decreases velocity
• Reduce your reliance on developers to code custom pipelines
• Re-use transformation configuration in any stream processing framework or batch job
46.
Coming soon
• Rocana Transform will be released under the ASL 2.0
• The base configuration library is available today:
  • https://github.com/scalingdata/rocana-configuration
47.
Questions?
• Signing "Hadoop Security" today at 1pm at the Cloudera booth