Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

1
Using
Morphlines
for
on-‐the-‐ﬂy
ETL

Wolfgang
Hoschek
(@whoschek)

SF
Data
Engineering
Meetup
July
2013

Agenda

•  Big
Data,
ETL
and
Search
–
seMng
the
stage

•  Cloudera
Morphlines
Architecture

•  Component
Deep
Dive

•  Cloudera
Search
Use
Cases

•  What’s
next?

Feel
free
to
ask
quesUons
as
we
go!

Example
ETL
Use
Case:

Distributed
Search
on
Hadoop

Flume

Hue
UI

Custom

UI

Custom

App

Solr

Solr

Solr

SolrCloud

query

query

query

Index

(ETL)

Hadoop
Cluster

MR

HDFS

Index

(ETL)

HBase

Index

(ETL)

Cloudera
Morphlines
Architecture

Solr

Solr

Solr

SolrCloud

Logs,
tweets,
social

media,
html,

images,
pdf,
text….

Anything
you
want

to
index

Flume,
MR
Indexer,
HBase
indexer,
etc...

Or
your
applicaUon!

Morphline
Library

Morphlines
can
be
embedded
in
any
applicaUon…

Your
App!

Cloudera
Morphlines

•  Open
Source
framework
for
simple
ETL

•  Consume
any
kind
of
data
from
any
kind
of
data
source,
process
and

load
into
any
app
or
storage
system

•  Designed
for
Near
Real
Time
apps
&
Batch
apps

•  Ships
as
part
Cloudera
Developer
Kit
(CDK)
and
Cloudera
Search

•  It’s
a
Java
library

•  ASL
licensed
on
github
hbps://github.com/cloudera/cdk

•  Similar
to
Unix
pipelines,
but
more
convenient
&
efficient

•  ConfiguraUon
over
coding
(reduce
Ume
&
skills)

•  Supports
common
file
formats

•  Log
Files
&
Text

•  Avro,
Sequence
file

•  JSON,
HTML
&
XML

•  Etc…
(pluggable)

•  Extensible
set
of
transformaUon
commands

ExtracUon,
TransformaUon
and
Loading

•  Chain
of
pipelined

commands

•  Simple
and
ﬂexible
data

mapping
&
transformaUon

•  Reusable
across
mulUple

index
workloads

•  Over
Ume,
extend
and
re-‐
use
across
plagorm

workloads

syslog
Flume

Agent

Solr
sink

Command:
readLine

Command:
grok

Command:
loadSolr

Solr

Event

Record

Record

Record

Document

Morphline
Library

Like
a
Unix
Pipeline

•  Like
Unix
pipelines
where
the
data
model
is

generalized
to
work
with
streams
of
generic
records,

including
arbitrary
binary
payloads

•  Designed
to
be
embedded
into
Hadoop
components

such
as
Search,
Flume,
MapReduce,
Pig,
Hive,
Sqoop

Stdlib
+
plugins

•  Framework
ships
with
a
set
of
frequently
used
high

level
transformaUon
and
I/O
commands
that
can
be

combined
in
applicaUon
speciﬁc
ways

•  The
plugin
system
allows
the
adding
of
new

transformaUons
and
I/O
commands
and
integrates

exisUng
funcUonality
and
third
party
systems
in
a

straighgorward
manne

Flexible
Data
Model

•  A
record
is
a
set
of
named
fields
where
each
field
has

an
ordered
list
of
one
or
more
Java
Objects
(i.e.

Guava’s
ArrayListMulUmap)

•  Field
can
have
mulUple
values
and
any
two
records

need
not
use
common
field
names

•  Corresponds
exactly
to
Solr/Lucene
data
model

•  Pass
not
only
structured
data,
but
also
arbitrary

binary
data

Passing
Binary
Data

•  _abachment_body
field
(opUonal)

•  java.io.InputStream
or
Java
byte[]

•  opUonal
fields
assist
w/
detecUng
&
parsing
data
type

•  _abachment_mimetype
field

•  e.g.
"applicaUon/pdf"

•  _abachment_charset
field

•  e.g.
"UTF-‐8"

•  _abachment_name
field

•  e.g.
"cars.pdf”

•  Conceptually
similar
to
email
and
HTTP
headers/body

Processing
Model

•  Morphline
commands
manipulate
conUnuous
or

arbitrarily
large
streams
of
records

•  A
command
transforms
a
record
into
zero
or
more

records

•  The
output
records
of
a
command
are
passed
to
the

next
command
in
the
chain

•  A
command
can
contain
nested
commands

•  A
morphline
is
a
tree
of
commands,
essenUally
a

push-‐based
data
ﬂow
engine

Processing
Model
Non-‐Goals

•  Designed
to
embedded
into
mulUple
host
systems,
thus…

•  No
noUon
of
persistence
or
durability
or
distributed

compuUng
or
node
failover

•  Basically
just
a
chain
of
in-‐memory
transformaUons
in
the

current
thread

•  No
need
to
manage
mulUple
nodes
or
threads
-‐

already

covered
by
host
systems
such
as
MapReduce,
Flume,

Storm,
etc.

•  However,
a
morphline
does
support
passing
noUﬁcaUons

•  E.g.
BEGIN_TRANSACTION,
COMMIT_TRANSACTION,

ROLLBACK_TRANSACTION,
SHUTDOWN

Performance
and
Scaling

•  The
runUme
compiles
morphline
on
the
ﬂy

•  The
runUme
processes
all
commands
of
a
given

morphline
in
the
same
thread

•  For
scalability,
deploy
many
morphline
instances
on
a

cluster
in
many
Flume
agents
and
MapReduce
tasks

Syntax

•  HOCON
format
(Human-‐OpUmized
Config
Object

NotaUon)

•  Basically
JSON
slightly
adjusted
for
the
configuraUon

file
use
case

•  Came
out
of
typesafe.com

•  Also
used
by
Akka
and
Play
frameworks

Example:
Indexing
log4j
w/
stacktraces

juil. 25, 2012 10:49:40 AM hudson.triggers.SafeTimerTask run ok
juil. 25, 2012 10:49:46 AM hudson.triggers.SafeTimerTask run failed
com.amazonaws.AmazonClientException: Unable to calculate a request signature
at com.amazonaws.auth.AbstractAWSSigner.signAndBase64Encode(AbstractAWSSigner.java:71)
at java.util.TimerThread.run(Timer.java:505)
Caused by: com.amazonaws.AmazonClientException: Unable to calculate a request signature
at com.amazonaws.auth.AbstractAWSSigner.sign(AbstractAWSSigner.java:90)
at com.amazonaws.auth.AbstractAWSSigner.signAndBase64Encode(AbstractAWSSigner.java:68)
... 14 more
Caused by: java.lang.IllegalArgumentException: Empty key
at javax.crypto.spec.SecretKeySpec.<init>(SecretKeySpec.java:96)
at com.amazonaws.auth.AbstractAWSSigner.sign(AbstractAWSSigner.java:87)
... 15 more
juil. 25, 2012 10:49:54 AM hudson.slaves.SlaveComputer tryReconnect
Record
1

Record
2

Record
3

Example:
Indexing
log4j
w/
stacktraces

morphlines : [
{
id : morphline1
importCommands : ["com.cloudera.**", "org.apache.solr.**"]
commands : [
{
readMultiLine {
regex : "(^.+Exception: .+)|(^s+at .+)|(^s+... d+ more)|(^s*Caused by:.+)"
what : previous
charset : UTF-8
}
}
{ logDebug { format : "output record: {}", args : ["@{}"] } }
{ loadSolr {}
]
}
]

Example:
Escape
to
Java
Code

morphlines : [
{
id : morphline1
importCommands : ["com.cloudera.**", "org.apache.solr.**"]
commands : [
{ java
{
code: """
List tags = record.get("tags");
if (!tags.contains("hello")) {
return false;
}
tags.add("world");
return child.process(record);
"""
}
}
]
}
]

Current
Command
Library

•  Integrate
with
and
load
into
Apache
Solr

•  Flexible
log
ﬁle
analysis

•  Single-‐line
record,
mulU-‐line
records,
CSV
ﬁles

•  Regex
based
pabern
matching
and
extracUon

•  IntegraUon
with
Avro,
JSON,
XML,
HTML

•  IntegraUon
with
Apache
Hadoop
Sequence
Files

•  IntegraUon
with
SolrCell
and
all
Apache
Tika
parsers

•  Auto-‐detecUon
of
MIME
types
from
binary
data
using

Apache
Tika

Current
Command
Library
(cont’d)

•  ScripUng
support
for
dynamic
java
code

•  OperaUons
on
fields
for
assignment
and
comparison

•  OperaUons
on
fields
with
list
and
set
semanUcs

•  if-‐then-‐else
condiUonals

•  A
small
rules
engine
(tryRules)

•  String
and
Umestamp
conversions

•  slf4j
logging

•  Yammer
metrics
and
counters

•  Decompression
and
unpacking
of
arbitrarily
nested

container
file
formats

•  etc

Plugin
Commands

•  Easy
to
add
new
I/O
&
transformaUon
cmds

•  Integrate
exisUng
funcUonality
and
third
party

systems

•  Implement
Java
interface
Command
or
subclass

AbstractCommand
•  Add
it
to
Java
classpath

•  No
registraUon
or
other
administraUve
acUon

required

Morphline
Example
–
syslog
with
grok

morphlines
:
[

{

id
:
morphline1

importCommands
:
["com.cloudera.**",
"org.apache.solr.**"]

commands
:
[

{
readLine
{}
}

{

grok
{

dicUonaryFiles
:
[/tmp/grok-‐dicUonaries]

expressions
:
{

message
:
"""<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_Umestamp}
%
{SYSLOGHOST:syslog_hostname}
%{DATA:syslog_program}(?:[%{POSINT:syslog_pid}])?:
%
{GREEDYDATA:syslog_message}"""

}

}

}

{
loadSolr
{}
}

]

}

]

Example
Input

<164>Feb

4
10:46:14
syslog
sshd[607]:
listening
on
0.0.0.0
port
22

Output
Record

syslog_pri:164

syslog_Umestamp:Feb

4
10:46:14

syslog_hostname:syslog

syslog_program:sshd

syslog_pid:607

syslog_message:listening
on
0.0.0.0
port
22.

PotenUal
New
Plugin
Commands

•  Extract,
clean,
transform,
join,
integrate,
enrich
and

decorate
records

•  Examples

•  join
records
with
external
data
sources
such
as
relaUonal

databases,
key-‐value
stores,
local
ﬁles
or
IP
Geo
lookup

tables.

•  Perform
DNS
resoluUon,
expand
shortened
URLs

•  fetch
linked
metadata
from
social
networks

•  do
senUment
analysis
&
annotate
record
accordingly

•  conUnuously
maintain
stats
over
sliding
windows

•  compute
exact
or
approx.
disUnct
values
&
quanUles

Use
Case:
Cloudera
Search

An
Integrated
Part
of

the
Hadoop
System

One
pool
of
data

One
security
framework

One
set
of
system
resources

One
management
interface

What
is
Cloudera
Search?

•  Full-‐text,
interacUve
search
and
faceted
navigaUon

•  Batch,
near
real-‐Ume,
and
on-‐demand
indexing

•  Apache
Solr
integrated
with
CDH

•  Established,
mature
search
with
vibrant
community

•  Separate
runUme
like
MapReduce,
Impala

•  Incorporated
as
part
of
the
Hadoop
ecosystem

•  Open
Source

•  100%
Apache,
100%
Solr

•  Standard
Solr
APIs

ETL
for
Distributed
Search
on
Apache
Hadoop

Flume

Hue
UI

Custom

UI

Custom

App

Solr

Solr

Solr

SolrCloud

query

query

query

Index

(ETL)

Hadoop
Cluster

MR

HDFS

Index

(ETL)

HBase

Index

(ETL)

Near
Real
Time
ETL
&
Indexing
with
Flume

Log
File

Apache
Solr
and

Apache
Flume

•  Data
ingest
at
scale

•  Flexible
extracUon
and

mapping

•  Indexing
at
data
ingest

•  Packaged
as
Flume

Morphline
Solr
Sink

HDFS

Flume

Agent

Indexer
w/

Morphline

Other
Log
File

Flume

Agent

Indexer
w/

Morphline

26

agent.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.solrSink.morphlineFile = /etc/flume-ng/conf/morphline.conf
Flume.conf

Cloudera
Manager
Flume
Morphline
GUI

27

Scalable
Batch
ETL
&
Indexing

Index

shard

Files

Index

shard

Indexer
w/

Morphline

Files

Solr

server

Indexer
w/

Morphline

Solr

server

28
HDFS

Solr
and
MapReduce

•  Flexible,
scalable
batch

indexing

•  Start
serving
new
indices

with
no
downUme

•  On-‐demand
indexing,
cost-‐
eﬃcient
re-‐indexing

•  Packaged
as

MapReduceIndexerTool

hadoop ... MapReduceIndexerTool --morphline-file morphline.conf ...

MapReduceIndexerTool

29
hadoop ... MapReduceIndexerTool --morphline-file morphline.conf ...
S0_0_0
Extractors
(Mappers)
Leaf Shards
(Reducers)
Root Shards
(Mappers)
S0_0_1
S0S0_1_0
S0_1_1
S1_0_0
S1_0_1
S1S1_1_0
S1_1_1
Input
Files
...
...
...
...
•  Morphline
runs
inside
Mapper

Near
Real
Time
indexing
of
Apache
HBase

HDFS

HBase

interacUve
load

Lily
HBase

Indexer(s)

with

Morphline

Triggers
on

updates

Solr
server

Solr
server

Solr
server

Solr
server

Solr
server

Search

+
=

Large
scale
tabular
data

immediate
access
&
updates

fast
&
ﬂexible
informaDon

discovery

BIG
DATA
DATAMANAGEMENT

Batch
&
Near
Real
Time
ETL

Tweets
Flume Solr
Hue UI
HDFS
MapReduceIndexerTool, Impala, HBase, Mahout, EDW, MR, etc
Lily HBase Indexer
HdfsSink
Query
MapReduce
IndexerTool
Log Formats
Social Media
HTML
Images
PDF
Custom UI
Query
Custom App
...
Morphline
Morphline
MorphlineSink
Morphline
HBase
OLTP

What’s
next

•  More
work
on
Apache
HBase
IntegraUon

•  IntegraUon
into
Apache
Crunch

•  Stream
AnalyUcs

Conclusion

•  Cloudera
Development
Kit
w/
Morphlines

•  Open
Source
-‐
ASL
License

•  Version
0.4.1
shipping

•  Extensive
documentaUon

•  Send
your
quesUons
and
feedback
to
cdk-‐dev
mailing
list

•  Also
ships
integrated
with
Cloudera
Search

•  Free
QuickStart
VM
also
available!

Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Semelhante a Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

Semelhante a Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek (20)

Mais de Hakka Labs

Mais de Hakka Labs (20)

Último

Último (20)

Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek