Using Morphlines for on-the-fly ETL
Wolfgang Hoschek (@whoschek)
SF Data Engineering Meetup, July 2013
  
Agenda
•  Big Data, ETL and Search - setting the stage
•  Cloudera Morphlines Architecture
•  Component Deep Dive
•  Cloudera Search Use Cases
•  What’s next?
Feel free to ask questions as we go!
  
Example ETL Use Case: Distributed Search on Hadoop
[Diagram: Hue UI, Custom UI and Custom App send queries to SolrCloud (Solr, Solr, Solr). Flume, MR on HDFS, and HBase in the Hadoop Cluster each feed SolrCloud through an Index (ETL) step.]
  
Cloudera Morphlines Architecture
[Diagram: Logs, tweets, social media, html, images, pdf, text… anything you want to index -> Flume, MR Indexer, HBase indexer, etc... or your application! (each embedding the Morphline Library) -> SolrCloud (Solr, Solr, Solr)]
Morphlines can be embedded in any application… Your App!
  
Cloudera Morphlines
•  Open Source framework for simple ETL
•  Consume any kind of data from any kind of data source, process and load into any app or storage system
•  Designed for Near Real Time apps & Batch apps
•  Ships as part of the Cloudera Development Kit (CDK) and Cloudera Search
•  It’s a Java library
•  ASL licensed on github https://github.com/cloudera/cdk
  
•  Similar to Unix pipelines, but more convenient & efficient
•  Configuration over coding (reduce time & skills)
•  Supports common file formats
    •  Log Files & Text
    •  Avro, Sequence file
    •  JSON, HTML & XML
    •  Etc… (pluggable)
•  Extensible set of transformation commands
  
Extraction, Transformation and Loading
•  Chain of pipelined commands
•  Simple and flexible data mapping & transformation
•  Reusable across multiple index workloads
•  Over time, extend and re-use across platform workloads
[Diagram: syslog -> Flume Agent (Solr sink with Morphline Library: Command: readLine -> Command: grok -> Command: loadSolr) -> Solr. Data labels along the flow: Event, Record, Record, Record, Document.]
  
Like a Unix Pipeline
•  Like Unix pipelines where the data model is generalized to work with streams of generic records, including arbitrary binary payloads
•  Designed to be embedded into Hadoop components such as Search, Flume, MapReduce, Pig, Hive, Sqoop
  
Stdlib + plugins
•  Framework ships with a set of frequently used high level transformation and I/O commands that can be combined in application specific ways
•  The plugin system allows the adding of new transformations and I/O commands and integrates existing functionality and third party systems in a straightforward manner
  
Flexible Data Model
•  A record is a set of named fields where each field has an ordered list of one or more Java Objects (i.e. Guava’s ArrayListMultimap)
•  Field can have multiple values and any two records need not use common field names
•  Corresponds exactly to Solr/Lucene data model
•  Pass not only structured data, but also arbitrary binary data
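The record model described above can be sketched in plain Java. This is only an illustration: RecordSketch and its methods are hypothetical names, and a LinkedHashMap of lists stands in for the Guava ArrayListMultimap the slide mentions, so the snippet needs nothing beyond the JDK.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a morphline record: field name -> ordered list of values. */
final class RecordSketch {
    private final Map<String, List<Object>> fields = new LinkedHashMap<>();

    /** Appends a value to the named field, creating the field on first use. */
    void put(String name, Object value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Returns the values of the named field in insertion order, or an empty list. */
    List<Object> get(String name) {
        return fields.getOrDefault(name, new ArrayList<>());
    }

    /** Builds a record that mixes a multi-valued field with a binary payload. */
    static RecordSketch example() {
        RecordSketch record = new RecordSketch();
        record.put("tags", "hello");
        record.put("tags", "world"); // a field may hold several values
        record.put("_attachment_body", new byte[] {0x25, 0x50, 0x44, 0x46}); // arbitrary bytes
        record.put("_attachment_mimetype", "application/pdf");
        return record;
    }
}
```

Two records built this way need not share any field names, which is the property the slide compares to the Solr/Lucene document model.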
  
Passing Binary Data
•  _attachment_body field (optional)
    •  java.io.InputStream or Java byte[]
•  optional fields assist w/ detecting & parsing data type
    •  _attachment_mimetype field
        •  e.g. "application/pdf"
    •  _attachment_charset field
        •  e.g. "UTF-8"
    •  _attachment_name field
        •  e.g. "cars.pdf"
•  Conceptually similar to email and HTTP headers/body
  
Processing Model
•  Morphline commands manipulate continuous or arbitrarily large streams of records
•  A command transforms a record into zero or more records
•  The output records of a command are passed to the next command in the chain
•  A command can contain nested commands
•  A morphline is a tree of commands, essentially a push-based data flow engine
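A minimal sketch of such a push-based chain, assuming records are plain strings for brevity. The Command interface and the split, dropEmpty and collect helpers are hypothetical names chosen for this illustration and are not the Morphlines API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical push-based command chain; all names here are illustrative. */
final class PipelineSketch {

    /** A command receives one record and pushes zero or more records to its child. */
    interface Command {
        boolean process(String record);
    }

    /** Splits a record on commas and pushes each piece downstream (one in, many out). */
    static Command split(Command child) {
        return record -> {
            for (String part : record.split(",")) {
                if (!child.process(part.trim())) {
                    return false; // stop as soon as a downstream command reports failure
                }
            }
            return true;
        };
    }

    /** Drops empty records and passes the rest on (one in, zero or one out). */
    static Command dropEmpty(Command child) {
        return record -> record.isEmpty() || child.process(record);
    }

    /** Terminal command that collects whatever reaches the end of the chain. */
    static Command collect(List<String> sink) {
        return record -> sink.add(record);
    }

    /** Wires the chain once, then pushes a single input record through it. */
    static List<String> run(String input) {
        List<String> out = new ArrayList<>();
        Command chain = split(dropEmpty(collect(out)));
        chain.process(input);
        return out;
    }
}
```

Each command calls its child directly, so the whole chain runs in the caller's thread with no queues in between.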
  
Processing Model Non-Goals
•  Designed to be embedded into multiple host systems, thus…
•  No notion of persistence or durability or distributed computing or node failover
•  Basically just a chain of in-memory transformations in the current thread
•  No need to manage multiple nodes or threads - already covered by host systems such as MapReduce, Flume, Storm, etc.
•  However, a morphline does support passing notifications
    •  E.g. BEGIN_TRANSACTION, COMMIT_TRANSACTION, ROLLBACK_TRANSACTION, SHUTDOWN
  
Performance and Scaling
•  The runtime compiles morphline on the fly
•  The runtime processes all commands of a given morphline in the same thread
•  For scalability, deploy many morphline instances on a cluster in many Flume agents and MapReduce tasks
  
Syntax
•  HOCON format (Human-Optimized Config Object Notation)
    •  Basically JSON slightly adjusted for the configuration file use case
    •  Came out of typesafe.com
    •  Also used by Akka and Play frameworks
  
Example: Indexing log4j w/ stacktraces
Record 1
juil. 25, 2012 10:49:40 AM hudson.triggers.SafeTimerTask run ok
Record 2
juil. 25, 2012 10:49:46 AM hudson.triggers.SafeTimerTask run failed
com.amazonaws.AmazonClientException: Unable to calculate a request signature
    at com.amazonaws.auth.AbstractAWSSigner.signAndBase64Encode(AbstractAWSSigner.java:71)
    at java.util.TimerThread.run(Timer.java:505)
Caused by: com.amazonaws.AmazonClientException: Unable to calculate a request signature
    at com.amazonaws.auth.AbstractAWSSigner.sign(AbstractAWSSigner.java:90)
    at com.amazonaws.auth.AbstractAWSSigner.signAndBase64Encode(AbstractAWSSigner.java:68)
    ... 14 more
Caused by: java.lang.IllegalArgumentException: Empty key
    at javax.crypto.spec.SecretKeySpec.<init>(SecretKeySpec.java:96)
    at com.amazonaws.auth.AbstractAWSSigner.sign(AbstractAWSSigner.java:87)
    ... 15 more
Record 3
juil. 25, 2012 10:49:54 AM hudson.slaves.SlaveComputer tryReconnect
  
Example: Indexing log4j w/ stacktraces
morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      {
        # Fold stack trace lines into the record of the preceding log line
        readMultiLine {
          regex : "(^.+Exception: .+)|(^\\s+at .+)|(^\\s+\\.\\.\\. \\d+ more)|(^\\s*Caused by:.+)"
          what : previous
          charset : UTF-8
        }
      }
      # Log each record before it is sent to Solr
      { logDebug { format : "output record: {}", args : ["@{}"] } }
      { loadSolr {} }
    ]
  }
]

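The grouping that readMultiLine performs with what : previous can be approximated in a few lines of Java. This is a sketch under assumptions: MultiLineSketch is a hypothetical name, the regex is taken to carry the usual \s and \d escapes, and a full-line match is assumed, which may differ from what the real command does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration of "append matching lines to the previous record". */
final class MultiLineSketch {
    // The readMultiLine pattern written as a Java string literal.
    private static final Pattern CONTINUATION = Pattern.compile(
        "(^.+Exception: .+)|(^\\s+at .+)|(^\\s+\\.\\.\\. \\d+ more)|(^\\s*Caused by:.+)");

    /** Returns one string per record; continuation lines are joined with newlines. */
    static List<String> group(List<String> lines) {
        List<String> records = new ArrayList<>();
        for (String line : lines) {
            if (!records.isEmpty() && CONTINUATION.matcher(line).matches()) {
                // The line belongs to the previous record (what : previous).
                int last = records.size() - 1;
                records.set(last, records.get(last) + "\n" + line);
            } else {
                // Any other line starts a new record.
                records.add(line);
            }
        }
        return records;
    }
}
```

Applied to a log like the sample input, the exception line, the indented "at" frames, the "... N more" lines and the "Caused by:" lines all attach to the log line that precedes them.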
Example: Escape to Java Code
morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      {
        java {
          code: """
            List tags = record.get("tags");
            // Reject records that do not carry the "hello" tag
            if (!tags.contains("hello")) {
              return false;
            }
            tags.add("world");
            // Pass the record on to the next command in the chain
            return child.process(record);
          """
        }
      }
    ]
  }
]

Current Command Library
•  Integrate with and load into Apache Solr
•  Flexible log file analysis
•  Single-line record, multi-line records, CSV files
•  Regex based pattern matching and extraction
•  Integration with Avro, JSON, XML, HTML
•  Integration with Apache Hadoop Sequence Files
•  Integration with SolrCell and all Apache Tika parsers
•  Auto-detection of MIME types from binary data using Apache Tika
  
Current Command Library (cont’d)
•  Scripting support for dynamic Java code
•  Operations on fields for assignment and comparison
•  Operations on fields with list and set semantics
•  if-then-else conditionals
•  A small rules engine (tryRules)
•  String and timestamp conversions
•  slf4j logging
•  Yammer metrics and counters
•  Decompression and unpacking of arbitrarily nested container file formats
•  etc
  
Plugin Commands
•  Easy to add new I/O & transformation cmds
•  Integrate existing functionality and third party systems
•  Implement Java interface Command or subclass AbstractCommand
•  Add it to Java classpath
•  No registration or other administrative action required
  
Morphline Example - syslog with grok
morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      {
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      { loadSolr {} }
    ]
  }
]
  
Example Input
<164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22.
Output Record
syslog_pri:164
syslog_timestamp:Feb  4 10:46:14
syslog_hostname:syslog
syslog_program:sshd
syslog_pid:607
syslog_message:listening on 0.0.0.0 port 22.
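To make the field extraction concrete, the same mapping can be approximated with java.util.regex named groups. This is a sketch, not how grok is implemented: SyslogGrokSketch is a hypothetical name, and the sub-patterns are simplified hand-written stand-ins for the grok dictionary entries (POSINT, SYSLOGTIMESTAMP, SYSLOGHOST, DATA, GREEDYDATA).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of grok-style extraction for one syslog line. */
final class SyslogGrokSketch {
    // Java group names cannot contain underscores, so short names are used here
    // and mapped to the syslog_* field names when the record is built.
    private static final Pattern SYSLOG = Pattern.compile(
        "<(?<pri>\\d+)>"
        + "(?<timestamp>[A-Z][a-z]{2} +\\d{1,2} \\d{2}:\\d{2}:\\d{2}) "
        + "(?<hostname>\\S+) "
        + "(?<program>[^\\[:]+)(?:\\[(?<pid>\\d+)\\])?: "
        + "(?<message>.*)");

    /** Returns the extracted fields, or an empty map if the line does not match. */
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = SYSLOG.matcher(line);
        if (!m.matches()) {
            return fields;
        }
        fields.put("syslog_pri", m.group("pri"));
        fields.put("syslog_timestamp", m.group("timestamp"));
        fields.put("syslog_hostname", m.group("hostname"));
        fields.put("syslog_program", m.group("program"));
        if (m.group("pid") != null) { // the [pid] part is optional
            fields.put("syslog_pid", m.group("pid"));
        }
        fields.put("syslog_message", m.group("message"));
        return fields;
    }
}
```

The grok command spares you from writing such expressions by hand: the dictionary files supply the named sub-patterns, and the configuration only composes them.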
  
Potential New Plugin Commands
•  Extract, clean, transform, join, integrate, enrich and decorate records
•  Examples
    •  join records with external data sources such as relational databases, key-value stores, local files or IP Geo lookup tables.
    •  Perform DNS resolution, expand shortened URLs
    •  fetch linked metadata from social networks
    •  do sentiment analysis & annotate record accordingly
    •  continuously maintain stats over sliding windows
    •  compute exact or approx. distinct values & quantiles
  
Use Case: Cloudera Search
An Integrated Part of the Hadoop System
One pool of data
One security framework
One set of system resources
One management interface
  
What is Cloudera Search?
•  Full-text, interactive search and faceted navigation
•  Batch, near real-time, and on-demand indexing
•  Apache Solr integrated with CDH
    •  Established, mature search with vibrant community
    •  Separate runtime like MapReduce, Impala
    •  Incorporated as part of the Hadoop ecosystem
•  Open Source
    •  100% Apache, 100% Solr
    •  Standard Solr APIs
  
ETL for Distributed Search on Apache Hadoop
[Diagram: Hue UI, Custom UI and Custom App send queries to SolrCloud (Solr, Solr, Solr). Flume, MR on HDFS, and HBase in the Hadoop Cluster each feed SolrCloud through an Index (ETL) step.]
  
Near Real Time ETL & Indexing with Flume
Apache Solr and Apache Flume
•  Data ingest at scale
•  Flexible extraction and mapping
•  Indexing at data ingest
•  Packaged as Flume Morphline Solr Sink
[Diagram: Log File -> Flume Agent (Indexer w/ Morphline); Other Log File -> Flume Agent (Indexer w/ Morphline); HDFS]
Flume.conf
agent.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.solrSink.morphlineFile = /etc/flume-ng/conf/morphline.conf
  
Cloudera Manager Flume Morphline GUI

Scalable Batch ETL & Indexing
Solr and MapReduce
•  Flexible, scalable batch indexing
•  Start serving new indices with no downtime
•  On-demand indexing, cost-efficient re-indexing
•  Packaged as MapReduceIndexerTool
[Diagram: Files in HDFS -> Indexer w/ Morphline -> Index shard -> Solr server (two parallel paths)]
hadoop ... MapReduceIndexerTool --morphline-file morphline.conf ...

MapReduceIndexerTool
hadoop ... MapReduceIndexerTool --morphline-file morphline.conf ...
[Diagram: Input Files -> Extractors (Mappers) -> Leaf Shards (Reducers): S0_0_0, S0_0_1, S0_1_0, S0_1_1, S1_0_0, S1_0_1, S1_1_0, S1_1_1 -> Root Shards (Mappers): S0, S1]
•  Morphline runs inside Mapper
  
Near Real Time indexing of Apache HBase
[Diagram: interactive load -> HBase (on HDFS) -> Triggers on updates -> Lily HBase Indexer(s) with Morphline -> Solr servers -> Search]
Large scale tabular data, immediate access & updates + fast & flexible information discovery = BIG DATA DATA MANAGEMENT
  
Batch & Near Real Time ETL
[Diagram labels: Tweets, Log Formats, Social Media, HTML, Images, PDF, ...; Flume with MorphlineSink and HdfsSink; HDFS; MapReduce IndexerTool with Morphline; HBase (OLTP) with Lily HBase Indexer and Morphline; Solr; Query from Hue UI, Custom UI and Custom App; MapReduceIndexerTool, Impala, HBase, Mahout, EDW, MR, etc]

What’s next
•  More work on Apache HBase Integration
•  Integration into Apache Crunch
•  Stream Analytics
  
Conclusion
•  Cloudera Development Kit w/ Morphlines
    •  Open Source - ASL License
    •  Version 0.4.1 shipping
    •  Extensive documentation
•  Send your questions and feedback to cdk-dev mailing list
•  Also ships integrated with Cloudera Search
    •  Free QuickStart VM also available!
  

DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopHakka Labs
 

Mais de Hakka Labs (20)

Always Valid Inference (Ramesh Johari, Stanford)
Always Valid Inference (Ramesh Johari, Stanford)Always Valid Inference (Ramesh Johari, Stanford)
Always Valid Inference (Ramesh Johari, Stanford)
 
DataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series searchDataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series search
 
DataEngConf SF16 - Data Asserts: Defensive Data Science
DataEngConf SF16 - Data Asserts: Defensive Data ScienceDataEngConf SF16 - Data Asserts: Defensive Data Science
DataEngConf SF16 - Data Asserts: Defensive Data Science
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
 
DataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at InstacartDataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at Instacart
 
DataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Running simulations at scaleDataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Running simulations at scale
 
DataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
DataEngConf SF16 - Deriving Meaning from Wearable Sensor DataDataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
DataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
 
DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale
 
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQDataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
 
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
 
DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...
 
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestDataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
 
DataEngConf SF16 - Bridging the gap between data science and data engineering
DataEngConf SF16 - Bridging the gap between data science and data engineeringDataEngConf SF16 - Bridging the gap between data science and data engineering
DataEngConf SF16 - Bridging the gap between data science and data engineering
 
DataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Multi-temporal Data StructuresDataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Multi-temporal Data Structures
 
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using SparkDataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
 
DataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Beginning with OurselvesDataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Beginning with Ourselves
 
DataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Routing Billions of Analytics Events with High DeliverabilityDataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
 
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
 
DataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedInDataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedIn
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
 

Último

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 

Último (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 

Cloudera - Using morphlines for on the-fly ETL by Wolfgang Hoschek

  • 1. Using Morphlines for on-the-fly ETL. Wolfgang Hoschek (@whoschek), SF Data Engineering Meetup, July 2013
  • 2. Agenda
      • Big Data, ETL and Search: setting the stage
      • Cloudera Morphlines Architecture
      • Component Deep Dive
      • Cloudera Search Use Cases
      • What's next?
    Feel free to ask questions as we go!
  • 3. Example ETL Use Case: Distributed Search on Hadoop
    [Architecture diagram; labels: Flume, MR, HDFS and HBase inside the Hadoop Cluster, each marked "Index (ETL)"; SolrCloud with three Solr servers; Hue UI, Custom UI and Custom App, each marked "query"]
  • 4. Cloudera Morphlines Architecture
      • Input: logs, tweets, social media, HTML, images, PDF, text... anything you want to index
      • Host: Flume, MR Indexer, HBase indexer, etc. Or your application!
      • The Morphline Library runs inside the host and loads into SolrCloud (three Solr servers)
      • Morphlines can be embedded in any application... your app!
  • 5. Cloudera Morphlines
      • Open Source framework for simple ETL
      • Consume any kind of data from any kind of data source, process it and load it into any app or storage system
      • Designed for Near Real Time apps & Batch apps
      • Ships as part of the Cloudera Development Kit (CDK) and Cloudera Search
      • It's a Java library
      • ASL licensed, on GitHub: https://github.com/cloudera/cdk
      • Similar to Unix pipelines, but more convenient & efficient
      • Configuration over coding (reduces time & skills required)
      • Supports common file formats
          • Log Files & Text
          • Avro, Sequence file
          • JSON, HTML & XML
          • Etc... (pluggable)
      • Extensible set of transformation commands
  • 6. Extraction, Transformation and Loading
      • Chain of pipelined commands
      • Simple and flexible data mapping & transformation
      • Reusable across multiple index workloads
      • Over time, extend and reuse across platform workloads
    [Diagram: syslog -> Flume Agent -> Solr sink -> Solr. Inside the Solr sink, the Morphline Library turns the Flume Event into a Record and passes it through Command: readLine -> Command: grok -> Command: loadSolr, which emits a Document to Solr]
  • 7. Like a Unix Pipeline
      • Like Unix pipelines, but with the data model generalized to work with streams of generic records, including arbitrary binary payloads
      • Designed to be embedded into Hadoop components such as Search, Flume, MapReduce, Pig, Hive, Sqoop
  • 8. Stdlib + plugins
      • The framework ships with a set of frequently used high-level transformation and I/O commands that can be combined in application-specific ways
      • The plugin system allows new transformations and I/O commands to be added, and integrates existing functionality and third-party systems in a straightforward manner
  • 9. Flexible Data Model
      • A record is a set of named fields where each field has an ordered list of one or more Java Objects (i.e. Guava's ArrayListMultimap)
      • A field can have multiple values, and any two records need not use common field names
      • Corresponds exactly to the Solr/Lucene data model
      • Pass not only structured data, but also arbitrary binary data
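The record model on this slide can be sketched with JDK collections alone. This is a minimal illustration, not the library's own record class: `SketchRecord`, `put` and `get` are hypothetical names, and a `LinkedHashMap` of lists stands in for the Guava `ArrayListMultimap` the slide mentions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a morphline record: each named field maps to an
// ordered list of values, and two records need not share any field names.
class SketchRecord {
    private final Map<String, List<Object>> fields = new LinkedHashMap<>();

    // Appends a value to the named field, creating the field on first use.
    void put(String name, Object value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Returns the field's values in insertion order, or an empty list.
    List<Object> get(String name) {
        return fields.getOrDefault(name, List.of());
    }

    public static void main(String[] args) {
        SketchRecord record = new SketchRecord();
        record.put("tags", "hello");
        record.put("tags", "world");                           // multi-valued field
        record.put("_attachment_body", new byte[] {1, 2, 3});  // binary payload
        System.out.println(record.get("tags"));
    }
}
```

Because values are plain `Object`s, the same structure can carry strings, numbers, or a raw `byte[]` payload side by side.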
  • 10. Passing Binary Data
      • _attachment_body field (optional)
          • java.io.InputStream or Java byte[]
      • Optional fields assist with detecting & parsing the data type
          • _attachment_mimetype field, e.g. "application/pdf"
          • _attachment_charset field, e.g. "UTF-8"
          • _attachment_name field, e.g. "cars.pdf"
      • Conceptually similar to email and HTTP headers/body
  • 11. Processing Model
      • Morphline commands manipulate continuous or arbitrarily large streams of records
      • A command transforms a record into zero or more records
      • The output records of a command are passed to the next command in the chain
      • A command can contain nested commands
      • A morphline is a tree of commands, essentially a push-based data flow engine
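A push-based chain like the one described on this slide can be sketched in plain Java. Everything here is illustrative: `PipelineSketch`, its `Command` interface and the two example commands are hypothetical and only mimic the idea that each command pushes zero or more records to its child; records are simplified to a `Map`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical push-based command chain; names are illustrative only.
class PipelineSketch {
    // A command receives one record and pushes zero or more records downstream.
    interface Command {
        boolean process(Map<String, Object> record);
    }

    // Splits the "line" field on commas and pushes one record per token.
    static Command split(Command child) {
        return record -> {
            for (String token : ((String) record.get("line")).split(",")) {
                Map<String, Object> out = new HashMap<>(record);
                out.put("line", token);
                if (!child.process(out)) {
                    return false; // propagate failure upstream
                }
            }
            return true;
        };
    }

    // Drops empty tokens (zero output records); forwards the rest upper-cased.
    static Command upperCaseNonEmpty(Command child) {
        return record -> {
            String line = (String) record.get("line");
            if (line.isEmpty()) {
                return true; // record consumed, nothing emitted
            }
            Map<String, Object> out = new HashMap<>(record);
            out.put("line", line.toUpperCase(Locale.ROOT));
            return child.process(out);
        };
    }

    // Runs the chain over one input line and collects what reaches the end.
    static List<String> run(String input) {
        List<String> sink = new ArrayList<>();
        Command collector = record -> sink.add((String) record.get("line"));
        Command chain = split(upperCaseNonEmpty(collector));
        Map<String, Object> first = new HashMap<>();
        first.put("line", input);
        chain.process(first);
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(run("a,,b"));
    }
}
```

All work happens in the calling thread: `split` calls its child directly for every token, so there is no queue or scheduler between commands.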
  • 12. Processing Model Non-Goals
      • Designed to be embedded into multiple host systems, thus...
      • No notion of persistence, durability, distributed computing or node failover
      • Basically just a chain of in-memory transformations in the current thread
      • No need to manage multiple nodes or threads: that is already covered by host systems such as MapReduce, Flume, Storm, etc.
      • However, a morphline does support passing notifications
          • E.g. BEGIN_TRANSACTION, COMMIT_TRANSACTION, ROLLBACK_TRANSACTION, SHUTDOWN
  • 13. Performance and Scaling
      • The runtime compiles a morphline on the fly
      • The runtime processes all commands of a given morphline in the same thread
      • For scalability, deploy many morphline instances on a cluster, in many Flume agents and MapReduce tasks
  • 14. Syntax
      • HOCON format (Human-Optimized Config Object Notation)
      • Basically JSON, slightly adjusted for the configuration file use case
      • Came out of typesafe.com
      • Also used by the Akka and Play frameworks
  • 15. Example: Indexing log4j w/ stacktraces
    Record 1:
      juil. 25, 2012 10:49:40 AM hudson.triggers.SafeTimerTask run ok
    Record 2:
      juil. 25, 2012 10:49:46 AM hudson.triggers.SafeTimerTask run failed
      com.amazonaws.AmazonClientException: Unable to calculate a request signature
          at com.amazonaws.auth.AbstractAWSSigner.signAndBase64Encode(AbstractAWSSigner.java:71)
          at java.util.TimerThread.run(Timer.java:505)
      Caused by: com.amazonaws.AmazonClientException: Unable to calculate a request signature
          at com.amazonaws.auth.AbstractAWSSigner.sign(AbstractAWSSigner.java:90)
          at com.amazonaws.auth.AbstractAWSSigner.signAndBase64Encode(AbstractAWSSigner.java:68)
          ... 14 more
      Caused by: java.lang.IllegalArgumentException: Empty key
          at javax.crypto.spec.SecretKeySpec.<init>(SecretKeySpec.java:96)
          at com.amazonaws.auth.AbstractAWSSigner.sign(AbstractAWSSigner.java:87)
          ... 15 more
    Record 3:
      juil. 25, 2012 10:49:54 AM hudson.slaves.SlaveComputer tryReconnect
  • 16. Example: Indexing log4j w/ stacktraces
    morphlines : [
      {
        id : morphline1
        importCommands : ["com.cloudera.**", "org.apache.solr.**"]
        commands : [
          {
            readMultiLine {
              regex : "(^.+Exception: .+)|(^\\s+at .+)|(^\\s+\\.\\.\\. \\d+ more)|(^\\s*Caused by:.+)"
              what : previous
              charset : UTF-8
            }
          }
          { logDebug { format : "output record: {}", args : ["@{}"] } }
          { loadSolr {} }
        ]
      }
    ]
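The grouping that `readMultiLine` with `what : previous` performs in this example can be approximated in plain Java. This is a sketch under assumptions, not the library's implementation: `MultiLineSketch` and `group` are hypothetical names, the regex is written with its backslashes restored, and a line is treated as a continuation only when the whole line matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: lines that match the pattern are appended to the
// previous line, so a stack trace stays in the record of its log entry.
class MultiLineSketch {
    static final Pattern CONTINUATION = Pattern.compile(
        "(^.+Exception: .+)|(^\\s+at .+)|(^\\s+\\.\\.\\. \\d+ more)|(^\\s*Caused by:.+)");

    // Groups raw lines into multi-line records.
    static List<String> group(List<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            boolean continuation = CONTINUATION.matcher(line).matches();
            if (continuation && current != null) {
                current.append('\n').append(line); // belongs to previous line
            } else {
                if (current != null) {
                    records.add(current.toString());
                }
                current = new StringBuilder(line); // start a new record
            }
        }
        if (current != null) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        // Made-up sample lines, shortened from the log on the previous slide.
        List<String> records = group(List.of(
            "10:49:40 run ok",
            "10:49:46 run failed",
            "com.example.DemoException: boom",
            "\tat com.example.Demo.run(Demo.java:7)",
            "10:49:54 tryReconnect"));
        System.out.println(records.size());
    }
}
```

The design choice worth noting is that the pattern describes the continuation lines rather than the start of an entry, which is why timestamp formats never have to be spelled out.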
  • 17. Example: Escape to Java Code
    morphlines : [
      {
        id : morphline1
        importCommands : ["com.cloudera.**", "org.apache.solr.**"]
        commands : [
          {
            java {
              code: """
                List tags = record.get("tags");
                if (!tags.contains("hello")) {
                  return false;
                }
                tags.add("world");
                return child.process(record);
              """
            }
          }
        ]
      }
    ]
  • 18. Current Command Library
      • Integrate with and load into Apache Solr
      • Flexible log file analysis
      • Single-line records, multi-line records, CSV files
      • Regex-based pattern matching and extraction
      • Integration with Avro, JSON, XML, HTML
      • Integration with Apache Hadoop Sequence Files
      • Integration with SolrCell and all Apache Tika parsers
      • Auto-detection of MIME types from binary data using Apache Tika
  • 19. Current Command Library (cont'd)
      • Scripting support for dynamic Java code
      • Operations on fields for assignment and comparison
      • Operations on fields with list and set semantics
      • if-then-else conditionals
      • A small rules engine (tryRules)
      • String and timestamp conversions
      • slf4j logging
      • Yammer metrics and counters
      • Decompression and unpacking of arbitrarily nested container file formats
      • etc.
  • 20. Plugin Commands
      • Easy to add new I/O & transformation commands
      • Integrate existing functionality and third-party systems
      • Implement the Java interface Command or subclass AbstractCommand
      • Add it to the Java classpath
      • No registration or other administrative action required
  • 21. Morphline Example: syslog with grok
    morphlines : [
      {
        id : morphline1
        importCommands : ["com.cloudera.**", "org.apache.solr.**"]
        commands : [
          { readLine {} }
          {
            grok {
              dictionaryFiles : [/tmp/grok-dictionaries]
              expressions : {
                message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
              }
            }
          }
          { loadSolr {} }
        ]
      }
    ]
    Example Input
      <164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22
    Output Record
      syslog_pri:164
      syslog_timestamp:Feb  4 10:46:14
      syslog_hostname:syslog
      syslog_program:sshd
      syslog_pid:607
      syslog_message:listening on 0.0.0.0 port 22.
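What the grok expression in this example does can be approximated with named capturing groups from `java.util.regex`. The sketch below is an assumption-laden illustration: `GrokSketch` and `parse` are hypothetical names, and the pattern is a hand-written simplification of the grok dictionary entries (`POSINT`, `SYSLOGTIMESTAMP` and so on), not their actual definitions. Java group names cannot contain underscores, so the `syslog_` prefix is added when the fields are copied out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of grok-style extraction with named groups.
class GrokSketch {
    // Simplified, hand-expanded approximation of the grok expression.
    static final Pattern SYSLOG = Pattern.compile(
        "<(?<pri>\\d+)>"
        + "(?<timestamp>\\w{3} +\\d{1,2} \\d{2}:\\d{2}:\\d{2}) "
        + "(?<hostname>\\S+) "
        + "(?<program>.*?)(?:\\[(?<pid>\\d+)\\])?: "
        + "(?<message>.*)");

    static final String[] GROUPS =
        {"pri", "timestamp", "hostname", "program", "pid", "message"};

    // Returns the extracted fields, or an empty map if the line does not match.
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher matcher = SYSLOG.matcher(line);
        if (!matcher.matches()) {
            return fields;
        }
        for (String group : GROUPS) {
            String value = matcher.group(group);
            if (value != null) { // the pid group is optional
                fields.put("syslog_" + group, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "<164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22";
        parse(line).forEach((name, value) -> System.out.println(name + ":" + value));
    }
}
```

A grok dictionary adds reuse on top of this: patterns are named once and composed by reference, instead of being spelled out inline as they are here.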
  • 22. Potential New Plugin Commands
      • Extract, clean, transform, join, integrate, enrich and decorate records
      • Examples
          • Join records with external data sources such as relational databases, key-value stores, local files or IP geo lookup tables
          • Perform DNS resolution, expand shortened URLs
          • Fetch linked metadata from social networks
          • Do sentiment analysis & annotate the record accordingly
          • Continuously maintain stats over sliding windows
          • Compute exact or approximate distinct values & quantiles
  • 23. Use Case: Cloudera Search
    An Integrated Part of the Hadoop System
      • One pool of data
      • One security framework
      • One set of system resources
      • One management interface
  • 24. What is Cloudera Search?
      • Full-text, interactive search and faceted navigation
      • Batch, near real-time, and on-demand indexing
      • Apache Solr integrated with CDH
      • Established, mature search with vibrant community
      • Separate runtime like MapReduce, Impala
      • Incorporated as part of the Hadoop ecosystem
      • Open Source
      • 100% Apache, 100% Solr
      • Standard Solr APIs
  • 25. ETL for Distributed Search on Apache Hadoop
    [Architecture diagram; labels: Flume, MR, HDFS and HBase inside the Hadoop Cluster, each marked "Index (ETL)"; SolrCloud with three Solr servers; Hue UI, Custom UI and Custom App, each marked "query"]
  • 26. Near Real Time ETL & Indexing with Flume
    Apache Solr and Apache Flume
      • Data ingest at scale
      • Flexible extraction and mapping
      • Indexing at data ingest
      • Packaged as Flume Morphline Solr Sink
    [Diagram labels: Log File, Other Log File, two Flume Agents each with an "Indexer w/ Morphline", HDFS]
    Flume.conf
      agent.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
      agent.sinks.solrSink.morphlineFile = /etc/flume-ng/conf/morphline.conf
  • 27. Cloudera Manager Flume Morphline GUI
  • 28. Scalable Batch ETL & Indexing
    Solr and MapReduce
      • Flexible, scalable batch indexing
      • Start serving new indices with no downtime
      • On-demand indexing, cost-efficient re-indexing
      • Packaged as MapReduceIndexerTool
    [Diagram labels: Files in HDFS, two "Indexer w/ Morphline" tasks, two index shards, two Solr servers]
      hadoop ... MapReduceIndexerTool --morphline-file morphline.conf ...
  • 29. MapReduceIndexerTool
      hadoop ... MapReduceIndexerTool --morphline-file morphline.conf ...
    [Diagram: Input Files -> Extractors (Mappers) -> Leaf Shards (Reducers): S0_0_0, S0_0_1, S0_1_0, S0_1_1, S1_0_0, S1_0_1, S1_1_0, S1_1_1 -> Root Shards (Mappers): S0, S1]
      • Morphline runs inside the Mapper
  • 30. Near Real Time indexing of Apache HBase
    [Diagram labels: HDFS, HBase (interactive load), Lily HBase Indexer(s) with Morphline (triggers on updates), five Solr servers, Search]
    [Equation graphic: "Large scale tabular data, immediate access & updates" + "fast & flexible information discovery" = "BIG DATA MANAGEMENT"]
  • 31. Batch & Near Real Time ETL
    [Diagram labels. Sources: Tweets, Log Formats, Social Media, HTML, Images, PDF, HBase OLTP, ... Ingest and indexing: Flume (MorphlineSink, HdfsSink), HDFS, MapReduceIndexerTool with Morphline, Lily HBase Indexer with Morphline. Also on the cluster: Impala, HBase, Mahout, EDW, MR, etc. Serving: Solr, queried by Hue UI, Custom UI and Custom App]
  • 32. What's next
      • More work on Apache HBase Integration
      • Integration into Apache Crunch
      • Stream Analytics
  • 33. Conclusion
      • Cloudera Development Kit w/ Morphlines
      • Open Source, ASL License
      • Version 0.4.1 shipping
      • Extensive documentation
      • Send your questions and feedback to the cdk-dev mailing list
      • Also ships integrated with Cloudera Search
      • Free QuickStart VM also available!