Apache Flume
Getting Logs/Data to Hadoop
Steve Hoffman
Chicago Hadoop User Group (CHUG)
2014-04-09T10:30:00Z
About Me
• Steve Hoffman
• twitter: @bacoboy

else: http://bit.ly/bacoboy
• Tech Guy @Orbitz
• Wrote a book on Flume
Why do I need Flume?
• Created to deal with streaming data/logs to HDFS
• Can’t mount HDFS (usually)
• Can’t “copy” files to HDFS if the files aren’t closed (aka log files)
• Need to buffer “some”, then write and close a file — repeat
• May involve multiple hops due to topology (# of machines,
datacenter separation, etc).
• A lot can go wrong here…
Agent
• Java daemon
• Has a name (usually ‘agent’)
• Receives data from sources and writes events to 1 or more channels
• Moves events from 1 channel to a sink; removes them from the channel only if successfully written
Events
• Headers = Key/Value Pairs — Map<String, String>
• Body = byte array — byte[]
• For example:
10.10.1.1 - - [29/Jan/2014:03:36:04 -0600] "HEAD /ping.html HTTP/1.1" 200 0 "-" "-" "-"
{"timestamp":"1391986793111", "host":"server1.example.com"}
31302e31302e312e31202d202d205b32392f4a616e2f323031343a30333a33363a3034202d303630305d202248454144202f70696e672e68746d6c20485454502f312e312220323030203020222d2220222d2220222d22
Channels
• Place to hold Events
• Memory or File Backed (also JDBC, but why?)
• Bounded - Size is configurable
• Resources aren’t infinite
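As a rough sketch of those bounds (channel and agent names follow the examples later in the deck; the values are illustrative), a memory channel might be capped like this:

# events live in the JVM heap, so cap the queue and the per-transaction batch
agent.channels.c1.type = memory
agent.channels.c1.capacity = 10000
agent.channels.c1.transactionCapacity = 1000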
Sources
• Feeds data to one or more Channels
• Usually data is pushed to it (listening for data on a socket, e.g. HTTP Source) or sent from the Avro log4j appender
• Or it can periodically poll another system and generate events (e.g. run a command every minute and parse the output into an Event, query a DB/Mongo/etc.)
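For the push style, a minimal HTTP Source sketch (the port and channel name here are just illustrative):

# listens for POSTed JSON events and puts them on channel c1
agent.sources.r1.type = http
agent.sources.r1.bind = 0.0.0.0
agent.sources.r1.port = 8080
agent.sources.r1.channels = c1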
Sinks
• Move Events from a single Channel to a
destination
• Only removes from Channel if write successful
• HDFSSink is the one you’ll use the most — most likely…
Configuration Sample
# Agent named 'agent'
# Input (source)
agent.sources.r1.type = seq
agent.sources.r1.channels = c1

# Output (sink)
agent.sinks.k1.type = logger
agent.sinks.k1.channel = c1

# Channel
agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000

# Wire everything together
agent.sources = r1
agent.sinks = k1
agent.channels = c1
Startup
At startup, the agent walks the same configuration:
• name.{sources|sinks|channels}
• Find instance name + type
• Connect channel(s)
• Apply type-specific configurations
RTM - Flume User Guide
https://flume.apache.org/FlumeUserGuide.html
or my book :)
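For reference, launching that agent from the Flume distribution looks roughly like this (assuming the configuration above is saved as conf/flume.conf):

$ bin/flume-ng agent --name agent --conf conf \
    --conf-file conf/flume.conf -Dflume.root.logger=INFO,console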
Configuration Sample (logs)
Creating channels
Creating instance of channel c1 type memory
Created channel c1
Creating instance of source r1, type seq
Creating instance of sink: k1, type: logger
Channel c1 connected to [r1, k1]
Starting new configuration:{ sourceRunners:{r1=PollableSourceRunner:
{ source:org.apache.flume.source.SequenceGeneratorSource{name:r1,state:IDLE}
counterGroup:{ name:null counters:{} } }} sinkRunners:{k1=SinkRunner:
{ policy:org.apache.flume.sink.DefaultSinkProcessor@19484a05 counterGroup:{ name:null
counters:{} } }} channels:{c1=org.apache.flume.channel.MemoryChannel{name: c1}} }
Event: { headers:{} body: 30 0 }
Event: { headers:{} body: 31 1 }
Event: { headers:{} body: 32 2 }
and so on…
Using Cloudera Manager
• Same stuff, just in

a GUI
• Centrally managed in a
Database (instead of
source control/Git)
• Distributed from central
location (instead of
Chef/Puppet)
Multiple destinations need multiple channels
Channel Selector
• When more than 1 channel specified on Source
• Replicating (Each channel gets a copy) - default
• Multiplexing (Channel picked based on a header
value)
• Custom (If these don’t work for you - code one!)
Channel Selector
Replicating
• Copy sent to all channels associated with Source
agent.sources.r1.selector.type=replicating
agent.sources.r1.channels=c1 c2 c3
• Can specify “optional” channels
agent.sources.r1.selector.optional=c3
• The transaction succeeds if all non-optional channels take the event (in this case c1 & c2)
Channel Selector
Multiplexing
• Copy sent to only some of the channels
agent.sources.r1.selector.type=multiplexing
agent.sources.r1.channels=c1 c2 c3 c4
• Switch based on header key (e.g. {"currency":"USD"} → c1)
agent.sources.r1.selector.header=currency
agent.sources.r1.selector.mapping.USD=c1
agent.sources.r1.selector.mapping.EUR=c2 c3
agent.sources.r1.selector.default=c4
Interceptors
• Zero or more on Source (before written to channel)
• Zero or more on Sink (after read from channel)
• Or Both
• Use for transformations of data in-flight (headers OR body)
public Event intercept(Event event);
public List<Event> intercept(List<Event> events);
• Return null or empty List to drop Events
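To make that concrete, here is a minimal sketch of a custom interceptor (the class name, package, and drop-empty-body behavior are made up for illustration):

import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Drops any event with an empty body; keeps everything else untouched.
public class DropEmptyInterceptor implements Interceptor {

  @Override
  public void initialize() { }

  @Override
  public void close() { }

  @Override
  public Event intercept(Event event) {
    byte[] body = event.getBody();
    // returning null drops the event
    return (body == null || body.length == 0) ? null : event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    // batch form: return only the events to keep
    List<Event> kept = new ArrayList<Event>(events.size());
    for (Event e : events) {
      if (intercept(e) != null) {
        kept.add(e);
      }
    }
    return kept;
  }

  // Flume instantiates interceptors through a nested Builder
  public static class Builder implements Interceptor.Builder {
    @Override
    public Interceptor build() {
      return new DropEmptyInterceptor();
    }

    @Override
    public void configure(Context context) { }
  }
}

It would then be wired in with something like agent.sources.r1.interceptors.i1.type=com.example.DropEmptyInterceptor$Builder (the package name is hypothetical).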
Interceptor Chaining
• Processed in Order Listed in Configuration (source r1 example):
agent.sources.r1.interceptors=i1 i2 i3
agent.sources.r1.interceptors.i1.type=timestamp
agent.sources.r1.interceptors.i1.preserveExisting=true
agent.sources.r1.interceptors.i2.type=static
agent.sources.r1.interceptors.i2.key=datacenter
agent.sources.r1.interceptors.i2.value=CHI
agent.sources.r1.interceptors.i3.type=host
agent.sources.r1.interceptors.i3.hostHeader=relay
agent.sources.r1.interceptors.i3.useIP=false
• Resulting Headers added before writing to Channel:
{"timestamp":"1392350333234", "datacenter":"CHI", "relay":"flumebox.example.com"}
Morphlines
• Interceptor and Sink forms.
• See Cloudera Website/Blog
• Created to ease transforms and Cloudera Search/Flume integration.
• An example:
# convert the timestamp field to "yyyy-MM-dd'T'HH:mm:ss.SSSZ"
# The input may match one of "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
# or "yyyy-MM-dd'T'HH:mm:ss" or "yyyy-MM-dd".
convertTimestamp {
  field : timestamp
  inputFormats : ["yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", "yyyy-MM-dd'T'HH:mm:ss", "yyyy-MM-dd"]
  inputTimezone : America/Chicago
  outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
  outputTimezone : UTC
}
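To run a morphline like that from Flume, the interceptor form is wired in via configuration, roughly along these lines (the file path and morphline id are placeholders — check your distro's docs for the exact builder class it ships):

agent.sources.r1.interceptors = m1
agent.sources.r1.interceptors.m1.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
agent.sources.r1.interceptors.m1.morphlineFile = /etc/flume-ng/conf/morphline.conf
agent.sources.r1.interceptors.m1.morphlineId = morphline1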
Avro
• Apache Avro - Data Serialization
• http://avro.apache.org/
• Storage Format and Wire Protocol
• Self-Describing (schema written with the data)
• Supports Compression of Data (not container — so MapReduce friendly — “splittable”)
• Binary friendly — Doesn’t require records separated by \n (newlines)
Avro Source/Sink
• Preferred inter-agent transport in Flume
• Simple Configuration (host + port for sink and port for source)
• Minimal transformation needed for Flume Events
• Versions of Avro in client & server don’t need to match — only payload versioning matters (think protocol buffers vs Java serialization)
Avro Source/Sink Config
foo.sources=…
foo.channels=channel-foo
foo.channels.channel-foo.type=memory
foo.sinks=sink-foo
foo.sinks.sink-foo.channel=channel-foo
foo.sinks.sink-foo.type=avro
foo.sinks.sink-foo.hostname=bar.example.com
foo.sinks.sink-foo.port=12345
foo.sinks.sink-foo.compression-type=deflate

bar.sources=datafromfoo
bar.sources.datafromfoo.type=avro
bar.sources.datafromfoo.bind=0.0.0.0
bar.sources.datafromfoo.port=12345
bar.sources.datafromfoo.compression-type=deflate
bar.sources.datafromfoo.channels=channel-bar
bar.channels=channel-bar
bar.channels.channel-bar.type=memory
bar.sinks=…
log4j Avro Sink
• Remember that Web Server pushing data to a Source?
• Use the Flume Avro log4j appender!
• log level, category, etc. become headers in Event
• “message” String becomes the body
log4j Configuration
• log4j.properties sender (include flume-ng-sdk-1.X.X.jar in project):
log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname=example.com
log4j.appender.flume.Port=12345
log4j.appender.flume.UnsafeMode=true

log4j.logger.org.example.MyClass=DEBUG,flume
• flume avro receiver:
agent.sources=logs
agent.sources.logs.type=avro
agent.sources.logs.bind=0.0.0.0
agent.sources.logs.port=12345
agent.sources.logs.channels=…
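On the application side nothing special is needed — any logger routed to the flume appender works. A sender class matching the logger name configured above might look like this (the class itself is hypothetical):

package org.example;

import org.apache.log4j.Logger;

public class MyClass {
  private static final Logger LOG = Logger.getLogger(MyClass.class);

  public void doWork() {
    // the message String becomes the Event body;
    // level, category, etc. ride along as Event headers
    LOG.debug("something interesting happened");
  }
}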
Avro Client
• Send data to AvroSource from command line
• Run flume program with avro-client instead of agent parameter
$ bin/flume-ng avro-client -H server.example.com -p 12345 [-F input_file]
• Each line of the file (or stdin if no file given) becomes an event
• Useful for testing or injecting data from outside Flume sources (ExecSource vs cronjob which pipes output to avro-client).
HDFSSink
• Read from Channel and write to a file in HDFS in chunks
• Until 1 of 3 things happens:
• some amount of time elapses (rollInterval)
• some number of records have been written (rollCount)
• some size of data has been written (rollSize)
• Close that file and start a new one
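Those three triggers map directly to HDFS sink properties; a sketch with illustrative values (a trigger set to 0 is disabled):

# roll a new file every 5 minutes, or every 10,000 events,
# or at ~128 MB — whichever happens first
agent.sinks.k1.hdfs.rollInterval = 300
agent.sinks.k1.hdfs.rollCount = 10000
agent.sinks.k1.hdfs.rollSize = 134217728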
HDFS Configuration
foo.sources=…
foo.channels=channel-foo
foo.channels.channel-foo.type=memory
foo.sinks=sink-foo
foo.sinks.sink-foo.channel=channel-foo
foo.sinks.sink-foo.type=hdfs
foo.sinks.sink-foo.hdfs.path=hdfs://NN/data/%Y/%m/%d/%H
foo.sinks.sink-foo.hdfs.rollInterval=60
foo.sinks.sink-foo.hdfs.filePrefix=log
foo.sinks.sink-foo.hdfs.fileSuffix=.avro
foo.sinks.sink-foo.hdfs.inUsePrefix=_
foo.sinks.sink-foo.serializer=avro_event
foo.sinks.sink-foo.serializer.compressionCodec=snappy
HDFS writing…
drwxr-x---   - flume flume          0 2014-02-16 17:04 /data/2014/02/16/23
-rw-r-----   3 flume flume          0 2014-02-16 17:04 /data/2014/02/16/23/_log.1392591607925.avro.tmp
-rw-r-----   3 flume flume       1877 2014-02-16 17:01 /data/2014/02/16/23/log.1392591607923.avro
-rw-r-----   3 flume flume       1955 2014-02-16 17:02 /data/2014/02/16/23/log.1392591607924.avro
-rw-r-----   3 flume flume       2390 2014-02-16 17:04 /data/2014/02/16/23/log.1392591798436.avro
• The zero length .tmp file is the current file. Won’t see the real size until it closes (just like when you do a hadoop fs -put)
• Use …hdfs.inUsePrefix=_ to prevent open files from being included in MapReduce jobs
Event Serializers
• Defines how the Event gets written to Sink
• Just the body as a UTF-8 String
agent.sinks.foo-sink.serializer=text
• Headers and Body as UTF-8 String
agent.sinks.foo-sink.serializer=header_and_text
• Avro (Flume record Schema)
agent.sinks.foo-sink.serializer=avro_event
• Custom (none of the above meets your needs)
Lessons Learned
Source: https://xkcd.com/1179/
Too Many…
Timezones are Evil
• Daylight saving time causes problems twice a year (in Spring: no 2am hour. In Fall: twice the data during the 2am hour — 02:15? Which one?)
• Date processing in MapReduce jobs: Hourly jobs, filters, etc.
• Dated paths: hdfs://NN/data/%Y/%m/%d/%H
• Use UTC: -Duser.timezone=UTC
• Use one of the ISO8601 formats like 2014-02-26T18:00:00.000Z
• Sorts the way you usually want
• Every time library supports it* - and if not, easy to parse.
Generally Speaking…
• Async handoff doesn’t work under load when bad
stuff happens
(Diagram: a writer hands off to a reader through a filesystem, queue, database, or whatever — which is Not ∞)
Async Handoff Oops
(Diagram sequence: a Flume Agent runs tail -F foo.log while logrotate keeps rotating foo.log to foo.log.1, foo.log.2, … — eventually the reader falls behind and events are lost.)
Don’t Use Tail
• Tailing a file for input is bad - assumptions are made that
aren’t guarantees.
• Direct support removed during Flume rewrite
• Handoff can go bad with files: when the writer is faster than the reader
• With a Queue: when the reader doesn’t read before the expire time
• No way to apply “back pressure” to tell tail there is a
problem. It isn’t listening…
What can I use?
• If you can’t use the log4j Avro Appender…
• Use logrotate to move old logs to “spool” directory
• SpoolingDirectorySource
• Finally, a cron job to remove .COMPLETED files (for delayed delete) OR set deletePolicy=immediate
• Alternatively use logrotate with avro-client? (probably other ways too…)
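A minimal SpoolingDirectorySource sketch (the directory path is made up; deletePolicy takes never or immediate):

# watch the directory that logrotate moves finished files into
agent.sources.spool.type = spooldir
agent.sources.spool.spoolDir = /var/log/app/spool
agent.sources.spool.deletePolicy = immediate
agent.sources.spool.channels = c1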
RAM or Disk Channels?
Source: http://blog.scoutapp.com/articles/2011/02/10/understanding-disk-i-o-when-should-you-be-worried
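If you go with disk, a file channel sketch (paths are placeholders — ideally put checkpointDir and dataDirs on drives that aren't already busy):

agent.channels.c1.type = file
agent.channels.c1.checkpointDir = /flume/checkpoint
agent.channels.c1.dataDirs = /flume/data
agent.channels.c1.capacity = 1000000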
Duplicate Events
• Transactions only at Agent level
• You may see Events more than once
• Distributed Transactions are expensive
• Just deal with in query/scrub phase — much less
costly than trying to prevent it from happening
Late Data
• Data could be “late”/delayed
• Outages
• Restarts
• Act of Nature
• Only sure thing is a “database” — single write + ACK
• Depending on your monitoring, it could be REALLY
LATE.
Monitoring
• Know when it breaks so you can fix it before you can’t ingest new data
(and it is lost)
• This time window is small if volume is high
• Flume Monitoring still WIP, but hooks are there
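One hook that already exists is counter reporting over HTTP; for example, starting the agent with these extra flags (the port is arbitrary):

-Dflume.monitoring.type=http -Dflume.monitoring.port=41414

exposes source/channel/sink counters as JSON at /metrics on that port, which can be scraped by whatever monitoring you already have.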
Other Operational Concerns
• resource utilization - number of open files when
writing (file descriptors), disk space used for file
channel, disk contention, disk speed*
• number of inbound and outbound sockets - may
need to tier (Avro Source/Sink)
• minimize hops if possible - another place for data
to get stuck
Not everything is a nail
• Flume is great for handling individual records
• What if you need to compute an average?
• Get a Stream Processing system
• Storm (Twitter’s)
• Samza (LinkedIn’s)
• Others…
• Flume can co-exist with these — use most appropriate tool
Questions?
…and thanks!
Slides @ http://slideshare.net/bacoboy
