This document provides an overview of large scale data ingestion using Apache Flume. It discusses why event streaming with Flume is useful, including its scalability, event routing capabilities, and declarative configuration. It also covers Flume concepts like sources, channels, sinks, and how they connect agents together reliably in a topology. The document dives into specific source, channel, and sink types including examples and configuration details. It also discusses interceptors, channel selectors, sink processors, and ways to integrate Flume into applications using client SDKs and embedded agents.
Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
1. Large Scale Data Ingest Using Apache Flume
Hari Shreedharan
Software Engineer, Cloudera
Apache Flume PMC member / committer
February 2013
2. Why event streaming with Flume is awesome
• Couldn’t I just do this with a shell script?
• What year is this, 2001? There is a better way!
• Scalable collection and aggregation of event data (e.g., logs)
• Dynamic, contextual event routing
• Low latency, high throughput
• Declarative configuration
• Productive out of the box, yet powerfully extensible
• Open source software
3. Lessons learned from Flume OG
• Hard to get predictable performance without decoupling tier impedance
• Hard to scale-out without multiple threads at the sink level
• A lot of functionality doesn’t work well as a decorator
• People need a system that keeps the data flowing when there is
a network partition (or downed host in the critical path)
6. Basic Concepts
• Client
  • Log4j Appender
  • Client SDK
  • Clientless Operation
• Agent
  • Source
  • Channel
  • Sink
• Valid Configuration
  • Must have at least one Channel
  • Must have at least one Source or Sink
  • Any number of Sources
  • Any number of Channels
  • Any number of Sinks
7. Concepts in Action
• Source: Puts events into the Channel
• Sink: Drains events from the Channel
• Channel: Stores the events until drained
8. Flow Reliability
Reliability based on:
• Transactional Exchange between Agents
• Persistence Characteristics of Channels in the Flow
Also Available:
• Built-in Load balancing Support
• Built-in Failover Support
9. Reliability
• Transactional guarantees from channel
• External clients need to handle retries
• Built in avro-client to read streams
• Avro source for multi-hop flows
• Use Flume Client SDK for customization
12. Basic Configuration Rules
• Only the named agents’ configuration is loaded
• Only active components’ configuration is loaded within the agent’s configuration
• Every Agent must have at least one channel
• Every Source must have at least one channel
• Every Sink must have exactly one channel
• Every component must have a type

# Active components
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Define and configure src1
agent1.sources.src1.type = netcat
agent1.sources.src1.channels = ch1
agent1.sources.src1.bind = 127.0.0.1
agent1.sources.src1.port = 10112

# Define and configure sink1
agent1.sinks.sink1.type = logger
agent1.sinks.sink1.channel = ch1

# Define and configure ch1
agent1.channels.ch1.type = memory

# Some other Agent’s configuration
agent2.sources = src1 src2
13. Deployment
Steady state inflow == outflow
4 Tier 1 agents at 100 events/sec (batch-size)
1 Tier 2 agent at 400 eps
14. Source
• Event Driven
• Supports Batch Processing
• Source Types:
• AVRO – RPC source – other Flume agents can send data to this source port
• THRIFT – RPC source (available in next Flume release)
• SPOOLDIR – pick up rotated log files
• HTTP – post to a REST service (extensible)
• JMS – ingest from Java Message Service
• SYSLOGTCP, SYSLOGUDP
• NETCAT
• EXEC
15. How Does a Source Work?
• Reads data from external clients/other sinks
• Stores events in configured channel(s)
• Asynchronous to the other end of channel
• Transactional semantics for storing data
21. RPC Sources – Avro and Thrift
• Reading events from external client
• Only TCP
• Connecting two agents in a distributed flow
• Based on IPC, so failure notification is possible
• Configuration
agent_foo.sources.rpcsource-1.type = avro/thrift
agent_foo.sources.rpcsource-1.bind = <host>
agent_foo.sources.rpcsource-1.port = <port>
22. Spooling Directory Source
• Parses rotated log files out of a “spool” directory
• Watches for new files, renames or deletes them when done
• The files must be immutable before being placed into the
watched directory
agent.sources.spool.type = spooldir
agent.sources.spool.spoolDir = /var/log/spooled-files
agent.sources.spool.deletePolicy = never OR immediate
23. HTTP Source
• Runs a web server that handles HTTP requests
• The handler is pluggable (can roll your own)
• Out of the box, an HTTP client posts a JSON array of events to
the server. Server parses the events and puts them on the
channel.
agent.sources.http.type = http
agent.sources.http.port = 8081
24. HTTP Source, cont’d.
• Default handler supports events that look like this:
[{
  "headers" : {
    "timestamp" : "434324343",
    "host" : "host1.example.com"
  },
  "body" : "arbitrary data in body string"
},
{
  "headers" : {
    "namenode" : "nn01.example.com",
    "datanode" : "dn102.example.com"
  },
  "body" : "some other arbitrary data in body string"
}]
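As an illustration (not part of the original deck), a client could build and POST that event array with a few lines of Python; the agent host name below is a placeholder, and the port matches the HTTP source configuration on the previous slide:

```python
import json
import urllib.request

# Two events in the JSON layout the default HTTPSource handler expects.
events = [
    {"headers": {"timestamp": "434324343", "host": "host1.example.com"},
     "body": "arbitrary data in body string"},
    {"headers": {"namenode": "nn01.example.com",
                 "datanode": "dn102.example.com"},
     "body": "some other arbitrary data in body string"},
]
payload = json.dumps(events).encode("utf-8")

# Build the POST request; "flume-agent.example.com" is a placeholder host.
req = urllib.request.Request(
    "http://flume-agent.example.com:8081",
    data=payload,
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would send it; not executed here because
# the endpoint is hypothetical.
```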
25. Exec Source
• Reads data from the output of a command
• Can be used for ‘tail -F …’
• Doesn’t handle failures
Configuration:
agent_foo.sources.execSource.type = exec
agent_foo.sources.execSource.command = tail -F /var/log/weblog.out
26. JMS Source
• Reads messages from a JMS queue or topic, converts them to Flume events
and puts those events onto the channel.
• Pluggable converter that by default converts Bytes, Text, and Object
messages into Flume Events.
• So far, tested with ActiveMQ. We’d like to hear about experiences with any
other JMS implementations.
agent.sources.jms.type = jms
agent.sources.jms.initialContextFactory =
org.apache.activemq.jndi.ActiveMQInitialContextFactory
agent.sources.jms.providerURL = tcp://mqserver:61616
agent.sources.jms.destinationName = BUSINESS_DATA
agent.sources.jms.destinationType = QUEUE
27. Interceptor
• Applied to Source configuration element
• One source can have many interceptors
• Chain-of-responsibility
• Can be used for tagging, filtering, routing*
• Built-in interceptors:
• TIMESTAMP
• HOST
• STATIC
• REGEX EXTRACTOR
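A sketch of wiring two built-in interceptors onto a source, reusing the agent1/src1 names from the earlier netcat example (the interceptor names ts and hostint are arbitrary labels, not Flume keywords):

agent1.sources.src1.interceptors = ts hostint
agent1.sources.src1.interceptors.ts.type = timestamp
agent1.sources.src1.interceptors.hostint.type = host
agent1.sources.src1.interceptors.hostint.hostHeader = host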
31. Channel
• Passive Component
• Determines the reliability of a flow
• “Stock” channels that ship with Flume
• FILE – provides durability; most people use this
• MEMORY – lower latency for small writes, but not durable
• JDBC – provides full ACID support, but has performance issues
32. File Channel
• Write Ahead Log implementation
• Configuration:
agent1.channels.ch1.type = FILE
agent1.channels.ch1.checkpointDir = <dir>
agent1.channels.ch1.dataDirs = <dir1> <dir2>…
agent1.channels.ch1.capacity = N (100k)
agent1.channels.ch1.transactionCapacity = n
agent1.channels.ch1.checkpointInterval = n (30000)
agent1.channels.ch1.maxFileSize = N (1.52G)
agent1.channels.ch1.write-timeout = n (10s)
agent1.channels.ch1.checkpoint-timeout = n (600s)
33. File Channel
Flume Event Queue
• In-memory representation of the channel
• Maintains a queue of pointers to the data on disk in various log files; reference-counts the log files
• Is memory-mapped to a checkpoint file
Log Files
• On-disk representation of actions (Puts/Takes/Commits/Rollbacks)
• Maintain the actual data
• Log files with 0 references get deleted
34. Sink
• Polling Semantics
• Supports Batch Processing
• Specialized Sinks
• HDFS (Write to HDFS – highly configurable)
• HBASE, ASYNCHBASE (Write to HBase)
• AVRO (IPC Sink – Avro Source as IPC source at next hop)
• THRIFT (IPC Sink – Thrift Source as IPC source at next hop)
• FILE_ROLL (Local disk, roll files based on size, # of events etc)
• NULL, LOGGER (For Testing Purposes)
• ElasticSearch
• IRC
35. HDFS Sink
• Writes events to HDFS (what!)
• Configuring (taken from Flume User Guide):
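The configuration table referenced here is an image in the original deck and is not reproduced. As a sketch, a minimal HDFS sink configuration might look like the following (the path and the roll values are assumptions, not taken from the slide):

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.channel = ch1
agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/events
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.rollInterval = 30
agent1.sinks.sink1.hdfs.rollSize = 134217728
agent1.sinks.sink1.hdfs.rollCount = 0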
36. HDFS Sink
• Supports dynamic directory naming using tags
• Use event headers : %{header}
• Eg: hdfs://namenode/flume/%{header}
• Use the timestamp from the event header
• Several escape sequences are available (%Y, %m, %d, etc.)
• Eg: hdfs://namenode/flume/%{header}/%Y-%m-%d/
• Use roundValue and roundUnit to round down the timestamp and use
separate directories.
• Within a directory – files rolled based on:
• rollInterval – seconds to wait before rolling the current file
• rollSize – max size of the file
• rollCount – max # of events per file
37. AsyncHBase Sink
• Inserts events and increments into HBase
• Writes events asynchronously at a very high rate.
• Easy to configure:
• table
• columnFamily
• batchSize - # events per txn.
• timeout - how long to wait for success callback
• serializer/serializer.* - Custom serializer can decide how and where the events
are written out.
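Put together, a sketch of an AsyncHBase sink configuration (the table and column family names are made up for illustration):

agent1.sinks.hbase1.type = asynchbase
agent1.sinks.hbase1.channel = ch1
agent1.sinks.hbase1.table = flume_events
agent1.sinks.hbase1.columnFamily = data
agent1.sinks.hbase1.batchSize = 100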
38. IPC Sinks (Avro/Thrift)
• Sends events to the next hop’s IPC Source
• Configuring:
• hostname
• port
• batch-size - # events per txn/batch sent to next hop
• request-timeout – how long to wait for success of batch
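These options can be sketched as an Avro sink configuration (the hostname and port are placeholders):

agent1.sinks.avro1.type = avro
agent1.sinks.avro1.channel = ch1
agent1.sinks.avro1.hostname = collector1.example.com
agent1.sinks.avro1.port = 5564
agent1.sinks.avro1.batch-size = 100
agent1.sinks.avro1.request-timeout = 20000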
39. Serializers
• Supported by the HDFS, HBase, and FILE_ROLL sinks
• Convert the event into a format of the user’s choice.
• In the case of HBase, convert an event into Puts and Increments.
40. Sink Group
• Top-level element, needed to declare sink processors
• A sink can be in at most one group at any time
• By default all sinks are in their individual default sink group
• Default sink group is a pass-through
• Deactivating sink-group does not deactivate the sink!!
41. Sink Processor
• Acts as a Sink Proxy
• Can work with multiple Sinks
• Built-in Sink Processors:
• DEFAULT
• FAILOVER
• LOAD_BALANCE
• Applied via Groups!
• A Top-Level Component
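A sketch of declaring a load-balancing sink group over two sinks (this assumes a second sink, sink2, has been defined alongside sink1):

agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = sink1 sink2
agent1.sinkgroups.g1.processor.type = load_balance
agent1.sinkgroups.g1.processor.selector = round_robin
agent1.sinkgroups.g1.processor.backoff = true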
43. Clients: Embedded agent
• More advanced RPC client. Integrates a channel.
• Minimal example:
Map<String, String> properties = new HashMap<String, String>();
properties.put("channel.type", "memory");
properties.put("channel.capacity", "200");
properties.put("sinks", "sink1");
properties.put("sink1.type", "avro");
properties.put("sink1.hostname", "collector1.example.com");
properties.put("sink1.port", "5564");
EmbeddedAgent agent = new EmbeddedAgent("myagent");
agent.configure(properties);
agent.start();
// Build an event to send (EventBuilder is part of flume-ng-core)
Event event = EventBuilder.withBody("test event".getBytes());
List<Event> events = new ArrayList<Event>();
events.add(event);
agent.putAll(events);
agent.stop();
• See Flume Developer Guide for more details and examples.
44. General Caveats
• Reliability = function of channel type, capacity, and system
redundancy
• Carefully size the channels for needed capacity
• Set batch sizes based on projected drain requirements
• Number of cores should be ½ the total # of sources & sinks combined in an agent
46. Summary
• Clients send Events to Agents
• Each agent hosts Flume components: Source, Interceptors, Channel
Selectors, Channels, Sink Processors & Sinks
• Sources & Sinks are active components, Channels are passive
• Source accepts Events, passes them through Interceptor(s), and if not
filtered, puts them on channel(s) selected by the configured Channel
Selector
• Sink Processor identifies a sink to invoke, which can take Events from a
Channel and send them to the next-hop destination
• Channel operations are transactional to guarantee one-hop delivery
semantics
• Channel persistence provides end-to-end reliability
47. Reference docs (1.3.1 release)
User Guide:
flume.apache.org/FlumeUserGuide.html
Dev Guide:
flume.apache.org/FlumeDeveloperGuide.html
48. Blog posts
• Flume performance tuning
https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1
• Flume and Hbase
https://blogs.apache.org/flume/entry/streaming_data_into_apache_hbase
• File Channel Innards
https://blogs.apache.org/flume/entry/apache_flume_filechannel
• Architecture of Flume NG
https://blogs.apache.org/flume/entry/flume_ng_architecture
49. Contributing: How to get involved!
• Join the mailing lists:
• user-subscribe@flume.apache.org
• dev-subscribe@flume.apache.org
• Look at the code
• github.com/apache/flume – Mirror of the Apache Flume git repo
• File or fix a JIRA
• issues.apache.org/jira/browse/FLUME
• More on how to contribute:
• cwiki.apache.org/confluence/display/FLUME/How+to+Contribute
51. Thank you
Reach out on the mailing lists!
Follow me on Twitter: @harisr1234
Speaker Notes
If you have a server farm that emits log data in GB/min, you could hack together a very simple aggregator, but chances are it won't provide reliability, manageability, or scalability. This is why many use Flume: an out-of-the-box, open-source, high-performing, reliable, and scalable aggregator for streaming data. You don't want to risk outages or failing scripts causing an overload on spindles. Flume is declarative in that you don't have to write code. Flume is extensible in that you can write your own components on top of Flume, which modify its behavior and feature set. Flume has one-hop delivery; if you want end-to-end reliability, use the file channel, covered later. There are no acknowledgements from the terminal destination to the client, because the client would then be forced to hold all events until the ack is received. You want these systems to occupy a smaller disk footprint. Set up redundant flows if you're concerned about hardware failures; Flume doesn't support splicing or RAID out of the box.
With Flume NG, there is built-in buffering capacity at every hop, so data and events are preserved. For single-hop reliability, the degree of reliability depends on the channel: the memory channel and recoverable memory channel are best-effort, whereas the file channel and JDBC channel are reliable because they write to disk. OG: a garden hose connected from faucet to sprinkler; contiguous flow except when you pinch the hose in the middle. NG: the hose connects multiple water tanks (i.e. channels/passive buffers) from faucet to sprinkler; if you pinch the hose, the flow doesn't stop. 1. decouples impedance between producers and consumers; 2. dynamic routing capabilities (can shut down one tank to re-route traffic); 3. unrestricted capacity (a consumer's input is no longer restricted by a producer's output, as one tank can feed multiple downstream tanks).
Flume flow: the simplest individual component is the agent; agents can talk to each other and to HDFS, HBase, etc. Clients talk to agents.
Clientless operation – the agent loads data itself using specialized sources. An agent is a collection of sources, channels, and sinks. A source captures events from outside; only the exec source can generate events on its own. A channel is the buffer between source and sink. A sink has the responsibility of draining the channel out to another agent or a terminal point like HDFS. You can't have a source with no place to write events.
In the upper diagram, the 3 agents' flow is healthy. In the lower diagram, a sink fails to communicate with the downstream source, so the reservoir fills up and the fill-up cascades upstream, buffering against downstream hardware failures. No events are lost until all channels in that flow fill up, at which point the sources report failure to the client. Steady-state flow is restored when the link becomes active.
WHAT MAKES IT ACTIVE? Src2 is inactive because it's not in the active set. Define multiple sources for the same agent with space-separated lists. Fan out: a source can write to two channels. Multiple sinks can drain the same channel for increased throughput. The channel is implemented as a queue: the source appends data to the tail and the sink drains from the head. The config file is checked at startup, and changes are checked every 30 seconds – you don't have to restart agents when the config file changes. What use case would need multiple sinks draining the same channel? Sources are multi-threaded and greedily implemented (for improved throughput); sinks are single-threaded with a fixed capacity they can drain – an impedance mismatch between sources and sinks. Sources will expand to accommodate load and bursty traffic so downstream isn't affected; sinks drain steadily. Add another sink to the same channel to meet the steady-state requirement.
Four tier-1 agents drain into one tier-2 agent, which then distributes its load over two tier-3 agents. You can have a single config file for all three agent tiers, pass it around your deployment, and you're done. At any node, the ingest rate must equal the exit rate.
Avro is standard. Channels support transactions. Flume sources: avro, exec, syslog, spooling directory, http, embedded agent, JMS.
Transactional semantics for storing data: if a sink takes data out, it commits only once the source on the next hop has committed the data.
Use cases: the same data needs to go into both HDFS and HBase; priority-based routing; any contextual routing.
JMS – client talks to broker, which handles failures
With Avro, once the source commits the events to its channel via a put transaction, it sends a success message to the previous hop, and the sink on the previous hop deletes those events once it commits the take transaction.
Takes a command as a config parameter and executes it; whatever the command writes to stdout is written as events to the channel. If the channel is full, data is dropped and lost. During file rotation, if an event fails, data is lost.
An interceptor is a transparent component applied to the flow that can do filtering and minor modification of events, but it can't multiply events – e.g. it can't decompress an event, because batching and compression are framework-level concerns that Flume should address. The number of events emitted by an interceptor cannot exceed the number that came in – you can drop events but can't add them (which would exceed the transaction capacity).
An interceptor never returns null, because its output is passed to the next interceptor or to the channel.
The file channel is the recommended channel: it is reliable (no data loss in an outage), scales linearly with additional spindles (more disks, better performance), and has better durability guarantees than the memory channel. The memory channel can't scale to large capacity because it is bound by memory. JDBC is not recommended due to slow performance.
Recommended to use three disks: one for checkpointing and two for data. Keep-alive: wait 3 seconds for blocks to free up – usually only needed in high-stress environments.
Three files: the checkpoint file (memory-mapped by the flume event queue), log1, and log2. The checkpoint file is the FEQ. If you lose the FEQ you don't lose data, since it's in the log files, but it takes a long time to remap the data into memory. The channel's main operations are done on top of the flume event queue, a queue of pointers into different locations in different log files. The FEQ is the queue of active data within the file channel and contains the reference counts of files. Each log file contains its own metadata – it is a write-ahead log, not a direct serialization of data. The FEQ doesn't store data, so the size of your events doesn't impact the FEQ.
Polling semantics – the sink continually polls to see if events are available. The AsyncHBase sink is recommended over the HBase sink (which uses the synchronous HBase API) for better performance. The null sink drops events on the floor.
Groups active sinks together and then adds a processor. LOAD_BALANCE ships with round-robin and random distribution plus back-off, but you can write your own selection algorithm and plug it into the sink processor. FAILOVER supports round-robin, random, and back-off (it won't try a failed sink until the back-off period is over).
Interface that exposes isActive, which can be used for testing. This is a way of getting data into Flume: a client can talk to Flume's avro/thrift source.