Flume is an open-source, distributed, streaming log collection system designed for ingesting large quantities of data into large-scale data storage and analytics platforms such as Apache Hadoop. It was designed with four goals in mind: reliability, scalability, extensibility, and manageability. Its horizontally scalable architecture offers fault-tolerant end-to-end delivery guarantees, supports low-latency event processing, provides a centralized management interface, and exposes metrics for ingest monitoring and reporting. It natively supports writing data to Hadoop's HDFS but also has a simple extension interface that allows it to write to other scalable data systems such as low-latency datastores or incremental search indexers.
Chicago Data Summit: Flume: An Introduction
2. Flume Logging for the Enterprise Jonathan Hsieh, Henry Robinson, Patrick Hunt, Eric Sammer Cloudera, Inc Chicago Data Summit, 4/26/11
3. Who Am I? Cloudera: Software Engineer on the Platform Team Flume Project Lead / Designer / Architect U of Washington: “On Leave” from PhD program Research in Systems and Programming Languages Previously: Computer Security, Embedded Systems. 3 Jonathan Hsieh, Chicago Data Summit 4/26/2011
4. An Enterprise Scenario You have a bunch of departments with servers generating log files. You are required to keep logs, and you want to analyze and profit from them. Because of the volume of raw data, you’ve started using Cloudera’s Distribution including Apache Hadoop… and you’ve got several ad-hoc, legacy scripts/systems that copy data from servers/filers and then to HDFS. It’s log, log… Everyone wants a log!
5. Ad-hoc gets complicated Black box? What happens if the person who wrote it leaves? Unextensible? Is it a one-off, or flexible enough to handle future needs? Unmanageable? Do you know when something goes wrong? Unreliable? If something goes wrong, will it recover? Unscalable? Hit an ingestion rate limit?
6. Cloudera Flume Flume is a framework and conduit for collecting and quickly shipping data records from many sources to one centralized place for storage and processing. Project Goals: Scalability, Reliability, Extensibility, Manageability, Openness.
7. The Canonical Use Case (diagram, built up over slides 7-10): Flume agents run on each server in the agent tier and send events to a collector tier; collectors aggregate the streams and write them into HDFS, while a Flume Master configures and controls both tiers.
11. Flume’s Key Abstractions The data path and the control path are separate. Nodes are in the data path: each node has a source and a sink, and nodes can take different roles. A typical topology has agent nodes and collector nodes, and optionally processor nodes. Masters are in the control path: a centralized point of configuration that specifies sources and sinks and can control flows of data between nodes. Use one master, or use many with a ZK-backed quorum.
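The node abstraction maps directly onto the configuration language shown later in the deck. A minimal sketch of an agent-to-collector flow in the 0.9.x DSL; the collector hostname and the port (35853, which I believe was the conventional default) are illustrative assumptions:

```
agent1: tail("/var/log/app/log") | agentSink("collector1", 35853);
collector1: collectorSource(35853) | collectorSink("hdfs://namenode/logs/", "data");
```

Each line binds a logical node name to a source | sink pair; the master pushes these mappings out to the nodes.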
13. Outline What is Flume? Scalability Horizontal scalability of all nodes and masters Reliability Fault-tolerance and High availability Extensibility Unix principle, all kinds of data, all kinds of sources, all kinds of sinks Manageability Centralized management supporting dynamic reconfiguration Openness Apache v2.0 License and an active and growing community
17. Node scalability limits and optimization plans In most deployments today, a single collector is not saturated. The current implementation can write at 20MB/s over 1GbE (20 MB/s × 86,400 s/day ≈ 1.75 TB/day) due to unoptimized network usage. Assuming 1GbE, with aggregate disk able to write at close to GbE rate, we can probably reach: 3-5x by batching to get to the wire/disk limit (trading latency for throughput), and 5-10x by compression, trading CPU for throughput (logs are highly compressible). The limit is probably in the ballpark of 40 effective TB/day/collector.
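The batching and compression levers above are exposed as decorators in the 0.9.x DSL. A hedged sketch (the decorator names batch/gzip and their inverses are assumed from the 0.9 decorator set; the collector hostname is a placeholder):

```
agent1: tail("/var/log/app/log") | { batch(100) => { gzip => agentE2ESink("collector1") } };
collector1: collectorSource | { gunzip => { unbatch => collectorSink("hdfs://namenode/logs/", "data") } };
```

batch(100) ships 100 events per append, trading latency for throughput; gzip trades CPU for wire throughput, which pays off because log data compresses well.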
18. Control plane is horizontally scalable A master controls the dynamic configurations of nodes. It uses a consensus protocol to keep state consistent, scales well for configuration reads, and allows for adaptive repartitioning in the future. Nodes can talk to any master, and masters can talk to an existing ZK ensemble.
20. Failures Faults can happen at many levels Software applications can fail Machines can fail Networking gear can fail Excessive networking congestion or machine load A node goes down for maintenance. How do we make sure that events make it to a permanent store?
21. Tunable failure recovery modes Best effort: fire and forget. Store on failure + retry: the agent writes to disk on detected failure, uses one-hop TCP acks, and fails over when faults are detected. End-to-end reliability: a write-ahead log on the agent, checksums, and end-to-end acks; data survives compound failures and may be retried multiple times.
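These three modes correspond to the three agent sinks of the 0.9.x configuration language; a sketch where the collector hostname is a placeholder:

```
agent1: tail("/var/log/app/log") | agentBESink("collector1");
agent2: tail("/var/log/app/log") | agentDFOSink("collector1");
agent3: tail("/var/log/app/log") | agentE2ESink("collector1");
```

agentBESink is best effort, agentDFOSink stores to disk and retries on failure, and agentE2ESink adds the write-ahead log and end-to-end acks.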
23. Use randomization to pre-specify failovers when many collectors exist. This spreads load if a collector goes down, and spreads load if new collectors are added to the system.
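A hedged sketch of a pre-specified failover chain in the 0.9.x DSL, where < primary ? backup > fails over from one sink to the next (collector names are placeholders); the auto*Sink machinery shown later generates randomized chains of this shape automatically:

```
agent1: tail("/var/log/app/log") | < agentE2ESink("collectorA") ? < agentE2ESink("collectorB") ? agentE2ESink("collectorC") > >;
```

Giving each agent a differently ordered chain is what spreads failover load across the surviving collectors.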
26. Control plane is Fault Tolerant A master controls the dynamic configurations of nodes. It uses a consensus protocol to keep state consistent, scales well for configuration reads, and allows for adaptive repartitioning in the future. Nodes can talk to any master, and masters can talk to an existing ZK ensemble.
30. Flume is easy to extend Simple source and sink APIs. An event streaming design: many simple operations compose into complex behavior. Plug-in architecture, so you can add your own sources, sinks, and decorators.
31. Variety of Connectors Sources produce data: console, exec, syslog, Scribe, IRC, Twitter; in the works: JMS, AMQP, pubsubhubbub/RSS/Atom. Sinks consume data: console, local files, HDFS, S3; contributed: Hive (Mozilla), HBase (Sematext), Cassandra (Riptano/DataStax), Voldemort, Elastic Search; in the works: JMS, AMQP. Decorators modify data sent to sinks: wire batching, compression, sampling, projection, extraction, throughput throttling; custom near real-time processing (Meebo); JRuby event modifiers (InfoChimps); cryptographic extensions (Rearden); FlumeBase, a streaming SQL in-stream-analytics system (Aaron Kimball).
32. Migrating previous enterprise architecture (diagram): Flume bridges existing infrastructure into HDFS; agents poll existing filers, consume from a message bus via an AMQP source, and receive from custom apps via an Avro source, all feeding collectors that write to HDFS.
33. Data ingestion pipeline pattern (diagram): a collector fans out each event to HDFS, HBase, and an incremental search index; the same data then serves Hive and Pig queries, key lookups, range queries, and search/faceted queries.
35. Configuring Flume Node: tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ] ; A concise and precise configuration language for specifying dataflows in a node. Dynamic updates of configurations allow for live failover changes, for handling newly provisioned machines, and for changing analytics.
36. Output bucketing Automatic output file management: write HDFS files into time-based bucket directories, where the %Y/%m%d/%H00 escapes are expanded from each event's timestamp. Collector node: collectorSource | collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data") /logs/web/2010/0715/1200/data-xxx.txt /logs/web/2010/0715/1200/data-xxy.txt /logs/web/2010/0715/1300/data-xxx.txt /logs/web/2010/0715/1300/data-xxy.txt /logs/web/2010/0715/1400/data-xxx.txt …
37. Configuration is straightforward node001: tail("/var/log/app/log") | autoE2ESink; node002: tail("/var/log/app/log") | autoE2ESink; … node100: tail("/var/log/app/log") | autoE2ESink; collector1: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs"); collector2: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs"); collector3: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs");
38. Centralized Dataflow Management Interfaces One place to specify node sources, sinks, and data flows. Basic web interface. Flume Shell: a scriptable command-line interface. Cloudera Enterprise Flume Monitor App: a graphical web interface.
39. Enterprise Friendly Integrated as part of CDH3 and Cloudera Enterprise RPM and DEB packaging for enterprise Linux Flume Node for Windows (beta) Cloudera Enterprise Support 24-7 Support SLAs Professional Services Cloudera Flume Features for Enterprises Kerberos Authentication support for writing to “secure” HDFS Detailed JSON-exposed metrics for monitoring integration (beta) Log4J collection (beta) High Availability via Multiple Master (alpha) Encrypted SSL / TLS data path and control path support (dev)
40. An enterprise story (diagram): department servers (Windows and Linux) run Flume agents alongside their API servers; agents feed a collector tier that writes to a Kerberos-secured HDFS, with authentication backed by Active Directory / LDAP.
42. Flume is Open Source Apache v2.0 Open Source License, independent from the Apache Software Foundation. You have the right to fork or modify the software. GitHub source code repository: http://github.com/cloudera/flume Regular tarball update versions every 2-3 months. Regular CDH packaging updates every 3-4 months. Always looking for contributors and committers.
48.-49. Multi Datacenter (diagram, two variants): agents on API servers and processor servers feed per-datacenter collector tiers; in the second variant a relay forwards events between datacenters to the collector tier that writes to HDFS.
50. Near Real-time Aggregator (diagram): agents on ad servers feed a collector; a tracker produces quick reports into a DB, while the same data lands in HDFS and a Hive job verifies the reports.
51. Community Support Community-based mailing lists for support “an answer in a few days” User: https://groups.google.com/a/cloudera.org/group/flume-user Dev: https://groups.google.com/a/cloudera.org/group/flume-dev Community-based IRC chat room “quick questions, quick answers” #flume in irc.freenode.net
53. Summary Flume is a distributed, reliable, scalable, extensible system for collecting and delivering high-volume continuous event data such as logs. It is centrally managed, which allows for automated and adaptive configurations. This design allows for near-real-time processing. Apache v2.0 License with an active and growing community. Part of Cloudera’s Distribution including Apache Hadoop, updated for CDH3u0 and Cloudera Enterprise. Several CDH users in the community have it in production use. Several Cloudera Enterprise customers are evaluating it for production use.
54. Related systems Remote syslog (syslog-ng / rsyslog / syslog): best effort; if the server is down, messages are lost. Chukwa (Yahoo! / Apache Incubator): designed as a monitoring system for Hadoop; minibatches, requiring MapReduce batch processing to demultiplex data; new HBase-dependent path; one of the core contributors (Ari) currently works at Cloudera (not on Chukwa). Scribe (Facebook): only durable-on-failure reliability mechanisms; the collector disk is the bottleneck; little visibility into system performance; little support or documentation; most Scribe deploys replaced by “Data Freeway”. Kafka (LinkedIn): new system by LinkedIn; pull model; interesting, written in Scala.
55. Questions? Contact info: jon@cloudera.com Twitter @jmhsieh
57. Flow Isolation Isolate different kinds of data when and where it is generated: have multiple logical nodes on a machine, each with its own data source and its own data sink. (Diagram, over two slides: separate logical agent-to-collector flows sharing the same physical machines.)
60. Master Service Failures A master machine should not be a single point of failure! Masters keep two kinds of information. Configuration information (node/flow configuration): kept in a ZooKeeper ensemble as a persistent, highly available metadata store; failures are easily recovered from. Ephemeral information (heartbeat info, acks, metrics reports): kept in memory; failures will lose this data, but it can be lazily replicated.
61. Dealing with Agent failures We do not want to lose data, so events are made durable at the generation point. If a log generator goes down, it is not generating logs; if the event generation point fails and recovers, data will reach the end point. Data is durable and survives machine crashes and reboots. This allows for synchronous writes in log-generating applications. A watchdog program restarts the agent if it fails.