Flume: Logging for the Enterprise
  Jonathan Hsieh, Henry Robinson, Patrick Hunt, Eric Sammer
  Cloudera, Inc.
  Chicago Data Summit, 4/26/11
Who Am I?
  Cloudera: Software Engineer on the Platform Team
    Flume Project Lead / Designer / Architect
  U of Washington: “On Leave” from PhD program
    Research in Systems and Programming Languages
  Previously: Computer Security, Embedded Systems
  Jonathan Hsieh, Chicago Data Summit 4/26/2011
An Enterprise Scenario
  You have a bunch of departments with servers generating log files.
  You are required to keep logs and want to analyze and profit from them.
  Because of the volume of uncooked data, you’ve started using Cloudera’s Distribution including Apache Hadoop.
  … and you’ve got several ad-hoc, legacy scripts/systems that copy data from servers/filers to HDFS.
  It’s log, log … Everyone wants a log!
Ad-hoc gets complicated
  Black box? What happens if the person who wrote it leaves?
  Unextensible? Is it a one-off, or flexible enough to handle future needs?
  Unmanageable? Do you know when something goes wrong?
  Unreliable? If something goes wrong, will it recover?
  Unscalable? Have you hit an ingestion rate limit?
Cloudera Flume
  Flume is a framework and conduit for collecting and quickly shipping data records from many sources to one centralized place for storage and processing.
  Project Goals:
    Scalability
    Reliability
    Extensibility
    Manageability
    Openness
The Canonical Use Case
  [Diagram: agents running on servers form the agent tier; groups of agents send to collectors in the collector tier; the collectors write to HDFS; a Flume master is shown alongside the data path.]
Flume’s Key Abstractions
  Data path and control path
  Nodes are in the data path.
    Nodes have a source and a sink.
    They can take different roles.
      A typical topology has agent nodes and collector nodes.
      Optionally it has processor nodes.
  Masters are in the control path.
    Centralized point of configuration.
    Specify sources and sinks.
    Can control flows of data between nodes.
    Use one master, or use many with a ZK-backed quorum.
  [Diagram: an agent node and a collector node, each with a source and a sink, alongside a master.]
Outline
  What is Flume?
  Scalability: horizontal scalability of all nodes and masters
  Reliability: fault tolerance and high availability
  Extensibility: Unix principle; all kinds of data, all kinds of sources, all kinds of sinks
  Manageability: centralized management supporting dynamic reconfiguration
  Openness: Apache v2.0 License and an active and growing community
Scalability
Data path is horizontally scalable
  Add collectors to increase availability and to handle more data.
    Assumes a single agent will not dominate a collector.
    Fewer connections to HDFS, which tax the resource-constrained NameNode.
    Larger, more efficient writes to HDFS and fewer files avoid the “small file problem”.
    Simplifies the security story when supporting Kerberized HDFS or protected production servers.
  Write the log locally to avoid a collector disk IO bottleneck and catastrophic failures.
  Compression and batching (trade CPU for network).
  Push computation into the event collection pipeline (balance IO, memory, and CPU resource bottlenecks).
  [Diagram: agents on servers sending to a collector that writes to HDFS.]
Node scalability limits and optimization plans
  In most deployments today, a single collector is not saturated.
  The current implementation can write at 20 MB/s over 1GbE (~1.75 TB/day) due to unoptimized network usage.
  Assuming 1GbE with aggregate disk able to write at close to GbE rate, we can probably reach:
    3-5x by batching to get to the wire/disk limit (trade latency for throughput)
    5-10x by compression to trade CPU for throughput (logs are highly compressible)
  The limit is probably in the ballpark of 40 effective TB/day/collector.
  [Diagram: agents on servers sending to a collector that writes to HDFS.]
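The throughput figures on this slide can be checked with a little arithmetic. The sketch below is a back-of-envelope calculation only; it assumes decimal units (1 TB = 1,000,000 MB) and a 1GbE wire limit of 125 MB/s, and the helper name `tb_per_day` is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tb_per_day(mb_per_s):
    """Convert MB/s to TB/day, assuming decimal units (1 TB = 1,000,000 MB)."""
    return mb_per_s * SECONDS_PER_DAY / 1_000_000


current = tb_per_day(20)       # today's measured collector rate
wire_limit = tb_per_day(125)   # assumed 1GbE ceiling: 1 Gbit/s = 125 MB/s

# Batching: 3-5x the current rate, capped at the wire limit.
batched = (min(3 * current, wire_limit), min(5 * current, wire_limit))

# Compression: a further 5-10x in "effective" (uncompressed) volume.
effective = (5 * batched[0], 10 * batched[1])

print(f"current:   {current:.2f} TB/day")
print(f"batched:   {batched[0]:.1f} to {batched[1]:.1f} TB/day")
print(f"effective: {effective[0]:.0f} to {effective[1]:.0f} TB/day")
```

Under these assumptions the current rate comes out just under the ~1.75 TB/day quoted, and the 40 effective TB/day estimate falls inside the computed range.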
Control plane is horizontally scalable
  A master controls dynamic configurations of nodes.
    Uses a consensus protocol to keep state consistent.
    Scales well for configuration reads.
    Allows for adaptive repartitioning in the future.
  Nodes can talk to any master.
  Masters can talk to an existing ZK ensemble.
  [Diagram: three masters backed by a ZooKeeper ensemble (ZK1, ZK2, ZK3), with nodes connected to the masters.]
Reliability
Failures
  Faults can happen at many levels:
    Software applications can fail.
    Machines can fail.
    Networking gear can fail.
    Excessive networking congestion or machine load.
    A node goes down for maintenance.
  How do we make sure that events make it to a permanent store?
Tunable failure recovery modes
  Best effort
    Fire and forget.
  Store on failure + retry
    Writes to disk on detected failure.
    One-hop TCP acks.
    Failover when faults are detected.
  End-to-end reliability
    Write-ahead log on the agent.
    Checksums and end-to-end acks.
    Data survives compound failures, and may be retried multiple times.
  [Diagram: for each mode, an agent sending to a collector that writes to HDFS.]
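The end-to-end mode can be illustrated with a small in-memory model. This is only a sketch of the idea (write-ahead log on the agent, checksum per batch, ack once the collector has stored the data); the `Agent` and `Collector` classes are hypothetical and are not Flume's actual implementation or API.

```python
import zlib
from collections import OrderedDict


class Agent:
    """Hypothetical agent: keeps every batch in a write-ahead log until it is acked."""

    def __init__(self):
        self.wal = OrderedDict()  # checksum -> batch (stand-in for an on-disk log)

    def send(self, batch, collector):
        tag = zlib.crc32(b"".join(batch))
        self.wal[tag] = batch           # log first, then send
        collector.receive(tag, batch)

    def on_ack(self, tag):
        self.wal.pop(tag, None)         # safe to forget once stored downstream

    def retry_pending(self, collector):
        for tag, batch in list(self.wal.items()):
            collector.receive(tag, batch)


class Collector:
    """Hypothetical collector: verifies the checksum, stores, then acks the agent."""

    def __init__(self, agent, drop_first=0):
        self.agent = agent
        self.stored = []                # stand-in for a durable write to HDFS
        self.drops_left = drop_first    # simulate that many lost deliveries

    def receive(self, tag, batch):
        if self.drops_left > 0:
            self.drops_left -= 1
            return                      # lost in transit: no ack is sent
        if zlib.crc32(b"".join(batch)) != tag:
            return                      # corrupted: no ack, agent will retry
        self.stored.append(batch)
        self.agent.on_ack(tag)


agent = Agent()
collector = Collector(agent, drop_first=1)
agent.send([b"event-1", b"event-2"], collector)
pending_after_send = len(agent.wal)     # first delivery was dropped
agent.retry_pending(collector)
pending_after_retry = len(agent.wal)
```

In this model the batch stays in the agent's log after the dropped delivery and is removed only after the retry is acknowledged.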
Load balancing and collector failover
  Use randomization to pre-specify failovers when many collectors exist.
    Spread load if a collector goes down.
    Spread load if new collectors are added to the system.
  [Diagram: agents sending to three collectors.]
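One way to picture randomized, pre-specified failover is to give each agent its own shuffled ordering of the collectors and have it use the first one that is alive. The sketch below assumes that interpretation; `failover_chain` and `pick_collector` are hypothetical helpers, not Flume functions.

```python
import random


def failover_chain(agent, collectors, seed=0):
    """Return a per-agent pseudo-random ordering of the collectors.

    Seeding with the agent name makes the chain reproducible, so it can be
    computed ahead of time instead of being negotiated at failure time.
    """
    rng = random.Random(f"{seed}:{agent}")
    chain = list(collectors)
    rng.shuffle(chain)
    return chain


def pick_collector(chain, alive):
    """Return the first collector in the chain that is currently alive."""
    for collector in chain:
        if collector in alive:
            return collector
    return None


collectors = ["collector1", "collector2", "collector3"]
chain = failover_chain("node001", collectors)
primary = pick_collector(chain, alive=set(collectors))
fallback = pick_collector(chain, alive=set(collectors) - {primary})
```

Because different agents get different orderings, the agents of a failed collector tend to spread across the survivors rather than all moving to the same one.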
Control plane is fault tolerant
  A master controls dynamic configurations of nodes.
    Uses a consensus protocol to keep state consistent.
    Scales well for configuration reads.
    Allows for adaptive repartitioning in the future.
  Nodes can talk to any master.
  Masters can talk to an existing ZK ensemble.
  [Diagram: three masters backed by a ZooKeeper ensemble (ZK1, ZK2, ZK3); nodes connect to any of the masters.]
Extensibility
Flume is easy to extend
  Simple source and sink APIs
    An event streaming design
    Many simple operations compose for complex behavior
  Plug-in architecture so you can add your own sources, sinks, and decorators
  [Diagram: sources feeding decorators, a fanout, and sinks.]
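The composition idea (small sources, sinks, and decorators chained into a pipeline) can be sketched in a few lines. The class names below are invented for illustration and do not mirror Flume's Java plug-in API; a decorator here is simply a sink that wraps another sink.

```python
class ListSink:
    """Terminal sink that keeps events in memory (stand-in for console, HDFS, ...)."""

    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)


class Prefix:
    """Decorator: rewrites each event, then forwards it to the wrapped sink."""

    def __init__(self, prefix, sink):
        self.prefix, self.sink = prefix, sink

    def append(self, event):
        self.sink.append(self.prefix + event)


class Sample:
    """Decorator: forwards only every n-th event to the wrapped sink."""

    def __init__(self, n, sink):
        self.n, self.sink, self.seen = n, sink, 0

    def append(self, event):
        self.seen += 1
        if self.seen % self.n == 0:
            self.sink.append(event)


class Fanout:
    """Sink that copies every event to each of its child sinks."""

    def __init__(self, *sinks):
        self.sinks = sinks

    def append(self, event):
        for sink in self.sinks:
            sink.append(event)


def run(source, sink):
    """Drain an iterable source into a sink."""
    for event in source:
        sink.append(event)


full, sampled = ListSink(), ListSink()
run(["a", "b", "c", "d"], Fanout(Prefix("web: ", full), Sample(2, sampled)))
```

Each piece does one thing, and the behavior of the whole pipeline comes from how the pieces are nested.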
Variety of Connectors
  Sources produce data
    Console, Exec, Syslog, Scribe, IRC, Twitter
    In the works: JMS, AMQP, pubsubhubbub/RSS/Atom
  Sinks consume data
    Console, local files, HDFS, S3
    Contributed: Hive (Mozilla), HBase (Sematext), Cassandra (Riptano/DataStax), Voldemort, Elastic Search
    In the works: JMS, AMQP
  Decorators modify data sent to sinks
    Wire batching, compression, sampling, projection, extraction, throughput throttling
    Custom near real-time processing (Meebo)
    JRuby event modifiers (InfoChimps)
    Cryptographic extensions (Rearden)
    Streaming SQL in-stream analytics system FlumeBase (Aaron Kimball)
Migrating previous enterprise architecture
  [Diagram: three existing data paths moved onto Flume: a filer read by a poller agent, a message bus read by an AMQP agent, and a custom app sending via Avro to an agent; each agent sends through a collector to HDFS.]
Data ingestion pipeline pattern
  [Diagram: agents on servers send to a collector whose fanout writes to an incremental search index, HBase, and HDFS. Query types shown: Hive query, Pig query, key lookup, range query, search query, faceted query.]
Manageability
  Wheeeeee!
Configuring Flume
  Node: tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ] ;
  A concise and precise configuration language for specifying dataflows in a node.
  Dynamic updates of configurations
    Allows for live failover changes
    Allows for handling newly provisioned machines
    Allows for changing analytics
  [Diagram: tail feeds filter, which feeds a fanout to console and to roll, which writes to hdfs.]
Output bucketing
  Automatic output file management
    Write HDFS files into buckets based on time tags
  Collector node: collectorSource | collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data")
  HDFS output paths:
    /logs/web/2010/0715/1200/data-xxx.txt
    /logs/web/2010/0715/1200/data-xxy.txt
    /logs/web/2010/0715/1300/data-xxx.txt
    /logs/web/2010/0715/1300/data-xxy.txt
    /logs/web/2010/0715/1400/data-xxx.txt
    …
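The way a time-tagged path template turns into hourly buckets can be shown with Python's `strftime`, whose `%Y`, `%m`, `%d`, and `%H` codes happen to match the escapes used on this slide. This is only an analogy, under the assumption that the escapes mean year, month, day, and hour; Flume's own escape handling is separate, and `bucket_path` is a hypothetical helper.

```python
from datetime import datetime


def bucket_path(template, event_time):
    """Expand the time escapes in a path template for one event timestamp."""
    return event_time.strftime(template)


template = "/logs/web/%Y/%m%d/%H00"
path_noon = bucket_path(template, datetime(2010, 7, 15, 12, 34))
path_one = bucket_path(template, datetime(2010, 7, 15, 13, 5))
print(path_noon)  # /logs/web/2010/0715/1200
print(path_one)   # /logs/web/2010/0715/1300
```

Events from the same hour expand to the same directory, so they land in the same bucket; the file names inside each bucket (the `data-xxx.txt` part) are generated separately.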
Configuration is straightforward
  node001: tail("/var/log/app/log") | autoE2ESink;
  node002: tail("/var/log/app/log") | autoE2ESink;
  …
  node100: tail("/var/log/app/log") | autoE2ESink;
  collector1: autoCollectorSource | collectorSink("hdfs://logs/app/","applogs")
  collector2: autoCollectorSource | collectorSink("hdfs://logs/app/","applogs")
  collector3: autoCollectorSource | collectorSink("hdfs://logs/app/","applogs")
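Because every agent line on this slide has the same shape, a configuration like this is easy to generate. The sketch below simply reproduces the lines shown above as strings; the helper names are hypothetical, and how the result is submitted to the master is out of scope here.

```python
def agent_lines(count, log_path="/var/log/app/log"):
    """Build one tail-to-autoE2ESink line per agent node, numbered node001..nodeNNN."""
    return [
        f'node{i:03d}: tail("{log_path}") | autoE2ESink;'
        for i in range(1, count + 1)
    ]


def collector_lines(count, target="hdfs://logs/app/", prefix="applogs"):
    """Build one autoCollectorSource-to-collectorSink line per collector."""
    return [
        f'collector{i}: autoCollectorSource | collectorSink("{target}","{prefix}")'
        for i in range(1, count + 1)
    ]


config = agent_lines(100) + collector_lines(3)
print("\n".join(config[:2] + ["…"] + config[-3:]))
```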
Centralized Dataflow Management Interfaces
  One place to specify node sources, sinks, and data flows.
  Basic web interface
  Flume Shell
    Command line interface
    Scriptable
  Cloudera Enterprise Flume Monitor App
    Graphical web interface
Enterprise Friendly
  Integrated as part of CDH3 and Cloudera Enterprise
    RPM and DEB packaging for enterprise Linux
    Flume Node for Windows (beta)
  Cloudera Enterprise Support
    24-7 Support SLAs
    Professional Services
  Cloudera Flume Features for Enterprises
    Kerberos authentication support for writing to “secure” HDFS
    Detailed JSON-exposed metrics for monitoring integration (beta)
    Log4J collection (beta)
    High availability via multiple masters (alpha)
    Encrypted SSL/TLS data path and control path support (dev)
An enterprise story
  [Diagram: agents on department servers (Windows and Linux API servers) send to a Flume collector tier that writes to HDFS; Kerberos and Active Directory / LDAP are shown alongside.]
Openness And Community
Flume is Open Source
  Apache v2.0 Open Source License
    Independent from the Apache Software Foundation
    You have the right to fork or modify the software
  GitHub source code repository
    http://github.com/cloudera/flume
  Regular tarball update versions every 2-3 months.
  Regular CDH packaging updates every 3-4 months.
  Always looking for contributors and committers
Growing user and developer community
  Lots of innovation comes from the community
  Community folks are willing to try incomplete features
  Early feedback and community fixes
  Many interesting topologies in the community
Multi Datacenter
  [Diagram: agents on API servers and processor servers send to a collector tier that writes to HDFS; a relay sits between some of the agents and the collector tier.]
Near Real-time Aggregator
  [Diagram: Flume agents on ad servers send to a collector; a tracker and DB produce quick reports, and a Hive job over HDFS verifies the reports.]

Mais conteúdo relacionado

Mais procurados

Considerations when implementing_ha_in_dmf
Considerations when implementing_ha_in_dmfConsiderations when implementing_ha_in_dmf
Considerations when implementing_ha_in_dmfhik_lhz
 
Operating and supporting HBase Clusters
Operating and supporting HBase ClustersOperating and supporting HBase Clusters
Operating and supporting HBase Clustersenissoz
 
High Availability for HBase Tables - Past, Present, and Future
High Availability for HBase Tables - Past, Present, and FutureHigh Availability for HBase Tables - Past, Present, and Future
High Availability for HBase Tables - Past, Present, and FutureDataWorks Summit
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseenissoz
 
DevopsItalia2015 - DHCP at Facebook - Evolution of an infrastructure
DevopsItalia2015 - DHCP at Facebook - Evolution of an infrastructureDevopsItalia2015 - DHCP at Facebook - Evolution of an infrastructure
DevopsItalia2015 - DHCP at Facebook - Evolution of an infrastructureAngelo Failla
 
Challenges for Deploying a High-Performance Computing Application to the Cloud
Challenges for Deploying a High-Performance Computing Application to the CloudChallenges for Deploying a High-Performance Computing Application to the Cloud
Challenges for Deploying a High-Performance Computing Application to the CloudIntel® Software
 
XPDDS17: To Grant or Not to Grant? - João Martins, Oracle
XPDDS17: To Grant or Not to Grant? - João Martins, Oracle XPDDS17: To Grant or Not to Grant? - João Martins, Oracle
XPDDS17: To Grant or Not to Grant? - João Martins, Oracle The Linux Foundation
 
SCU 2015 - Hyper-V Replica
SCU 2015 - Hyper-V ReplicaSCU 2015 - Hyper-V Replica
SCU 2015 - Hyper-V ReplicaMike Resseler
 
Texter blue - gdpr watchdog
Texter blue - gdpr watchdogTexter blue - gdpr watchdog
Texter blue - gdpr watchdogLuis Cabaceira
 
hadoop architecture -Big data hadoop
   hadoop architecture -Big data hadoop   hadoop architecture -Big data hadoop
hadoop architecture -Big data hadoopjasikadogra
 
What's New and Upcoming in HDFS - the Hadoop Distributed File System
What's New and Upcoming in HDFS - the Hadoop Distributed File SystemWhat's New and Upcoming in HDFS - the Hadoop Distributed File System
What's New and Upcoming in HDFS - the Hadoop Distributed File SystemCloudera, Inc.
 
SVC / Storwize: cache partition analysis (BVQ howto)
SVC / Storwize: cache partition analysis  (BVQ howto)   SVC / Storwize: cache partition analysis  (BVQ howto)
SVC / Storwize: cache partition analysis (BVQ howto) Michael Pirker
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
HBase Read High Availability Using Timeline Consistent Region Replicas
HBase  Read High Availability Using Timeline Consistent Region ReplicasHBase  Read High Availability Using Timeline Consistent Region Replicas
HBase Read High Availability Using Timeline Consistent Region Replicasenissoz
 
XPDS13: Enabling Fast, Dynamic Network Processing with ClickOS - Joao Martins...
XPDS13: Enabling Fast, Dynamic Network Processing with ClickOS - Joao Martins...XPDS13: Enabling Fast, Dynamic Network Processing with ClickOS - Joao Martins...
XPDS13: Enabling Fast, Dynamic Network Processing with ClickOS - Joao Martins...The Linux Foundation
 
Yeti DNS Project
Yeti DNS ProjectYeti DNS Project
Yeti DNS ProjectAPNIC
 

Mais procurados (20)

Considerations when implementing_ha_in_dmf
Considerations when implementing_ha_in_dmfConsiderations when implementing_ha_in_dmf
Considerations when implementing_ha_in_dmf
 
Operating and supporting HBase Clusters
Operating and supporting HBase ClustersOperating and supporting HBase Clusters
Operating and supporting HBase Clusters
 
High Availability for HBase Tables - Past, Present, and Future
High Availability for HBase Tables - Past, Present, and FutureHigh Availability for HBase Tables - Past, Present, and Future
High Availability for HBase Tables - Past, Present, and Future
 
ApacheCon-HBase-2016
ApacheCon-HBase-2016ApacheCon-HBase-2016
ApacheCon-HBase-2016
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
DevopsItalia2015 - DHCP at Facebook - Evolution of an infrastructure
DevopsItalia2015 - DHCP at Facebook - Evolution of an infrastructureDevopsItalia2015 - DHCP at Facebook - Evolution of an infrastructure
DevopsItalia2015 - DHCP at Facebook - Evolution of an infrastructure
 
Challenges for Deploying a High-Performance Computing Application to the Cloud
Challenges for Deploying a High-Performance Computing Application to the CloudChallenges for Deploying a High-Performance Computing Application to the Cloud
Challenges for Deploying a High-Performance Computing Application to the Cloud
 
Drop the Pressure on your Production Server
Drop the Pressure on your Production ServerDrop the Pressure on your Production Server
Drop the Pressure on your Production Server
 
XPDDS17: To Grant or Not to Grant? - João Martins, Oracle
XPDDS17: To Grant or Not to Grant? - João Martins, Oracle XPDDS17: To Grant or Not to Grant? - João Martins, Oracle
XPDDS17: To Grant or Not to Grant? - João Martins, Oracle
 
SCU 2015 - Hyper-V Replica
SCU 2015 - Hyper-V ReplicaSCU 2015 - Hyper-V Replica
SCU 2015 - Hyper-V Replica
 
Texter blue - gdpr watchdog
Texter blue - gdpr watchdogTexter blue - gdpr watchdog
Texter blue - gdpr watchdog
 
hadoop architecture -Big data hadoop
   hadoop architecture -Big data hadoop   hadoop architecture -Big data hadoop
hadoop architecture -Big data hadoop
 
What's New and Upcoming in HDFS - the Hadoop Distributed File System
What's New and Upcoming in HDFS - the Hadoop Distributed File SystemWhat's New and Upcoming in HDFS - the Hadoop Distributed File System
What's New and Upcoming in HDFS - the Hadoop Distributed File System
 
SVC / Storwize: cache partition analysis (BVQ howto)
SVC / Storwize: cache partition analysis  (BVQ howto)   SVC / Storwize: cache partition analysis  (BVQ howto)
SVC / Storwize: cache partition analysis (BVQ howto)
 
XS 2008 Boston Capacity Planning
XS 2008 Boston Capacity PlanningXS 2008 Boston Capacity Planning
XS 2008 Boston Capacity Planning
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
HBase Read High Availability Using Timeline Consistent Region Replicas
HBase  Read High Availability Using Timeline Consistent Region ReplicasHBase  Read High Availability Using Timeline Consistent Region Replicas
HBase Read High Availability Using Timeline Consistent Region Replicas
 
XS Oracle 2009 Just Run It
XS Oracle 2009 Just Run ItXS Oracle 2009 Just Run It
XS Oracle 2009 Just Run It
 
XPDS13: Enabling Fast, Dynamic Network Processing with ClickOS - Joao Martins...
XPDS13: Enabling Fast, Dynamic Network Processing with ClickOS - Joao Martins...XPDS13: Enabling Fast, Dynamic Network Processing with ClickOS - Joao Martins...
XPDS13: Enabling Fast, Dynamic Network Processing with ClickOS - Joao Martins...
 
Yeti DNS Project
Yeti DNS ProjectYeti DNS Project
Yeti DNS Project
 

Destaque

Hadoop - Integration Patterns and Practices__HadoopSummit2010
Hadoop - Integration Patterns and Practices__HadoopSummit2010Hadoop - Integration Patterns and Practices__HadoopSummit2010
Hadoop - Integration Patterns and Practices__HadoopSummit2010Yahoo Developer Network
 
Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ec...
Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ec...Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ec...
Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ec...Cloudera, Inc.
 
Spring for Apache Hadoop
Spring for Apache HadoopSpring for Apache Hadoop
Spring for Apache Hadoopzenyk
 
Designing a reactive data platform: Challenges, patterns, and anti-patterns
Designing a reactive data platform: Challenges, patterns, and anti-patterns Designing a reactive data platform: Challenges, patterns, and anti-patterns
Designing a reactive data platform: Challenges, patterns, and anti-patterns Alex Silva
 
How to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin LeauHow to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin LeauCodemotion
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detectionhadooparchbook
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingJack Gudenkauf
 
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorialStrata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorialhadooparchbook
 
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12c
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12cPart 2 - Hadoop Data Loading using Hadoop Tools and ODI12c
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12cMark Rittman
 
Building Continuously Curated Ingestion Pipelines
Building Continuously Curated Ingestion PipelinesBuilding Continuously Curated Ingestion Pipelines
Building Continuously Curated Ingestion PipelinesArvind Prabhakar
 
Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!Pat Patterson
 
Data Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on HadoopData Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on Hadoopskaluska
 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applicationshadooparchbook
 
Apache Flume - DataDayTexas
Apache Flume - DataDayTexasApache Flume - DataDayTexas
Apache Flume - DataDayTexasArvind Prabhakar
 
How to Build Continuous Ingestion for the Internet of Things
How to Build Continuous Ingestion for the Internet of ThingsHow to Build Continuous Ingestion for the Internet of Things
How to Build Continuous Ingestion for the Internet of ThingsCloudera, Inc.
 

Destaque (17)

Hadoop - Integration Patterns and Practices__HadoopSummit2010
Hadoop - Integration Patterns and Practices__HadoopSummit2010Hadoop - Integration Patterns and Practices__HadoopSummit2010
Hadoop - Integration Patterns and Practices__HadoopSummit2010
 
Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ec...
Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ec...Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ec...
Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ec...
 
Spring for Apache Hadoop
Spring for Apache HadoopSpring for Apache Hadoop
Spring for Apache Hadoop
 
Designing a reactive data platform: Challenges, patterns, and anti-patterns
Designing a reactive data platform: Challenges, patterns, and anti-patterns Designing a reactive data platform: Challenges, patterns, and anti-patterns
Designing a reactive data platform: Challenges, patterns, and anti-patterns
 
How to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin LeauHow to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin Leau
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detection
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
 
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorialStrata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
 
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12c
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12cPart 2 - Hadoop Data Loading using Hadoop Tools and ODI12c
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12c
 
Building Continuously Curated Ingestion Pipelines
Building Continuously Curated Ingestion PipelinesBuilding Continuously Curated Ingestion Pipelines
Building Continuously Curated Ingestion Pipelines
 
Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!Open Source Big Data Ingestion - Without the Heartburn!
Open Source Big Data Ingestion - Without the Heartburn!
 
Data Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on HadoopData Ingestion, Extraction & Parsing on Hadoop
Data Ingestion, Extraction & Parsing on Hadoop
 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applications
 
Apache Flume - DataDayTexas
Apache Flume - DataDayTexasApache Flume - DataDayTexas
Apache Flume - DataDayTexas
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Integrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data LakesIntegrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data Lakes
 
How to Build Continuous Ingestion for the Internet of Things
How to Build Continuous Ingestion for the Internet of ThingsHow to Build Continuous Ingestion for the Internet of Things
How to Build Continuous Ingestion for the Internet of Things
 

Semelhante a Chicago Data Summit: Flume: An Introduction

Apache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshApache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshConfluentInc1
 

Chicago Data Summit: Flume: An Introduction

  • 2. Flume Logging for the Enterprise. Jonathan Hsieh, Henry Robinson, Patrick Hunt, Eric Sammer (Cloudera, Inc.). Chicago Data Summit, 4/26/11
  • 3. Who Am I? Cloudera: Software Engineer on the Platform Team; Flume Project Lead / Designer / Architect. U of Washington: “On Leave” from PhD program; research in Systems and Programming Languages. Previously: Computer Security, Embedded Systems. Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 4. An Enterprise Scenario: You have a bunch of departments with servers generating log files. You are required to keep logs and want to analyze and profit from them. Because of the volume of uncooked data, you’ve started using Cloudera’s Distribution including Apache Hadoop. … and you’ve got several ad-hoc, legacy scripts/systems that copy data from servers/filers and then to HDFS. It’s log, log .. Everyone wants a log!
  • 5. Ad-hoc gets complicated. Black box? What happens if the person who wrote it leaves? Unextensible? Is it one-off or flexible enough to handle future needs? Unmanageable? Do you know when something goes wrong? Unreliable? If something goes wrong, will it recover? Unscalable? Hit an ingestion rate limit?
  • 6. Cloudera Flume: Flume is a framework and conduit for collecting and quickly shipping data records from many sources to one centralized place for storage and processing. Project Goals: Scalability, Reliability, Extensibility, Manageability, Openness.
  • 7. The Canonical Use Case [Diagram: servers, each with an agent (agent tier), feeding collectors (collector tier), which write to HDFS.]
  • 8. The Canonical Use Case [Same diagram, with the agent and collector tiers labeled Flume.]
  • 9. The Canonical Use Case [Same diagram, with a Flume Master added.]
  • 11. Flume’s Key Abstractions: Data path and control path. Nodes are in the data path: nodes have a source and a sink, and they can take different roles. A typical topology has agent nodes and collector nodes; optionally it has processor nodes. Masters are in the control path: a centralized point of configuration that specifies sources and sinks and can control flows of data between nodes. Use one master, or use many with a ZooKeeper (ZK)-backed quorum. [Diagram: an agent node and a collector node, each with a source and a sink, and a master.]
  • 13. Outline: What is Flume? Scalability: horizontal scalability of all nodes and masters. Reliability: fault tolerance and high availability. Extensibility: Unix principle; all kinds of data, all kinds of sources, all kinds of sinks. Manageability: centralized management supporting dynamic reconfiguration. Openness: Apache v2.0 License and an active and growing community.
  • 14. Scalability
  • 15. The Canonical Use Case [Diagram repeated: agents on servers (agent tier) feeding collectors (collector tier), which write to HDFS.]
  • 17. Node scalability limits and optimization plans: In most deployments today, a single collector is not saturated. The current implementation can write at 20MB/s over 1GbE (~1.75 TB/day) due to unoptimized network usage. Assuming 1GbE with aggregate disk able to write at close to GbE rate, we can probably reach: 3-5x by batching to get to the wire/disk limit (trade latency for throughput); 5-10x by compression to trade CPU for throughput (logs are highly compressible). The limit is probably in the ballpark of 40 effective TB/day/collector. [Diagram: agents on servers feeding one collector, which writes to HDFS.]
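The throughput figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 1,000,000 MB) and a 1GbE wire limit of 125 MB/s; the function and variable names are illustrative and not part of Flume.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tb_per_day(mb_per_s):
    """Convert a sustained rate in MB/s to decimal TB/day."""
    return mb_per_s * SECONDS_PER_DAY / 1_000_000


# Current implementation: 20 MB/s per collector (figure from the slide).
baseline = tb_per_day(20)

# Assumed ceiling: 1GbE is 1000 Mbit/s, i.e. 125 MB/s on the wire.
wire_limit = tb_per_day(1000 / 8)

# Batching: 3-5x the baseline, capped at the assumed wire limit.
batched_low, batched_high = (min(f * baseline, wire_limit) for f in (3, 5))

# Compression: 5-10x more log data per byte sent ("effective" throughput).
effective_low = batched_low * 5
effective_high = batched_high * 10

print(f"baseline:   {baseline:.2f} TB/day")
print(f"wire limit: {wire_limit:.2f} TB/day")
print(f"effective:  {effective_low:.1f} to {effective_high:.1f} TB/day")
```

Under these assumptions the baseline works out to about 1.73 TB/day, close to the slide's ~1.75, and the 40 effective TB/day estimate falls between the low and high bounds.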
  • 18. Control plane is horizontally scalable: A master controls dynamic configurations of nodes. Uses a consensus protocol to keep state consistent. Scales well for configuration reads. Allows for adaptive repartitioning in the future. Nodes can talk to any master. Masters can talk to an existing ZK ensemble. [Diagram: three nodes, three masters, and a ZK ensemble (ZK1, ZK2, ZK3).]
  • 19. Reliability
  • 20. Failures: Faults can happen at many levels. Software applications can fail. Machines can fail. Networking gear can fail. Excessive networking congestion or machine load. A node goes down for maintenance. How do we make sure that events make it to a permanent store?
  • 21. Tunable failure recovery modes. Best effort: fire and forget. Store on failure + retry: writes to disk on detected failure; one-hop TCP acks; failover when faults detected. End-to-end reliability: write-ahead log on agent; checksums and end-to-end acks; data survives compound failures, and may be retried multiple times. [Diagram: for each mode, an agent sends to a collector, which writes to HDFS.]
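The end-to-end mode can be sketched as follows. This is a toy model, not Flume's implementation: `Agent` and `FlakyCollector` are hypothetical names, an in-memory dictionary stands in for the on-disk write-ahead log, and CRC32 stands in for whatever checksum Flume actually uses.

```python
import zlib
from collections import OrderedDict


class FlakyCollector:
    """Hypothetical collector that drops its first `fail_first` deliveries."""

    def __init__(self, fail_first):
        self.fail_first = fail_first
        self.stored = []  # stands in for the permanent store

    def deliver(self, payload, checksum):
        """Return True (an ack) only if the event arrived intact and was stored."""
        if self.fail_first > 0:
            self.fail_first -= 1
            return False  # simulated fault: no ack comes back
        if zlib.crc32(payload) != checksum:
            return False  # corrupted in transit: no ack
        self.stored.append(payload)
        return True


class Agent:
    """Keeps every event in a write-ahead log until it is acknowledged."""

    def __init__(self, collector):
        self.collector = collector
        self.wal = OrderedDict()  # event id -> payload (assumed in-memory WAL)
        self.next_id = 0

    def append(self, payload):
        self.wal[self.next_id] = payload
        self.next_id += 1

    def flush(self, max_rounds=5):
        """Send unacked events, retrying up to max_rounds; True if WAL drained."""
        for _ in range(max_rounds):
            for event_id, payload in list(self.wal.items()):
                if self.collector.deliver(payload, zlib.crc32(payload)):
                    del self.wal[event_id]  # acked: safe to forget
            if not self.wal:
                return True
        return False


collector = FlakyCollector(fail_first=2)
agent = Agent(collector)
events = [b"event-0", b"event-1", b"event-2"]
for e in events:
    agent.append(e)
drained = agent.flush()
```

Note that retried events may arrive out of order; the sketch only guarantees that every event is eventually stored while the collector recovers within the retry budget.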
  • 23. Use randomization to pre-specify failovers when many collectors exist. Spread load if a collector goes down. Spread load if new collectors are added to the system. [Diagram: agents connected to three collectors.]
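One way to picture pre-specified, randomized failovers is a per-agent failover chain drawn from the collector list. The sketch below is an illustration under assumed names (`failover_chain`, `pick_collector`), not Flume's actual algorithm; seeding the generator with the agent name is an assumption that keeps each agent's chain stable across runs.

```python
import random


def failover_chain(agent, collectors, chain_len=2):
    """Pick an ordered list of distinct collectors for one agent."""
    rng = random.Random(agent)  # seeded per agent, so the chain is repeatable
    return rng.sample(collectors, k=min(chain_len, len(collectors)))


def pick_collector(chain, down):
    """Return the first collector in the chain that is not known to be down."""
    for collector in chain:
        if collector not in down:
            return collector
    return None


collectors = ["collector1", "collector2", "collector3"]
chain = failover_chain("agent-a", collectors)
primary = pick_collector(chain, down=set())
backup = pick_collector(chain, down={chain[0]})
```

Because different agents draw different chains, the agents of a failed collector tend to scatter across the remaining collectors instead of all moving to the same one.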
  • 26. Control plane is Fault Tolerant: A master controls dynamic configurations of nodes. Uses a consensus protocol to keep state consistent. Scales well for configuration reads. Allows for adaptive repartitioning in the future. Nodes can talk to any master. Masters can talk to an existing ZK ensemble. [Diagram: three nodes, three masters, and a ZK ensemble (ZK1, ZK2, ZK3).]
  • 29. Extensibility
  • 30. Flume is easy to extend: Simple source and sink APIs. An event streaming design. Many simple operations compose for complex behavior. Plug-in architecture so you can add your own sources, sinks and decorators. [Diagram: sources, decorators (deco), a fanout, and sinks.]
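The composition idea (sources feed sinks, decorators wrap sinks, a fanout copies events to several sinks) can be sketched with plain functions. Everything below is a hypothetical model for illustration; Flume's real source and sink APIs are Java interfaces and are not reproduced here.

```python
def list_sink(out):
    """A sink that appends each event to the given list."""
    return out.append


def fanout(*sinks):
    """A sink that forwards each event to every child sink."""
    def send(event):
        for sink in sinks:
            sink(event)
    return send


def sample_deco(every, sink):
    """A decorator that passes only every Nth event to the wrapped sink."""
    count = 0

    def send(event):
        nonlocal count
        count += 1
        if count % every == 0:
            sink(event)
    return send


def tag_deco(key, value, sink):
    """A decorator that adds one attribute to each event before forwarding it."""
    def send(event):
        sink({**event, key: value})
    return send


def run(source, sink):
    """Drain an iterable source into a sink."""
    for event in source:
        sink(event)


all_events, sampled = [], []
pipeline = fanout(
    list_sink(all_events),
    sample_deco(2, tag_deco("sampled", True, list_sink(sampled))),
)
run(({"body": f"line {i}"} for i in range(4)), pipeline)
```

Each piece does one small thing, and the pipeline's behavior comes entirely from how the pieces are nested.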
  • 31. Variety of Connectors. Sources produce data: Console, Exec, Syslog, Scribe, IRC, Twitter. In the works: JMS, AMQP, pubsubhubbub/RSS/Atom. Sinks consume data: Console, Local files, HDFS, S3. Contributed: Hive (Mozilla), HBase (Sematext), Cassandra (Riptano/DataStax), Voldemort, Elastic Search. In the works: JMS, AMQP. Decorators modify data sent to sinks: wire batching, compression, sampling, projection, extraction, throughput throttling. Custom near real-time processing (Meebo). JRuby event modifiers (InfoChimps). Cryptographic extensions (Rearden). Streaming SQL in-stream analytics system FlumeBase (Aaron Kimball).
  • 32. Migrating previous enterprise architecture [Diagram labels: filer with a poller agent; msg bus with an amqp agent; custom app with an avro agent; each Flume agent feeds a collector that writes to HDFS.]
  • 33. Data ingestion pipeline pattern [Diagram labels: agents on servers feed a Flume collector with a fanout to index, hbase, and hdfs; HDFS: Hive query, Pig query; HBase: key lookup, range query; incremental search index: search query, faceted query.]
  • 34. Manageability. Wheeeeee!
  • 35. Configuring Flume. Node: tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ] ; A concise and precise configuration language for specifying dataflows in a node. Dynamic updates of configurations allow for live failover changes, for handling newly provisioned machines, and for changing analytics. (Diagram labels: tail, filter, fanout, console, roll, hdfs.)
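The way sources, decorators, fanouts, and sinks compose can be illustrated with a minimal Python sketch. This is not Flume's API: the class names (`Sink`, `ListSink`, `FilterDecorator`, `FanoutSink`), the `run` helper, and the "keep only ERROR lines" predicate are all hypothetical stand-ins chosen for illustration, loosely mirroring the `source | filter [ sinkA, sinkB ]` shape of the configuration above.

```python
from typing import Callable, Iterable, List


class Sink:
    """Hypothetical sink interface: receives one event at a time."""

    def append(self, event: str) -> None:
        raise NotImplementedError


class ListSink(Sink):
    """Stand-in for a console or HDFS sink; it only remembers events."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def append(self, event: str) -> None:
        self.events.append(event)


class FilterDecorator(Sink):
    """Decorator: wraps another sink and forwards only matching events."""

    def __init__(self, predicate: Callable[[str], bool], inner: Sink) -> None:
        self.predicate = predicate
        self.inner = inner

    def append(self, event: str) -> None:
        if self.predicate(event):
            self.inner.append(event)


class FanoutSink(Sink):
    """Fanout: delivers every event to each child sink."""

    def __init__(self, children: Iterable[Sink]) -> None:
        self.children = list(children)

    def append(self, event: str) -> None:
        for child in self.children:
            child.append(event)


def run(source: Iterable[str], sink: Sink) -> None:
    """Drain a source (any iterable of events) into a sink."""
    for event in source:
        sink.append(event)


# Assumed example data: three log lines, of which one is an error.
console, storage = ListSink(), ListSink()
pipeline = FilterDecorator(lambda e: "ERROR" in e, FanoutSink([console, storage]))
run(["INFO start", "ERROR disk full", "INFO stop"], pipeline)
print(console.events)
print(storage.events)
```

Because a decorator is itself a sink, decorators and fanouts can be nested to any depth, which is the sense in which many simple operations compose into complex behavior.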
  • 36. Output bucketing. Automatic output file management. Writes HDFS files into directories based on time tags: /logs/web/2010/0715/1200/data-xxx.txt /logs/web/2010/0715/1200/data-xxy.txt /logs/web/2010/0715/1300/data-xxx.txt /logs/web/2010/0715/1300/data-xxy.txt /logs/web/2010/0715/1400/data-xxx.txt … Collector node: collectorSource | collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data")
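To see how a time-tagged template such as `%Y/%m%d/%H00` turns into the bucket directories listed above, here is a small Python sketch. It assumes the escapes behave like the matching `strftime` codes, which holds for `%Y`, `%m`, `%d`, and `%H` but is not a statement about Flume's full escape syntax; the `bucket_path` helper and the `suffix` argument are hypothetical.

```python
from datetime import datetime


def bucket_path(template: str, prefix: str, event_time: datetime, suffix: str) -> str:
    """Expand strftime-style escapes in the template, then append a file name."""
    directory = event_time.strftime(template)
    return f"{directory}/{prefix}-{suffix}.txt"


template = "/logs/web/%Y/%m%d/%H00"

# Two events in the same hour land in the same bucket directory.
first = bucket_path(template, "data", datetime(2010, 7, 15, 12, 5, 0), "xxx")
second = bucket_path(template, "data", datetime(2010, 7, 15, 12, 59, 59), "xxy")
# An event one hour later starts a new bucket.
third = bucket_path(template, "data", datetime(2010, 7, 15, 13, 0, 0), "xxx")

print(first)   # /logs/web/2010/0715/1200/data-xxx.txt
print(second)  # /logs/web/2010/0715/1200/data-xxy.txt
print(third)   # /logs/web/2010/0715/1300/data-xxx.txt
```

Bucketing by event time keeps each hour's data in its own directory, so a downstream job can process one directory per hour without scanning the rest.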
  • 37. Configuration is straightforward. node001: tail("/var/log/app/log") | autoE2ESink; node002: tail("/var/log/app/log") | autoE2ESink; … node100: tail("/var/log/app/log") | autoE2ESink; collector1: autoCollectorSource | collectorSink("hdfs://logs/app/","applogs") collector2: autoCollectorSource | collectorSink("hdfs://logs/app/","applogs") collector3: autoCollectorSource | collectorSink("hdfs://logs/app/","applogs")
  • 38. Centralized Dataflow Management Interfaces. One place to specify node sources, sinks, and data flows. Basic web interface. Flume Shell: command-line interface, scriptable. Cloudera Enterprise Flume Monitor App: graphical web interface.
  • 39. Enterprise Friendly. Integrated as part of CDH3 and Cloudera Enterprise. RPM and DEB packaging for enterprise Linux. Flume Node for Windows (beta). Cloudera Enterprise Support: 24-7 support SLAs, professional services. Cloudera Flume features for enterprises: Kerberos authentication support for writing to “secure” HDFS; detailed JSON-exposed metrics for monitoring integration (beta); Log4J collection (beta); high availability via multiple masters (alpha); encrypted SSL/TLS data path and control path support (dev).
  • 40. An enterprise story. (Diagram labels: Department Servers (Win, Linux) running api with Agents, Collector tier, Flume, Kerberos, HDFS, Active Directory / LDAP.)
  • 41. Openness And Community
  • 42. Flume is Open Source. Apache v2.0 open source license. Independent from the Apache Software Foundation. You have the right to fork or modify the software. GitHub source code repository: http://github.com/cloudera/flume Regular tarball updates every 2-3 months. Regular CDH packaging updates every 3-4 months. Always looking for contributors and committers.
  • 44. Lots of innovation comes from community
  • 45. Community folks are willing to try incomplete features.
  • 46. Early feedback and community fixes
  • 47. Many interesting topologies in the community
  • 48. Multi Datacenter. (Diagram labels: API server and Processor server machines with api and proc Agents, Collector tier, HDFS.)
  • 49. Multi Datacenter. (Diagram labels: API server and Processor server machines with api and proc Agents, Relay, Collector tier, HDFS.)
  • 50. Near Real-time Aggregator. (Diagram labels: Ad svr machines with Agents, Flume Collector, Tracker, DB for quick reports, HDFS, Hive job to verify reports.)
  • 51. Community Support. Community-based mailing lists for support: “an answer in a few days”. User: https://groups.google.com/a/cloudera.org/group/flume-user Dev: https://groups.google.com/a/cloudera.org/group/flume-dev Community-based IRC chat room: “quick questions, quick answers”. #flume on irc.freenode.net
  • 52. Conclusions
  • 53. Summary. Flume is a distributed, reliable, scalable, extensible system for collecting and delivering high-volume continuous event data such as logs. It is centrally managed, which allows for automated and adaptive configurations. This design allows for near-real-time processing. Apache v2.0 License with an active and growing community. Part of Cloudera’s Distribution including Apache Hadoop, updated for CDH3u0 and Cloudera Enterprise. Several CDH users in the community are in production use. Several Cloudera Enterprise customers are evaluating for production use.
  • 54. Related systems. Remote syslog-ng / rsyslog / syslog: best effort; if the server is down, messages are lost. Chukwa (Yahoo! / Apache Incubator): designed as a monitoring system for Hadoop; minibatches, requires MapReduce batch processing to demultiplex data; new HBase-dependent path; one of the core contributors (Ari) currently works at Cloudera (not on Chukwa). Scribe (Facebook): only durable-on-failure reliability mechanisms; collector disk is the bottleneck; little visibility into system performance; little support or documentation; most Scribe deploys replaced by “Data Freeway”. Kafka (LinkedIn): new system by LinkedIn; pull model; interesting, written in Scala.
  • 55. Questions? Contact info: jon@cloudera.com Twitter: @jmhsieh
  • 57. Flow Isolation. Isolate different kinds of data when and where it is generated. Have multiple logical nodes on a machine. Each has its own data source. Each has its own data sink. (Diagram labels: Agent, Collector.)
  • 59. Image credits: http://www.flickr.com/photos/victorvonsalza/3327750057/ http://www.flickr.com/photos/victorvonsalza/3207639929/ http://www.flickr.com/photos/victorvonsalza/3327750059/ http://www.emvergeoning.com/?m=200811 http://www.flickr.com/photos/juse/188960076/ http://www.flickr.com/photos/juse/188960076/ http://www.flickr.com/photos/23720661@N08/3186507302/ http://clarksoutdoorchairs.com/log_adirondack_chairs.html http://www.flickr.com/photos/dboo/3314299591/
  • 60. Master Service Failures. A master machine should not be a single point of failure! Masters keep two kinds of information. Configuration information (node/flow configuration): kept in a ZooKeeper ensemble as a persistent, highly available metadata store; failures are easily recovered from. Ephemeral information (heartbeat info, acks, metrics reports): kept in memory; failures will lose this data; this information can be lazily replicated.
  • 61. Dealing with Agent failures. We do not want to lose data. Make events durable at the generation point. If a log generator goes down, it is not generating logs. If the event generation point fails and recovers, data will reach the end point. Data is durable and survives if a machine crashes and reboots. Allows for synchronous writes in log-generating applications. A watchdog program restarts the agent if it fails.
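The idea of making events durable at the generation point can be sketched in Python as a small write-ahead log. This is an illustration under stated assumptions, not Flume's implementation: the `DurableAgent` class, its one-JSON-line-per-event file format, and the `send` callback that returns `True` on acknowledgement are all hypothetical.

```python
import json
import os
import tempfile
from typing import Callable, List


class DurableAgent:
    """Sketch of an agent that persists events before trying to deliver them."""

    def __init__(self, wal_path: str) -> None:
        self.wal_path = wal_path

    def append(self, event: str) -> None:
        # Make the event durable first: write it and force it to disk.
        with open(self.wal_path, "a", encoding="utf-8") as wal:
            wal.write(json.dumps(event) + "\n")
            wal.flush()
            os.fsync(wal.fileno())

    def pending(self) -> List[str]:
        # Everything still in the log has not been acknowledged yet.
        if not os.path.exists(self.wal_path):
            return []
        with open(self.wal_path, encoding="utf-8") as wal:
            return [json.loads(line) for line in wal if line.strip()]

    def flush(self, send: Callable[[str], bool]) -> int:
        """Deliver pending events in order; keep the ones that were not acked."""
        events = self.pending()
        delivered = 0
        for event in events:
            if not send(event):
                break
            delivered += 1
        # Write the remaining events to a temp file, then swap it into place.
        tmp_path = self.wal_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as tmp:
            for event in events[delivered:]:
                tmp.write(json.dumps(event) + "\n")
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, self.wal_path)
        return delivered


wal_file = os.path.join(tempfile.mkdtemp(), "agent.wal")
agent = DurableAgent(wal_file)
agent.append("event-1")
agent.append("event-2")
agent.flush(lambda event: False)  # collector is down: nothing is acked

restarted = DurableAgent(wal_file)  # simulate an agent crash and restart
received: List[str] = []


def collector_up(event: str) -> bool:
    received.append(event)
    return True


restarted.flush(collector_up)
print(received)
```

In this sketch, a crash between a successful send and the log rewrite would cause the event to be sent again after restart, so the receiving side would need to tolerate duplicates.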