SlideShare a Scribd company logo
1 of 26
1
Headline Goes Here
Speaker Name or Subhead Goes Here
DO NOT USE PUBLICLY
PRIOR TO 10/23/12
Real Time Data Ingest into Hadoop
using Flume
Hari Shreedharan
Committer and PMC Member, Apache Flume
Software Engineer, Cloudera
What is Flume
• Collection, Aggregation of streaming Event Data
• Typically used for log data
• Significant advantages over ad-hoc solutions
• Reliable, Scalable, Manageable, Customizable and High
Performance
• Declarative, Dynamic Configuration
• Contextual Routing
• Feature rich
• Fully extensible
2
Core Concepts: Event
An Event is the fundamental unit of data transported by Flume
from its point of origination to its final destination. Event is a
byte array payload accompanied by optional headers.
• Payload is opaque to Flume
• Headers are specified as an unordered collection of string key-
value pairs, with keys being unique across the collection
• Headers can be used for contextual routing
3
Core Concepts: Client
An entity that generates events and sends them to one or more Agents.
• Example
• Flume log4j Appender
• Custom Client using Client SDK (org.apache.flume.api)
• Embedded Agent – An agent embedded within your application
• Decouples Flume from the system where event data is consumed
from
• Not needed in all cases
4
Core Concepts: Agent
A container for hosting Sources, Channels, Sinks and other
components that enable the transportation of events from one
place to another.
• Fundamental part of a Flume flow
• Provides Configuration, Life-Cycle Management, and
Monitoring Support for hosted components
5
Typical Aggregation Flow
6
[Client]+  Agent [ Agent]*  Destination
Core Concepts: Source
An active component that receives events from a specialized
location or mechanism and places it on one or Channels.
• Different Source types:
• Specialized sources for integrating with well-known systems.
Example: Syslog, Netcat
• Auto-Generating Sources: Exec, SEQ
• IPC sources for Agent-to-Agent communication: Avro
• Require at least one channel to function
7
Sources
• Different Source types:
• Specialized sources for integrating with well-known systems.
Example: Spooling Files, Syslog, Netcat, JMS
• Auto-Generating Sources: Exec, SEQ
• IPC sources for Agent-to-Agent communication: Avro, Thrift
• Require at least one channel to function
8
Core Concepts: Channel
A passive component that buffers the incoming events until they are
drained by Sinks.
• Different Channels offer different levels of persistence:
• Memory Channel: volatile
• Data lost if JVM or machine restarts
• File Channel: backed by WAL implementation
• Data not lost unless the disk dies.
• Eventually, when the agent comes back data can be accessed.
• Channels are fully transactional
• Provide weak ordering guarantees
• Can work with any number of Sources and Sinks.
9
Core Concepts: Sink
An active component that removes events from a Channel and
transmits them to their next hop destination.
• Different types of Sinks:
• Terminal sinks that deposit events to their final destination. For
example: HDFS, HBase, Morphline-Solr, Elastic Search
• Sinks support serialization to user’s preferred formats.
• HDFS sink supports time-based and arbitrary bucketing of data while
writing to HDFS.
• IPC sink for Agent-to-Agent communication: Avro, Thrift
• Require exactly one channel to function
10
Flow Reliability
Reliability based on:
• Transactional Exchange between Agents
• Persistence Characteristics of Channels in the Flow
Also Available:
• Built-in Load balancing Support
• Built-in Failover Support
11
Flow Reliability
Normal Flow
Communication Failure between Agents
Communication Restored, Flow back to Normal
12
Flow Handling
Channels decouple impedance of upstream and downstream
• Upstream burstiness is damped by channels
• Downstream failures are transparently absorbed by channels
 Sizing of channel capacity is key in realizing these benefits
13
Configuration
• Java Properties File Format
# Comment line
key1 = value
key2 = multi-line 
value
• Hierarchical, Name Based Configuration
agent1.channels.myChannel.type = FILE
agent1.channels.myChannel.capacity = 1000
• Uses soft references for establishing associations
agent1.sources.mySource.type = HTTP
agent1.sources.mySource.channels = myChannel
14
Configuration
Typical Deployment
• All agents in a specific
tier could be given the
same name
• One configuration file
with entries for three
agents can be used
throughout
15
Contextual Routing
Achieved using Interceptors and Channel Selectors
16
Interceptors
Interceptor
An Interceptor is a component applied to a source in pre-specified order
to enable decorating and filtering of events where necessary.
• Built-in Interceptors allow adding headers such as timestamps,
hostname, static markers etc.
• Custom interceptors can introspect event payload to create specific
headers where necessary
17
Contextual Routing
Channel Selector
A Channel Selector allows a Source to select one or more
Channels from all the Channels that the Source is configured with
based on preset criteria.
• Built-in Channel Selectors:
• Replicating: for duplicating the events
• Multiplexing: for routing based on headers
18
Contextual Routing
• Terminal Sinks can directly use Headers to make destination
selections
• HDFS Sink can use headers values to create dynamic path for files
that the event will be added to.
• Some headers such as timestamps can be used in a more
sophisticated manner
• Custom Channel Selector can be used for doing specialized
routing where necessary
19
Load Balancing and Failover
Sink Processor
A Sink Processor is responsible for invoking one sink from an
assigned group of sinks.
• Built-in Sink Processors:
• Load Balancing Sink Processor – using RANDOM, ROUND_ROBIN
or Custom selection algorithm
• Failover Sink Processor
• Default Sink Processor
20
Sink Processor
• Invoked by Sink Runner
• Acts as a proxy for a Sink
21
Sink Processor Configuration
• A Sink can exist in at most one group
• A Sink that is not in any group is handled via Default Sink
Processor
• Caution:
Removing a Sink Group does not make the sinks inactive!
22
Client API
• Simple API that can be used to send data to Flume agents.
• Simplest form – send a batch of events to one agent.
• Can be used to send data to multiple agents in a round-robin,
random or failover fashion (send data to one till it fails).
• Java only.
• flume.thrift can be used to generate code for other
languages.
• Use with Thrift source.
23
Embedded Agent
• Client API throws exceptions if data could not be sent.
• Applications may not be able to tolerate this.
• Embedded Agent – A (limited) Flume agent within your application
• Has a channel – so buffers data in-memory or on-disk till the data is
sent or the channel is full.
• Throws exceptions only if the channel is full (or error writing to
channel).
• Cushion for application if something causes data to be stuck within
the application
• Supports sending data to other Flume agents only, no HDFS, HBase
etc.
24
Summary
• Clients send Events to Agents
• Agents hosts number Flume components – Source, Interceptors, Channel
Selectors, Channels, Sink Processors, and Sinks.
• Sources and Sinks are active components, where as Channels are passive
• Source accepts Events, passes them through Interceptor(s), and if not
filtered, puts them on channel(s) selected by the configured Channel
Selector
• Sink Processor identifies a sink to invoke, that can take Events from a
Channel and send it to its next hop destination
• Channel operations are transactional to guarantee one-hop delivery
semantics
• Channel persistence allows for ensuring end-to-end reliability
25
26
Questions?

More Related Content

Viewers also liked

2011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v52011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v5Samuel Rash
 
Best practice instagram
Best practice   instagramBest practice   instagram
Best practice instagramWooseung Kim
 
줌인터넷 빅데이터 활용사례 김우승
줌인터넷 빅데이터 활용사례 김우승줌인터넷 빅데이터 활용사례 김우승
줌인터넷 빅데이터 활용사례 김우승Wooseung Kim
 
2012 빅데이터 big data 발표자료
2012 빅데이터 big data 발표자료2012 빅데이터 big data 발표자료
2012 빅데이터 big data 발표자료Wooseung Kim
 
빅데이터 플랫폼 새로운 미래
빅데이터 플랫폼 새로운 미래빅데이터 플랫폼 새로운 미래
빅데이터 플랫폼 새로운 미래Wooseung Kim
 
The Future of Everything
The Future of EverythingThe Future of Everything
The Future of EverythingMichael Ducy
 
Bitcoin 2.0(blockchain technology 2)
Bitcoin 2.0(blockchain technology 2)Bitcoin 2.0(blockchain technology 2)
Bitcoin 2.0(blockchain technology 2)Wooseung Kim
 
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017Carol Smith
 

Viewers also liked (8)

2011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v52011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v5
 
Best practice instagram
Best practice   instagramBest practice   instagram
Best practice instagram
 
줌인터넷 빅데이터 활용사례 김우승
줌인터넷 빅데이터 활용사례 김우승줌인터넷 빅데이터 활용사례 김우승
줌인터넷 빅데이터 활용사례 김우승
 
2012 빅데이터 big data 발표자료
2012 빅데이터 big data 발표자료2012 빅데이터 big data 발표자료
2012 빅데이터 big data 발표자료
 
빅데이터 플랫폼 새로운 미래
빅데이터 플랫폼 새로운 미래빅데이터 플랫폼 새로운 미래
빅데이터 플랫폼 새로운 미래
 
The Future of Everything
The Future of EverythingThe Future of Everything
The Future of Everything
 
Bitcoin 2.0(blockchain technology 2)
Bitcoin 2.0(blockchain technology 2)Bitcoin 2.0(blockchain technology 2)
Bitcoin 2.0(blockchain technology 2)
 
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
 

Similar to Cloudera's Flume

Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Feb 2013 HUG: Large Scale Data Ingest Using Apache FlumeFeb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Feb 2013 HUG: Large Scale Data Ingest Using Apache FlumeYahoo Developer Network
 
Centralized logging with Flume
Centralized logging with FlumeCentralized logging with Flume
Centralized logging with FlumeRatnakar Pawar
 
Apache flume by Swapnil Dubey
Apache flume by Swapnil DubeyApache flume by Swapnil Dubey
Apache flume by Swapnil DubeySwapnil Dubey
 
Apache flume - an Introduction
Apache flume - an IntroductionApache flume - an Introduction
Apache flume - an IntroductionErik Schmiegelow
 
Flume DS -JSP.pptx
Flume DS -JSP.pptxFlume DS -JSP.pptx
Flume DS -JSP.pptxJayesh Patil
 
AdroitLogic UltraESB
AdroitLogic UltraESBAdroitLogic UltraESB
AdroitLogic UltraESBAdroitLogic
 
Big data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and SqoopBig data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and SqoopJeyamariappan Guru
 
Module notes artificial intelligence and
Module notes artificial intelligence andModule notes artificial intelligence and
Module notes artificial intelligence andbhagyavantrajapur88
 
Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application  Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application Apache Apex
 
Ch 10: Attacking Back-End Components
Ch 10: Attacking Back-End ComponentsCh 10: Attacking Back-End Components
Ch 10: Attacking Back-End ComponentsSam Bowne
 
Ch 13: Attacking Other Users: Other Techniques (Part 1)
Ch 13: Attacking Other Users:  Other Techniques (Part 1)Ch 13: Attacking Other Users:  Other Techniques (Part 1)
Ch 13: Attacking Other Users: Other Techniques (Part 1)Sam Bowne
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDataWorks Summit
 
INTERFACE by apidays 2023 - Data Collection Basics, Anais Dotis-Georgiou, Inf...
INTERFACE by apidays 2023 - Data Collection Basics, Anais Dotis-Georgiou, Inf...INTERFACE by apidays 2023 - Data Collection Basics, Anais Dotis-Georgiou, Inf...
INTERFACE by apidays 2023 - Data Collection Basics, Anais Dotis-Georgiou, Inf...apidays
 
How Tencent Applies Apache Pulsar to Apache InLong —— A Streaming Data Integr...
How Tencent Applies Apache Pulsar to Apache InLong —— A Streaming Data Integr...How Tencent Applies Apache Pulsar to Apache InLong —— A Streaming Data Integr...
How Tencent Applies Apache Pulsar to Apache InLong —— A Streaming Data Integr...StreamNative
 
Session 09 - Flume
Session 09 - FlumeSession 09 - Flume
Session 09 - FlumeAnandMHadoop
 

Similar to Cloudera's Flume (20)

Spark+flume seattle
Spark+flume seattleSpark+flume seattle
Spark+flume seattle
 
Flume basic
Flume basicFlume basic
Flume basic
 
Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Feb 2013 HUG: Large Scale Data Ingest Using Apache FlumeFeb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
 
Centralized logging with Flume
Centralized logging with FlumeCentralized logging with Flume
Centralized logging with Flume
 
Flume
FlumeFlume
Flume
 
Apache flume by Swapnil Dubey
Apache flume by Swapnil DubeyApache flume by Swapnil Dubey
Apache flume by Swapnil Dubey
 
Apache flume - an Introduction
Apache flume - an IntroductionApache flume - an Introduction
Apache flume - an Introduction
 
Flume DS -JSP.pptx
Flume DS -JSP.pptxFlume DS -JSP.pptx
Flume DS -JSP.pptx
 
AdroitLogic UltraESB
AdroitLogic UltraESBAdroitLogic UltraESB
AdroitLogic UltraESB
 
Big data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and SqoopBig data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and Sqoop
 
Inside Flume
Inside FlumeInside Flume
Inside Flume
 
Module notes artificial intelligence and
Module notes artificial intelligence andModule notes artificial intelligence and
Module notes artificial intelligence and
 
Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application  Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application
 
Ch 10: Attacking Back-End Components
Ch 10: Attacking Back-End ComponentsCh 10: Attacking Back-End Components
Ch 10: Attacking Back-End Components
 
Ch 13: Attacking Other Users: Other Techniques (Part 1)
Ch 13: Attacking Other Users:  Other Techniques (Part 1)Ch 13: Attacking Other Users:  Other Techniques (Part 1)
Ch 13: Attacking Other Users: Other Techniques (Part 1)
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
 
INTERFACE by apidays 2023 - Data Collection Basics, Anais Dotis-Georgiou, Inf...
INTERFACE by apidays 2023 - Data Collection Basics, Anais Dotis-Georgiou, Inf...INTERFACE by apidays 2023 - Data Collection Basics, Anais Dotis-Georgiou, Inf...
INTERFACE by apidays 2023 - Data Collection Basics, Anais Dotis-Georgiou, Inf...
 
How Tencent Applies Apache Pulsar to Apache InLong —— A Streaming Data Integr...
How Tencent Applies Apache Pulsar to Apache InLong —— A Streaming Data Integr...How Tencent Applies Apache Pulsar to Apache InLong —— A Streaming Data Integr...
How Tencent Applies Apache Pulsar to Apache InLong —— A Streaming Data Integr...
 
Session 09 - Flume
Session 09 - FlumeSession 09 - Flume
Session 09 - Flume
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 

Recently uploaded (20)

TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 

Cloudera's Flume

  • 1. 1 Headline Goes Here Speaker Name or Subhead Goes Here DO NOT USE PUBLICLY PRIOR TO 10/23/12 Real Time Data Ingest into Hadoop using Flume Hari Shreedharan Committer and PMC Member, Apache Flume Software Engineer, Cloudera
  • 2. What is Flume • Collection, Aggregation of streaming Event Data • Typically used for log data • Significant advantages over ad-hoc solutions • Reliable, Scalable, Manageable, Customizable and High Performance • Declarative, Dynamic Configuration • Contextual Routing • Feature rich • Fully extensible 2
  • 3. Core Concepts: Event An Event is the fundamental unit of data transported by Flume from its point of origination to its final destination. Event is a byte array payload accompanied by optional headers. • Payload is opaque to Flume • Headers are specified as an unordered collection of string key- value pairs, with keys being unique across the collection • Headers can be used for contextual routing 3
  • 4. Core Concepts: Client An entity that generates events and sends them to one or more Agents. • Example • Flume log4j Appender • Custom Client using Client SDK (org.apache.flume.api) • Embedded Agent – An agent embedded within your application • Decouples Flume from the system where event data is consumed from • Not needed in all cases 4
  • 5. Core Concepts: Agent A container for hosting Sources, Channels, Sinks and other components that enable the transportation of events from one place to another. • Fundamental part of a Flume flow • Provides Configuration, Life-Cycle Management, and Monitoring Support for hosted components 5
  • 6. Typical Aggregation Flow 6 [Client]+  Agent [ Agent]*  Destination
  • 7. Core Concepts: Source An active component that receives events from a specialized location or mechanism and places it on one or Channels. • Different Source types: • Specialized sources for integrating with well-known systems. Example: Syslog, Netcat • Auto-Generating Sources: Exec, SEQ • IPC sources for Agent-to-Agent communication: Avro • Require at least one channel to function 7
  • 8. Sources • Different Source types: • Specialized sources for integrating with well-known systems. Example: Spooling Files, Syslog, Netcat, JMS • Auto-Generating Sources: Exec, SEQ • IPC sources for Agent-to-Agent communication: Avro, Thrift • Require at least one channel to function 8
  • 9. Core Concepts: Channel A passive component that buffers the incoming events until they are drained by Sinks. • Different Channels offer different levels of persistence: • Memory Channel: volatile • Data lost if JVM or machine restarts • File Channel: backed by WAL implementation • Data not lost unless the disk dies. • Eventually, when the agent comes back data can be accessed. • Channels are fully transactional • Provide weak ordering guarantees • Can work with any number of Sources and Sinks. 9
  • 10. Core Concepts: Sink An active component that removes events from a Channel and transmits them to their next hop destination. • Different types of Sinks: • Terminal sinks that deposit events to their final destination. For example: HDFS, HBase, Morphline-Solr, Elastic Search • Sinks support serialization to user’s preferred formats. • HDFS sink supports time-based and arbitrary bucketing of data while writing to HDFS. • IPC sink for Agent-to-Agent communication: Avro, Thrift • Require exactly one channel to function 10
  • 11. Flow Reliability Reliability based on: • Transactional Exchange between Agents • Persistence Characteristics of Channels in the Flow Also Available: • Built-in Load balancing Support • Built-in Failover Support 11
  • 12. Flow Reliability Normal Flow Communication Failure between Agents Communication Restored, Flow back to Normal 12
  • 13. Flow Handling Channels decouple impedance of upstream and downstream • Upstream burstiness is damped by channels • Downstream failures are transparently absorbed by channels  Sizing of channel capacity is key in realizing these benefits 13
  • 14. Configuration • Java Properties File Format # Comment line key1 = value key2 = multi-line value • Hierarchical, Name Based Configuration agent1.channels.myChannel.type = FILE agent1.channels.myChannel.capacity = 1000 • Uses soft references for establishing associations agent1.sources.mySource.type = HTTP agent1.sources.mySource.channels = myChannel 14
  • 15. Configuration Typical Deployment • All agents in a specific tier could be given the same name • One configuration file with entries for three agents can be used throughout 15
  • 16. Contextual Routing Achieved using Interceptors and Channel Selectors 16
  • 17. Interceptors Interceptor An Interceptor is a component applied to a source in pre-specified order to enable decorating and filtering of events where necessary. • Built-in Interceptors allow adding headers such as timestamps, hostname, static markers etc. • Custom interceptors can introspect event payload to create specific headers where necessary 17
  • 18. Contextual Routing Channel Selector A Channel Selector allows a Source to select one or more Channels from all the Channels that the Source is configured with based on preset criteria. • Built-in Channel Selectors: • Replicating: for duplicating the events • Multiplexing: for routing based on headers 18
  • 19. Contextual Routing • Terminal Sinks can directly use Headers to make destination selections • HDFS Sink can use headers values to create dynamic path for files that the event will be added to. • Some headers such as timestamps can be used in a more sophisticated manner • Custom Channel Selector can be used for doing specialized routing where necessary 19
  • 20. Load Balancing and Failover Sink Processor A Sink Processor is responsible for invoking one sink from an assigned group of sinks. • Built-in Sink Processors: • Load Balancing Sink Processor – using RANDOM, ROUND_ROBIN or Custom selection algorithm • Failover Sink Processor • Default Sink Processor 20
  • 21. Sink Processor • Invoked by Sink Runner • Acts as a proxy for a Sink 21
  • 22. Sink Processor Configuration • A Sink can exist in at most one group • A Sink that is not in any group is handled via Default Sink Processor • Caution: Removing a Sink Group does not make the sinks inactive! 22
  • 23. Client API • Simple API that can be used to send data to Flume agents. • Simplest form – send a batch of events to one agent. • Can be used to send data to multiple agents in a round-robin, random or failover fashion (send data to one till it fails). • Java only. • flume.thrift can be used to generate code for other languages. • Use with Thrift source. 23
  • 24. Embedded Agent • Client API throws exceptions if data could not be sent. • Applications may not be able to tolerate this. • Embedded Agent – A (limited) Flume agent within your application • Has a channel – so buffers data in-memory or on-disk till the data is sent or the channel is full. • Throws exceptions only if the channel is full (or error writing to channel). • Cushion for application if something causes data to be stuck within the application • Supports sending data to other Flume agents only, no HDFS, HBase etc. 24
  • 25. Summary • Clients send Events to Agents • Agents hosts number Flume components – Source, Interceptors, Channel Selectors, Channels, Sink Processors, and Sinks. • Sources and Sinks are active components, where as Channels are passive • Source accepts Events, passes them through Interceptor(s), and if not filtered, puts them on channel(s) selected by the configured Channel Selector • Sink Processor identifies a sink to invoke, that can take Events from a Channel and send it to its next hop destination • Channel operations are transactional to guarantee one-hop delivery semantics • Channel persistence allows for ensuring end-to-end reliability 25