SlideShare uma empresa Scribd logo
1 de 18
Kafka & 
Hadoop 
Gwen Shapira / Software Engineer
About Me 
‱ 15 years of moving data around 
‱ Formerly consultant 
‱ Now Cloudera Engineer: 
©2014 Cloudera, Inc. All rights reserved. 2 
– Flume 
– Sqoop 
– Kafka
There’s a book on that! 
©2014 Cloudera, Inc. All rights reserved. 3
We are also blogging 
©2014 Cloudera, Inc. All rights reserved. 4
5 
Getting Data from Kafka to Hadoop 
There are only 
bad options. 
It's about finding 
the best one. 
©2014 Cloudera, Inc. All rights reserved.
©2014 Cloudera, Inc. All rights reserved. 6 
Camus
©2014 Cloudera, Inc. All rights reserved. 7 
Camus 
Setup 
ZooKeeper 
Topic Offsets 
Other Systems HDFS Processes 
Task 
Task 
Task 
In process 
Avro Files 
In process 
Avro Files 
Audit Counts 
Clean Up 
Kakfa 
B 
A 
C 
D 
F 
G H 
I 
E
Missing in Action 
‱ Kafka has no MR layer 
– InputFormat, OutputFormat, Utils
 
‱ Sqoop is a generic batch ingest framework 
©2014 Cloudera, Inc. All rights reserved. 8 
– Why no Kafka?
Flume + Kafka = Flafka 
©2014 Cloudera, Inc. All rights reserved. 9
10 
How does work? 
Sources Interceptors Selectors Channels Sinks 
Flume Agent 
Twitter, logs, 
webserver, 
Kafka
 
Mask, re-format, 
validate
 
DR, critical 
Memory, file 
HDFS, 
Hbase, Solr, 
Kafka
11 
But I just want to 
get data from Kafka 
to Hbase / HDFS 
©2014 Cloudera, Inc. All rights reserved.
12 
Channels Sinks 
Kafka Channel 
Flume Agent 
Kafka! HDFS, 
Hbase, Solr
SparkStreaming 
Single Pass 
©2014 Cloudera, Inc. All rights reserved. 13 
Source 
RawInput 
DStream 
RDD 
Source 
RawInput 
DStream 
RDD 
RDD 
Filter Count Print 
Source 
RawInput 
DStream 
RDD 
RDD 
RDD 
Single Pass 
Filter Count Print 
Pre-first 
Batch 
First 
Batch 
Second 
Batch
©2014 Cloudera, Inc. All rights reserved. 14 
Storm 
Spout 
Source 
Split 
words 
bolts 
Split 
words 
bolts 
Spout 
Split 
words 
bolts 
Split 
words 
bolts 
Count 
Count 
Count 
Spout Layer Fan out Layer 1 Shuffle Layer 2
Retro Thoughts 
©2014 Cloudera, Inc. All rights reserved. 15
‱ Data often has schema 
‱ At least it should 
‱ Kafka is unaware – which is good 
‱ Need capability to figure out schema for events 
‱ Without including it in every event 
©2014 Cloudera, Inc. All rights reserved. 16 
Schema
Kafka in Cloudera Manager 
©2014 Cloudera, Inc. All rights reserved. 17
18 
Visit us 
at 
Booth 
#305 
BOOK SIGNINGS THEATER SESSIONS 
TECHNICAL DEMOS GIVEAWAYS

Mais conteĂșdo relacionado

Mais procurados

Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
JAXLondon2014
 
Big data conference europe real-time streaming in any and all clouds, hybri...
Big data conference europe   real-time streaming in any and all clouds, hybri...Big data conference europe   real-time streaming in any and all clouds, hybri...
Big data conference europe real-time streaming in any and all clouds, hybri...
Timothy Spann
 
Spark summit-east-dowling-feb2017-full
Spark summit-east-dowling-feb2017-fullSpark summit-east-dowling-feb2017-full
Spark summit-east-dowling-feb2017-full
Jim Dowling
 

Mais procurados (20)

Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
 
Architecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructureArchitecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructure
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
 
Introduction Apache Kafka
Introduction Apache KafkaIntroduction Apache Kafka
Introduction Apache Kafka
 
How Apache Kafka is transforming Hadoop, Spark and Storm
How Apache Kafka is transforming Hadoop, Spark and StormHow Apache Kafka is transforming Hadoop, Spark and Storm
How Apache Kafka is transforming Hadoop, Spark and Storm
 
Apache Pulsar and Github
Apache Pulsar and GithubApache Pulsar and Github
Apache Pulsar and Github
 
Fraud Detection Architecture
Fraud Detection ArchitectureFraud Detection Architecture
Fraud Detection Architecture
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
Spark streaming + kafka 0.10
Spark streaming + kafka 0.10Spark streaming + kafka 0.10
Spark streaming + kafka 0.10
 
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - Verisign
 
Developing with the Go client for Apache Kafka
Developing with the Go client for Apache KafkaDeveloping with the Go client for Apache Kafka
Developing with the Go client for Apache Kafka
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
kafka for db as postgres
kafka for db as postgreskafka for db as postgres
kafka for db as postgres
 
A la rencontre de Kafka, le log distribué par Florian GARCIA
A la rencontre de Kafka, le log distribué par Florian GARCIAA la rencontre de Kafka, le log distribué par Florian GARCIA
A la rencontre de Kafka, le log distribué par Florian GARCIA
 
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
 
Real time Messages at Scale with Apache Kafka and Couchbase
Real time Messages at Scale with Apache Kafka and CouchbaseReal time Messages at Scale with Apache Kafka and Couchbase
Real time Messages at Scale with Apache Kafka and Couchbase
 
Kafka and Spark Streaming
Kafka and Spark StreamingKafka and Spark Streaming
Kafka and Spark Streaming
 
Spark optimization
Spark optimizationSpark optimization
Spark optimization
 
Big data conference europe real-time streaming in any and all clouds, hybri...
Big data conference europe   real-time streaming in any and all clouds, hybri...Big data conference europe   real-time streaming in any and all clouds, hybri...
Big data conference europe real-time streaming in any and all clouds, hybri...
 
Spark summit-east-dowling-feb2017-full
Spark summit-east-dowling-feb2017-fullSpark summit-east-dowling-feb2017-full
Spark summit-east-dowling-feb2017-full
 

Destaque

Fast Data Driving Personalization - Nick Gorski
Fast Data Driving Personalization - Nick GorskiFast Data Driving Personalization - Nick Gorski
Fast Data Driving Personalization - Nick Gorski
Hakka Labs
 
Ad Personalization at Spotify: Iterative Enginering and Product Development -...
Ad Personalization at Spotify: Iterative Enginering and Product Development -...Ad Personalization at Spotify: Iterative Enginering and Product Development -...
Ad Personalization at Spotify: Iterative Enginering and Product Development -...
Hakka Labs
 

Destaque (20)

Always Valid Inference (Ramesh Johari, Stanford)
Always Valid Inference (Ramesh Johari, Stanford)Always Valid Inference (Ramesh Johari, Stanford)
Always Valid Inference (Ramesh Johari, Stanford)
 
Sf data mining_meetup
Sf data mining_meetupSf data mining_meetup
Sf data mining_meetup
 
Data Pipeline at Tapad
Data Pipeline at TapadData Pipeline at Tapad
Data Pipeline at Tapad
 
Fast Data Driving Personalization - Nick Gorski
Fast Data Driving Personalization - Nick GorskiFast Data Driving Personalization - Nick Gorski
Fast Data Driving Personalization - Nick Gorski
 
Understanding Hadoop through examples
Understanding Hadoop through examplesUnderstanding Hadoop through examples
Understanding Hadoop through examples
 
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clustersNyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
 
Adaptive Data Cleansing with StreamSets and Cassandra
Adaptive Data Cleansing with StreamSets and CassandraAdaptive Data Cleansing with StreamSets and Cassandra
Adaptive Data Cleansing with StreamSets and Cassandra
 
Logging infrastructure for Microservices using StreamSets Data Collector
Logging infrastructure for Microservices using StreamSets Data CollectorLogging infrastructure for Microservices using StreamSets Data Collector
Logging infrastructure for Microservices using StreamSets Data Collector
 
Ad Personalization at Spotify: Iterative Enginering and Product Development -...
Ad Personalization at Spotify: Iterative Enginering and Product Development -...Ad Personalization at Spotify: Iterative Enginering and Product Development -...
Ad Personalization at Spotify: Iterative Enginering and Product Development -...
 
Universitélang scala tools
Universitélang scala toolsUniversitélang scala tools
Universitélang scala tools
 
Maven c'est bien, SBT c'est mieux
Maven c'est bien, SBT c'est mieuxMaven c'est bien, SBT c'est mieux
Maven c'est bien, SBT c'est mieux
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010
 
Les monades Scala, Java 8
Les monades Scala, Java 8Les monades Scala, Java 8
Les monades Scala, Java 8
 
Université des langages scala
Université des langages   scalaUniversité des langages   scala
Université des langages scala
 
Scala Intro
Scala IntroScala Intro
Scala Intro
 
Lagom, reactive framework
Lagom, reactive frameworkLagom, reactive framework
Lagom, reactive framework
 
Introduction Ă  Scala - Michel Schinz - January 2010
Introduction Ă  Scala - Michel Schinz - January 2010Introduction Ă  Scala - Michel Schinz - January 2010
Introduction Ă  Scala - Michel Schinz - January 2010
 
Kafka at scale facebook israel
Kafka at scale   facebook israelKafka at scale   facebook israel
Kafka at scale facebook israel
 
Streaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data MeetupStreaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data Meetup
 

Semelhante a Kafka & Hadoop - for NYC Kafka Meetup

Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream Data
DataWorks Summit
 

Semelhante a Kafka & Hadoop - for NYC Kafka Meetup (20)

Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoop
 
Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopArchitecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with Hadoop
 
HBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and SparkHBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and Spark
 
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?
 
Hadoop operations
Hadoop operationsHadoop operations
Hadoop operations
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
Webinar: The Future of Hadoop
Webinar: The Future of HadoopWebinar: The Future of Hadoop
Webinar: The Future of Hadoop
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best Practices
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream Data
 
Kafka for DBAs
Kafka for DBAsKafka for DBAs
Kafka for DBAs
 
Data Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the EnterpriseData Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the Enterprise
 
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?
 
Applications on Hadoop
Applications on HadoopApplications on Hadoop
Applications on Hadoop
 
The State of HBase Replication
The State of HBase ReplicationThe State of HBase Replication
The State of HBase Replication
 
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?
 
Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014Application architectures with hadoop – big data techcon 2014
Application architectures with hadoop – big data techcon 2014
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014
 
Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop Summit
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with spark
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 

Mais de Gwen (Chen) Shapira

Scaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding FailureScaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding Failure
Gwen (Chen) Shapira
 
Intro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupIntro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data Meetup
Gwen (Chen) Shapira
 
Data Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for HadoopData Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for Hadoop
Gwen (Chen) Shapira
 
Scaling etl with hadoop shapira 3
Scaling etl with hadoop   shapira 3Scaling etl with hadoop   shapira 3
Scaling etl with hadoop shapira 3
Gwen (Chen) Shapira
 
Visualizing database performance hotsos 13-v2
Visualizing database performance   hotsos 13-v2Visualizing database performance   hotsos 13-v2
Visualizing database performance hotsos 13-v2
Gwen (Chen) Shapira
 

Mais de Gwen (Chen) Shapira (20)

Velocity 2019 - Kafka Operations Deep Dive
Velocity 2019  - Kafka Operations Deep DiveVelocity 2019  - Kafka Operations Deep Dive
Velocity 2019 - Kafka Operations Deep Dive
 
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote
 
Gluecon - Kafka and the service mesh
Gluecon - Kafka and the service meshGluecon - Kafka and the service mesh
Gluecon - Kafka and the service mesh
 
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
 
Papers we love realtime at facebook
Papers we love   realtime at facebookPapers we love   realtime at facebook
Papers we love realtime at facebook
 
Kafka reliability velocity 17
Kafka reliability   velocity 17Kafka reliability   velocity 17
Kafka reliability velocity 17
 
Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017
 
Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be there
 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
 
R for hadoopers
R for hadoopersR for hadoopers
R for hadoopers
 
Scaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding FailureScaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding Failure
 
Intro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupIntro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data Meetup
 
Incredible Impala
Incredible Impala Incredible Impala
Incredible Impala
 
Data Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for HadoopData Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for Hadoop
 
Scaling etl with hadoop shapira 3
Scaling etl with hadoop   shapira 3Scaling etl with hadoop   shapira 3
Scaling etl with hadoop shapira 3
 
Is hadoop for you
Is hadoop for youIs hadoop for you
Is hadoop for you
 
Ssd collab13
Ssd   collab13Ssd   collab13
Ssd collab13
 
Integrated dwh 3
Integrated dwh 3Integrated dwh 3
Integrated dwh 3
 
Visualizing database performance hotsos 13-v2
Visualizing database performance   hotsos 13-v2Visualizing database performance   hotsos 13-v2
Visualizing database performance hotsos 13-v2
 
Flexible Design
Flexible DesignFlexible Design
Flexible Design
 

Último

%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
masabamasaba
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
masabamasaba
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
VictoriaMetrics
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
VictorSzoltysek
 

Último (20)

call girls in Vaishali (Ghaziabad) 🔝 >àŒ’8448380779 🔝 genuine Escort Service đŸ”âœ”ïžâœ”ïž
call girls in Vaishali (Ghaziabad) 🔝 >àŒ’8448380779 🔝 genuine Escort Service đŸ”âœ”ïžâœ”ïžcall girls in Vaishali (Ghaziabad) 🔝 >àŒ’8448380779 🔝 genuine Escort Service đŸ”âœ”ïžâœ”ïž
call girls in Vaishali (Ghaziabad) 🔝 >àŒ’8448380779 🔝 genuine Escort Service đŸ”âœ”ïžâœ”ïž
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 

Kafka & Hadoop - for NYC Kafka Meetup

  • 1. Kafka & Hadoop Gwen Shapira / Software Engineer
  • 2. About Me ‱ 15 years of moving data around ‱ Formerly consultant ‱ Now Cloudera Engineer: ©2014 Cloudera, Inc. All rights reserved. 2 – Flume – Sqoop – Kafka
  • 3. There’s a book on that! ©2014 Cloudera, Inc. All rights reserved. 3
  • 4. We are also blogging ©2014 Cloudera, Inc. All rights reserved. 4
  • 5. 5 Getting Data from Kafka to Hadoop There are only bad options. It's about finding the best one. ©2014 Cloudera, Inc. All rights reserved.
  • 6. ©2014 Cloudera, Inc. All rights reserved. 6 Camus
  • 7. ©2014 Cloudera, Inc. All rights reserved. 7 Camus Setup ZooKeeper Topic Offsets Other Systems HDFS Processes Task Task Task In process Avro Files In process Avro Files Audit Counts Clean Up Kakfa B A C D F G H I E
  • 8. Missing in Action ‱ Kafka has no MR layer – InputFormat, OutputFormat, Utils
 ‱ Sqoop is a generic batch ingest framework ©2014 Cloudera, Inc. All rights reserved. 8 – Why no Kafka?
  • 9. Flume + Kafka = Flafka ©2014 Cloudera, Inc. All rights reserved. 9
  • 10. 10 How does work? Sources Interceptors Selectors Channels Sinks Flume Agent Twitter, logs, webserver, Kafka
 Mask, re-format, validate
 DR, critical Memory, file HDFS, Hbase, Solr, Kafka
  • 11. 11 But I just want to get data from Kafka to Hbase / HDFS ©2014 Cloudera, Inc. All rights reserved.
  • 12. 12 Channels Sinks Kafka Channel Flume Agent Kafka! HDFS, Hbase, Solr
  • 13. SparkStreaming Single Pass ©2014 Cloudera, Inc. All rights reserved. 13 Source RawInput DStream RDD Source RawInput DStream RDD RDD Filter Count Print Source RawInput DStream RDD RDD RDD Single Pass Filter Count Print Pre-first Batch First Batch Second Batch
  • 14. ©2014 Cloudera, Inc. All rights reserved. 14 Storm Spout Source Split words bolts Split words bolts Spout Split words bolts Split words bolts Count Count Count Spout Layer Fan out Layer 1 Shuffle Layer 2
  • 15. Retro Thoughts ©2014 Cloudera, Inc. All rights reserved. 15
  • 16. ‱ Data often has schema ‱ At least it should ‱ Kafka is unaware – which is good ‱ Need capability to figure out schema for events ‱ Without including it in every event ©2014 Cloudera, Inc. All rights reserved. 16 Schema
  • 17. Kafka in Cloudera Manager ©2014 Cloudera, Inc. All rights reserved. 17
  • 18. 18 Visit us at Booth #305 BOOK SIGNINGS THEATER SESSIONS TECHNICAL DEMOS GIVEAWAYS

Notas do Editor

  1. This gives me a lot of perspective regarding the use of Hadoop
  2. https://gist.github.com/gwenshap/9699072
  3. Batch MapReduce job. Exactly once semantics. Run once every X minutes.
  4. A - The setup stage fetches broker urls and topic information from ZooKeeper. B - The setup stage persists information about topics and offsets in HDFS for the tasks to read. C - The tasks read the persisted information from the setup stage. D - The tasks get events from Kakfa. E - The tasks write data to a temp location in HDFS in the format defined by the user defined decoder, in this case Avro formatted files. F - The tasks move the data in the temp location to a final location when the task is cleaning up. G - The task writes out audit counts on its activities. H - A clean up stage reads all the audit counts from all the tasks. I - The clean up stage reports back to Kakfa what has been persisted.
  5. Kafka source + sink for Flume
  6. Does not require programming.
  7. Does not require programming.
  8. MicroBatch stream processing framework. Basically Spark code executed in a slightly different context and some windowing functions.
  9. Stream processing framework. Quite popular. Can be event-based or micro-batching (with Trident). Requires low level awareness of API.