SlideShare a Scribd company logo
1 of 27
Data Freeway :  Scaling Out to Realtime ,[object Object],[object Object]
Agenda ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Big Data, Big Applications / Data at Facebook ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Data Freeway System Diagram
Realtime Requirements ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Scribe ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Calligraphus ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
HDFS : a different use case ,[object Object],[object Object],[object Object]
HDFS : add Sync ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
HDFS : Concurrent Reads Overview ,[object Object],[object Object]
HDFS : Concurrent Reads Implementation ,[object Object],[object Object],[object Object]
HDFS : Checksum Problem ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Calligraphus: Log Writer Calligraphus Servers HDFS Scribe categories Server Server Server Category 1 Category 2 Category 3 ,[object Object]
Calligraphus (Simple) Calligraphus Servers HDFS Scribe categories Number of categories Number of servers Total number of directories x = Server Server Server Category 1 Category 2 Category 3
Calligraphus (Stream Consolidation) Calligraphus Servers HDFS Scribe categories Number of categories Total number of directories = Category 1 Category 2 Category 3 Router Router Router Writer Writer Writer ZooKeeper
ZooKeeper: Distributed Map ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],A 1 5 2 3 4 B 1 5 2 3 4 C 1 5 2 3 4 D 1 5 2 3 4 Root
ZooKeeper: Distributed Map ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Distributed Map: Performance Summary ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Canonical Realtime Application ,[object Object],[object Object],[object Object],[object Object]
Parallel Tailer ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Canonical Realtime ptail Application
Puma Overview ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Puma Write Path
Puma Read Path
Summary - Data Freeway ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Future Work ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Fin ,[object Object]

More Related Content

What's hot

Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit
 
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)Gruter
 
Efficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoEfficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoHyunsik Choi
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Databricks
 
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Sumeet Singh
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopDataWorks Summit
 
HBaseCon 2015: HBase Operations in a Flurry
HBaseCon 2015: HBase Operations in a FlurryHBaseCon 2015: HBase Operations in a Flurry
HBaseCon 2015: HBase Operations in a FlurryHBaseCon
 
HUG August 2010: Best practices
HUG August 2010: Best practicesHUG August 2010: Best practices
HUG August 2010: Best practicesHadoop User Group
 
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, ClouderaSolr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, ClouderaLucidworks
 
Tales from the Cloudera Field
Tales from the Cloudera FieldTales from the Cloudera Field
Tales from the Cloudera FieldHBaseCon
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestHBaseCon
 
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...Michael Stack
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
What's New Tajo 0.10 and Its Beyond
What's New Tajo 0.10 and Its BeyondWhat's New Tajo 0.10 and Its Beyond
What's New Tajo 0.10 and Its BeyondGruter
 
HBaseCon 2012 | HBase, the Use Case in eBay Cassini
HBaseCon 2012 | HBase, the Use Case in eBay Cassini HBaseCon 2012 | HBase, the Use Case in eBay Cassini
HBaseCon 2012 | HBase, the Use Case in eBay Cassini Cloudera, Inc.
 
Apache HBase in the Enterprise Data Hub at Cerner
Apache HBase in the Enterprise Data Hub at CernerApache HBase in the Enterprise Data Hub at Cerner
Apache HBase in the Enterprise Data Hub at CernerHBaseCon
 
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Hadoop / Spark Conference Japan
 
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
Teradata Partners Conference Oct 2014   Big Data Anti-PatternsTeradata Partners Conference Oct 2014   Big Data Anti-Patterns
Teradata Partners Conference Oct 2014 Big Data Anti-PatternsDouglas Moore
 

What's hot (20)

Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod Narasimha
 
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
Gruter_TECHDAY_2014_03_ApacheTajo (in Korean)
 
Efficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoEfficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajo
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
 
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
 
Powering a Virtual Power Station with Big Data
Powering a Virtual Power Station with Big DataPowering a Virtual Power Station with Big Data
Powering a Virtual Power Station with Big Data
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
 
HBaseCon 2015: HBase Operations in a Flurry
HBaseCon 2015: HBase Operations in a FlurryHBaseCon 2015: HBase Operations in a Flurry
HBaseCon 2015: HBase Operations in a Flurry
 
HUG August 2010: Best practices
HUG August 2010: Best practicesHUG August 2010: Best practices
HUG August 2010: Best practices
 
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, ClouderaSolr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
 
Tales from the Cloudera Field
Tales from the Cloudera FieldTales from the Cloudera Field
Tales from the Cloudera Field
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ Pinterest
 
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
 
Gfs vs hdfs
Gfs vs hdfsGfs vs hdfs
Gfs vs hdfs
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
What's New Tajo 0.10 and Its Beyond
What's New Tajo 0.10 and Its BeyondWhat's New Tajo 0.10 and Its Beyond
What's New Tajo 0.10 and Its Beyond
 
HBaseCon 2012 | HBase, the Use Case in eBay Cassini
HBaseCon 2012 | HBase, the Use Case in eBay Cassini HBaseCon 2012 | HBase, the Use Case in eBay Cassini
HBaseCon 2012 | HBase, the Use Case in eBay Cassini
 
Apache HBase in the Enterprise Data Hub at Cerner
Apache HBase in the Enterprise Data Hub at CernerApache HBase in the Enterprise Data Hub at Cerner
Apache HBase in the Enterprise Data Hub at Cerner
 
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
 
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
Teradata Partners Conference Oct 2014   Big Data Anti-PatternsTeradata Partners Conference Oct 2014   Big Data Anti-Patterns
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
 

Viewers also liked

2012 빅데이터 big data 발표자료
2012 빅데이터 big data 발표자료2012 빅데이터 big data 발표자료
2012 빅데이터 big data 발표자료Wooseung Kim
 
빅데이터 플랫폼 새로운 미래
빅데이터 플랫폼 새로운 미래빅데이터 플랫폼 새로운 미래
빅데이터 플랫폼 새로운 미래Wooseung Kim
 
줌인터넷 빅데이터 활용사례 김우승
줌인터넷 빅데이터 활용사례 김우승줌인터넷 빅데이터 활용사례 김우승
줌인터넷 빅데이터 활용사례 김우승Wooseung Kim
 
Best practice instagram
Best practice   instagramBest practice   instagram
Best practice instagramWooseung Kim
 
The Future of Everything
The Future of EverythingThe Future of Everything
The Future of EverythingMichael Ducy
 
Bitcoin 2.0(blockchain technology 2)
Bitcoin 2.0(blockchain technology 2)Bitcoin 2.0(blockchain technology 2)
Bitcoin 2.0(blockchain technology 2)Wooseung Kim
 
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017Carol Smith
 

Viewers also liked (8)

Cloudera's Flume
Cloudera's FlumeCloudera's Flume
Cloudera's Flume
 
2012 빅데이터 big data 발표자료
2012 빅데이터 big data 발표자료2012 빅데이터 big data 발표자료
2012 빅데이터 big data 발표자료
 
빅데이터 플랫폼 새로운 미래
빅데이터 플랫폼 새로운 미래빅데이터 플랫폼 새로운 미래
빅데이터 플랫폼 새로운 미래
 
줌인터넷 빅데이터 활용사례 김우승
줌인터넷 빅데이터 활용사례 김우승줌인터넷 빅데이터 활용사례 김우승
줌인터넷 빅데이터 활용사례 김우승
 
Best practice instagram
Best practice   instagramBest practice   instagram
Best practice instagram
 
The Future of Everything
The Future of EverythingThe Future of Everything
The Future of Everything
 
Bitcoin 2.0(blockchain technology 2)
Bitcoin 2.0(blockchain technology 2)Bitcoin 2.0(blockchain technology 2)
Bitcoin 2.0(blockchain technology 2)
 
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
 

Similar to 2011 06-30-hadoop-summit v5

Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
Server Monitoring (Scaling while bootstrapped)
Server Monitoring  (Scaling while bootstrapped)Server Monitoring  (Scaling while bootstrapped)
Server Monitoring (Scaling while bootstrapped)Ajibola Aiyedogbon
 
Bringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on HadoopBringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on HadoopDataWorks Summit
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesappaji intelhunt
 
Scaling Streaming - Concepts, Research, Goals
Scaling Streaming - Concepts, Research, GoalsScaling Streaming - Concepts, Research, Goals
Scaling Streaming - Concepts, Research, Goalskamaelian
 
Building Big Data Applications using Spark, Hive, HBase and Kafka
Building Big Data Applications using Spark, Hive, HBase and KafkaBuilding Big Data Applications using Spark, Hive, HBase and Kafka
Building Big Data Applications using Spark, Hive, HBase and KafkaAshish Thapliyal
 
Eric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceEric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceHortonworks
 
Hw09 Rethinking The Data Warehouse With Hadoop And Hive
Hw09   Rethinking The Data Warehouse With Hadoop And HiveHw09   Rethinking The Data Warehouse With Hadoop And Hive
Hw09 Rethinking The Data Warehouse With Hadoop And HiveCloudera, Inc.
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...Cloudera, Inc.
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environmentDelhi/NCR HUG
 
Data Infrastructure for a World of Music
Data Infrastructure for a World of MusicData Infrastructure for a World of Music
Data Infrastructure for a World of MusicLars Albertsson
 
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR BenchmarksExtending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR BenchmarksJamie Grier
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Casesnzhang
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015Christopher Curtin
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
 
Fluentd Overview, Now and Then
Fluentd Overview, Now and ThenFluentd Overview, Now and Then
Fluentd Overview, Now and ThenSATOSHI TAGOMORI
 

Similar to 2011 06-30-hadoop-summit v5 (20)

Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Server Monitoring (Scaling while bootstrapped)
Server Monitoring  (Scaling while bootstrapped)Server Monitoring  (Scaling while bootstrapped)
Server Monitoring (Scaling while bootstrapped)
 
Bringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on HadoopBringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on Hadoop
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologies
 
Scaling Streaming - Concepts, Research, Goals
Scaling Streaming - Concepts, Research, GoalsScaling Streaming - Concepts, Research, Goals
Scaling Streaming - Concepts, Research, Goals
 
Building Big Data Applications using Spark, Hive, HBase and Kafka
Building Big Data Applications using Spark, Hive, HBase and KafkaBuilding Big Data Applications using Spark, Hive, HBase and Kafka
Building Big Data Applications using Spark, Hive, HBase and Kafka
 
Eric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers ConferenceEric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers Conference
 
Hw09 Rethinking The Data Warehouse With Hadoop And Hive
Hw09   Rethinking The Data Warehouse With Hadoop And HiveHw09   Rethinking The Data Warehouse With Hadoop And Hive
Hw09 Rethinking The Data Warehouse With Hadoop And Hive
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 
Data Infrastructure for a World of Music
Data Infrastructure for a World of MusicData Infrastructure for a World of Music
Data Infrastructure for a World of Music
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR BenchmarksExtending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
HDF Data in the Cloud
HDF Data in the CloudHDF Data in the Cloud
HDF Data in the Cloud
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Fluentd Overview, Now and Then
Fluentd Overview, Now and ThenFluentd Overview, Now and Then
Fluentd Overview, Now and Then
 
HDInsight for Architects
HDInsight for ArchitectsHDInsight for Architects
HDInsight for Architects
 

Recently uploaded

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 

Recently uploaded (20)

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 

2011 06-30-hadoop-summit v5

Editor's Notes

  1. Sam (35s) -Who am I? -Worked in batch computing and distributed systems for over 10 years at both research-related facilities and in the internet industry -presently work at facebook; have worked on HDFS, scribe, calligraphus, and realtime metrics Eric (?s)
  2. -We’ll give you a little context of how data fits into facebook -also what pieces of datafreeway enable realtime computations (20s)
  3. -we have a lot of data at facebook -some of these stats are probably old (6 months old) -accurate one: we handle about 250 TB of data per day with scribe-hdfs, and probably close to 300 TB total -we have both batch and realtime uses of data -we’ll focus on the components of datafreeway that enable realtime use (35s)
  4. Entry points: A1. tfb www php code (flib/core/scribe/client.php, or use Nectar) A2. binary/script that sends data to scribe directly (fbcode/scribe/if/scribe.thrift with TFramedProtocol to localhost:1456) A* : all data comes in the form category X message Policy System -manages quotas Realtime B1. raw Data Streams: ptail B2. Realtime Analytics: flagship realtime app; ptail -> Puma/HBase Batch B3. Reporting and Batch Processing: Hadoop/Hive Cluster (Daily tables and Current Tables) (1m30s)
  5. Realtime requirements Scalable: also ease of operations Reliable: Data loss sla – this means bound loss due to hw failure Fast: Data latency sla – again, in the face of slow and failing hardware Easy to use: ptail –f –time (1m30s)
  6. -scribe is reliable: handles buffering when scribeh is down -really easy to use due to being on every host + language agnostic thrift api -major reason we are doing 7 GB/s (40s)
  7. what -in a nut-shell, calligraphus is scribe-compatible server that logs to HDFS - why -decided on re-writing just part of scribe for scribeh; java made more sense -libhdfs and JNI has had not only memory leaks, but improper error return codes; ‘exists’ : not there and error both -1 Status -Hope to open source soon (1m)
  8. -this is a different use of HDFS (not a batch system) -think of each entity as a publisher of data (say they are web hosts) -the consumer of a ptail output stream is a subscriber -writer tags what it publishes with category 1, reader asks for all messages tagged with category 1 -this way, you get both a low-latency pub-sub system and reliable persistence on HDFS! (45-60s)
  9. -how do we get to HDFS as a message hub? -we first need to make sure clients can have a guarantee writes are persisted and available for read -the fsync call only happens the first sync() call; the blocks need to be persisted only once so we don ’t hit the Namenode on every client sync all -return from sync means : blocks persisted and every datanode has written the block to disk (may be in OS buffers though) -typically sync is very expensive. Experiments with HBase write-ahead-log show 50% drop in throughput with frequent sync calls (1m)
  10. -with sync, next step is to make blocks being written available for read; -recall that one of our realtime requirements is 10s latency for reading data -typical block size is 512MB -4 GB/s is split out across 1000s of files; compression -without concurrent reads, data would not be visible for 1 hour (when we roll files) -concurrent reads essential to satisfy our realtime requirements (1m)
  11. -updated FSNameSystem so that Namenode will return the targets for any file that is currently being written -updated DFSClient so that if it tries to read the block being written, gets the length from the datanode -can then start reading the data up to the length at the time datanode was queried -note: one problem here, visible length per datanode can vary, so if a datanode goes down, it ’s possible to get a different datanode with a shorter length; especially true in the case of lease recovery (truncation) -solved in 0.22 with visible length (sort of) (1m15s)
  12. -the issue here is that data and metadata can be out of sync for last 512 byte chunk -define visible length is the length of data for which metadata exists and track in memory -if we read more data than that, we know we will get a CRC error and can therefore compute the CRC to send to the client -note: we only do the on-the-fly CRC in the strictest of situation so we don ’t regress and miss disk errors -trunk solution is more elegant and was done afterwards; possible to back-port (1m)
  13. Calligraphus’ role in the Datafreeway pipeline is to take incoming data streams and persist them to HDFS Many thousands of client hosts generating log data for a number of categories and are then delivered to some randomly selected calligraphus servers => load balancing purposes Since ratio of clients to servers is very high, we can safely assume every server will receive roughly every category Question is in what way should we persist this to disk? In our system every stream written to HDFS needs to corresponds to a directory in which data is appended in the form of files To maintain proper stream semantics for downstream components, which will follow each directory as an independent stream. Thus, should avoid having multiple writers for one directory (a.k.a stream)
  14. A really simple solution is just to have every writer write each category stream independently. Problem with independent writer approach: RESOURCE INEFFICIENT AND UNSCALABLE! Number of output streams is a approximately the product of # categories and # servers! Scaling to more machines or categories takes its toll on the name node (resource bottleneck) as well as down stream components that need to read these streams on a per category basis
  15. This is the approach that we took. A more suitable solution is to do data stream consolidation to reduce the number of output streams. We do this by breaking the Calligraphus servers into two logical components: a router and writer. Then we add add an intermediate shuffle phase between the router and writer tier before writing to HDFS in order to consolidate streams. Calligraphus writers are assigned category streams or portions of category streams (for large streams) to be written. Based on this assignment, routers direct data streams to the appropriate writers This drastically reduces the number of output data streams and minimizes all of the problems we mentioned in the last slide. ZooKeeper is the core component in facilitating router-writer interactions, serving as a distributed map for routers and as a stream assignment platform for writers.
  16. ZooKeeper can be viewed as a type of light-weight distributed file system. You can put and get data from a hierarchy namespace of nodes that can be accessed like file system paths. The paths that we define consist of a category/bucket pair and these serve as the root nodes for leader elections. We use buckets in addition to categories in order to partition category streams that are too large to handle for one writer. We run leader elections under each of these paths to determine task assignment. Since elections winners may not always have enough capacity, election winners can choose to reject leadership either immediately or shed if off in the future if load changes Aggressive reader side caches to minimize ZooKeeper network IO for queries. This is ok because mappings do not frequently change after stabilizing. Policy db sync propagates new categories into the maps and adjusts numbers of buckets (on the fly)
  17. Highly available is an inherent property of ZooKeeper. We can lose a few ZooKeeper nodes and still keep going. No centralized authority (independent elections, writers independently manage load) => no single point of failure Fast map lookups a result of client caching Failover => another property of using ZooKeeper’s ephemeral nodes for leader elections Adapt to changing conditions by adding new election root nodes for new categories or by adjusting number of buckets for a category
  18. We can enter about 3000 elections in just under 30 seconds routers can run elections to find leaders on all buckets in about 5-10 seconds Stable configuration it takes a bit of time since we have slow start phase to balance load, and writers need to have a little time after each bucket acquisition to determine their load -- but this is ok for our use case since in the long run the mapping is stable. If no mapping exists for some data stream, we buffer the data until it is defined. Later on during operation, we can dandles incremental changes to mapping very quickly. We can respond to election events or failures in less than a second.
  19. -Servers log data to NFS filer in a way that consolidates the data into one or more files -tailer app then reads the data -examples: all need data in a timely fashion (30s)
  20. 2 key points Hides the fact we have many HDFS instances: user can specify a category and get a stream Checkpointing -high value of generalization of checkpoint mechanism (used in puma) (45s)
  21. Configure servers local scribed to write to scribeh Client uses ptail app to get an aggregated log stream in realtime (25s)
  22. -the goal of Puma is to provide a configurable realtime analytics platform. -customers can setup pipelines here similar to with Hive, but get the data in realtime -puma is in fact a canonical streaming app in how it uses ptail -we also leverage HBase for persistence (30s)
  23. Write Flow: Ptail delivers log lines at up to 600,000 per second Driver includes a parser and processor that filters lines and sends appropriate lines to the aggregation store Aggregation Store will update any metrics based on the parsed entry If the Driver sees a checkpoint line, it instead passes it to the checkpoint handler The checkpoint handler decides if it it should flush; if so, tells Aggregation to persist changes to Storage and then writes checkpoint data to storage Notes * Storage is an interface; have memory and Hbase; can change to other forms such as MySQL (1m15s)
  24. Read path Client makes request to a thrift server: Sever proxies the request to the Store implementation which queries HBase Performance Elapsed time typically 200-300 ms for 30 day queries 99 th percentile, cross-country, < 500ms for 30 day queries (1m)
  25. -scribe & calligraphus get data into the system -HDFS at the core -ptail provides data out -puma is our emerging streaming analytics platform (20s)
  26. -puma: basically implemented a shared write-ahead-log in an Hbase table for our specific use case -using LZO or GZIP, we can ’t seek to the end of the uncompressed stream -solution: provide a basic container structure with information about compressed/uncompressed blocks. -each 1M block contains a header with a list of compressed/uncompressed offset in stream (1m)