Realtime Analytics using Hadoop & HBase
Lars George, Solutions Architect @ Cloudera
lars@cloudera.com



Monday, July 25, 2011
About Me

                      • Solutions Architect @ Cloudera
                      • Apache HBase & Whirr Committer
                      • Working with Hadoop & HBase since 2007
                      • Author of O’Reilly’s “HBase: The Definitive Guide”



The Application Stack
                      • Solve Business Goals
                      • Rely on Proven Building Blocks
                      • Rapid Prototyping
                       ‣ Templates, MVC, Reference
                         Implementations
                      • Evolutionary Innovation Cycles
                                 “Let there be light!”
LAMP



                      L   Linux
                      A   Apache
                      M   MySQL
                      P   PHP/Perl

                      L   Linux
                      A   Apache
                      M   MySQL
                      M   Memcache
                      P   PHP/Perl

The Dawn of Big Data
                      •   Industry verticals produce a staggering amount of data
                      •   Not only web properties, but also “brick and mortar”
                          businesses
                          ‣   Smart Grid, Bioinformatics, Financial, Telco
                      •   Scalable computation frameworks allow analysis of all the data
                          ‣   No sampling anymore
                      •   Suitable algorithms derive even more data
                          ‣   Machine learning
                      •   “The Unreasonable Effectiveness of Data”
                          ‣   More data is better than smart algorithms

Hadoop

                      • HDFS + MapReduce
                      • Based on Google Papers
                      • Distributed Storage and Computation
                        Framework
                      • Affordable Hardware, Free Software
                      • Significant Adoption
HDFS
                      •   Reliably store petabytes of replicated data across
                          thousands of nodes
                          ‣ Data divided into 64MB blocks, each block replicated
                            three times
                      •   Master/Slave Architecture
                          ‣ Master NameNode holds the metadata
                          ‣ Slave DataNodes manage blocks on the local file system
                      •   Built on “commodity” hardware
                          ‣ No 15k RPM disks or RAID required (nor wanted!)
                          ‣ Commodity Server Hardware
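
The block and replication figures above can be turned into a quick storage estimate. The following is a minimal sketch, assuming the 64 MB block size and replication factor of three from the slide; the function name `hdfs_footprint` is hypothetical and not part of any Hadoop API.

```python
# Rough HDFS footprint estimate; illustrative only, not a Hadoop API.
BLOCK_SIZE = 64 * 1024 ** 2   # 64 MB, as stated on the slide
REPLICATION = 3               # each block replicated three times


def hdfs_footprint(file_size_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (logical blocks, stored block replicas, raw bytes consumed)."""
    # Integer ceiling division: a partial last block still counts as a block.
    blocks = (file_size_bytes + block_size - 1) // block_size
    replicas = blocks * replication
    # The last block only occupies as much disk as it holds data,
    # so raw usage is the file size times the replication factor.
    raw_bytes = file_size_bytes * replication
    return blocks, replicas, raw_bytes


blocks, replicas, raw_bytes = hdfs_footprint(1024 ** 3)  # a 1 GB file
print(blocks, replicas, raw_bytes)
```

For a 1 GB file this yields 16 blocks, 48 stored replicas, and 3 GB of raw disk usage across the cluster.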

MapReduce
                      • Distributed programming model to reliably
                        process petabytes of data
                      • Locality of data to processing is vital
                        ‣ Run code where data resides
                      • Inspired by map and reduce functions in
                        functional programming

          Input ➜ Map() ➜ Copy/Sort ➜ Reduce() ➜ Output
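
The Input ➜ Map() ➜ Copy/Sort ➜ Reduce() ➜ Output flow can be sketched in a single process. This is a toy word count in plain Python, not Hadoop's API; the function names (`map_phase`, `copy_sort`, `reduce_phase`, `run_job`) are made up for illustration, and a real job would run the phases in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map(): emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def copy_sort(pairs):
    """Copy/Sort: group all values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce(): collapse all values of one key into a single result."""
    return key, sum(values)


def run_job(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    return [reduce_phase(key, values) for key, values in copy_sort(mapped)]


print(run_job(["hadoop hbase hadoop", "hbase flume"]))
```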
From Short to Long Term
                      Internet


                      LAM(M)P
                                 • Serves the Client
                                 • Stores Intermediate Data

                      Hadoop
                                 • Background Batch Processing
                                 • Stores Long-Term Data

Batch Processing
                      •   Scale is Unlimited
                          ‣ Bound only by Hardware
                      •   Harness the Power of the Cluster
                          ‣ CPUs, Disks, Memory

                      •   Disks extend Memory
                          ‣ Spills represent Swapping

                      •   Trade Size Limitations for Time
                          ‣ Jobs run from a few minutes to hours or days
From Batch to Realtime
                      •   “Time is Money”
                      •   Bridging the gap between batch and “now”
                      •   Realtime often means “faster than batch”
                      •   80/20 Rule
                          ‣ Hadoop solves the 80% easily
                          ‣ The remaining 20% takes 80% of the effort
                      •   Go as close as possible, don’t overdo it!

Stop Gap Solutions
                      •   In Memory
                          ‣   Memcached
                          ‣   MemBase
                          ‣   GigaSpaces
                      •   Relational Databases
                          ‣   MySQL
                          ‣   PostgreSQL
                      •   NoSQL
                          ‣   Cassandra
                          ‣   HBase

HBase Architecture




Client Access




Auto Sharding




Distribution




HBase Key Design




Key Cardinality




Fold, Store, and Shift




Complemental Design #1
                      • Keep Backup in HDFS
                      • MapReduce over HDFS
                      • Synchronize HBase
                        ‣ Batch Puts
                        ‣ Bulk Import
                      [Diagram components: Internet, LAM(M)P, Hadoop, HBase]



Complemental Design #2
                      • Add Log Support
                      • Synchronize HBase
                        ‣ Batch Puts
                        ‣ Bulk Import
                      [Diagram components: Internet, LAM(M)P, Flume, Hadoop, HBase]


Mitigation Planning
                      • Reliable storage has top priority
                      • Disaster Recovery
                      • HBase Backups
                        ‣ Export - but what if HBase is “down”
                        ‣ CopyTable - same issue
                        ‣ Snapshots - not available

Complemental Design #3
                      • Add Log Processing
                      • Remove Direct Connection
                      • Synchronize HBase
                        ‣ Batch Puts
                        ‣ Bulk Import
                      [Diagram components: Internet, LAM(M)P, Flume, Hadoop, Log Proc, HBase]


Facebook Insights

                      • > 20B Events per Day
                      • 1M Counter Updates per Second
                        ‣ 100-Node Cluster
                        ‣ 10K OPS per Node


                      Web ➜ Scribe ➜ Ptail ➜ Puma ➜ HBase
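
The throughput figures above are easy to sanity-check. A minimal sketch of the arithmetic, using only the numbers on the slide; note that "> 20B" is a lower bound, so the per-second event rate is a floor, and the helper names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(events_per_day):
    """Average event rate, assuming events are spread evenly over the day."""
    return events_per_day / SECONDS_PER_DAY


def per_node(total_ops_per_second, nodes):
    """Average load per node, assuming a perfectly even key distribution."""
    return total_ops_per_second / nodes


events_per_second = per_second(20_000_000_000)  # > 20B events per day
ops_per_node = per_node(1_000_000, 100)         # 1M updates/s on 100 nodes

print(f"{events_per_second:,.0f} events/s, {ops_per_node:,.0f} counter ops/s per node")
```

This gives roughly 231K events per second on average and matches the 10K OPS per node quoted above. Real traffic is bursty and keys are never perfectly uniform, which is why the key design later in the deck matters.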

Collection Layer

                      • “Like” button triggers AJAX request
                      • Event written to log file using Scribe
                        ‣ Handles aggregation, delivery, file rollover, etc.
                        ‣ Uses HDFS to store files
                      ✓ Use Flume or Scribe

Filter Layer
                      • Ptail “follows” logs written by Scribe
                      • Aggregates from multiple logs
                      • Separates into event types
                        ‣ Sharding for future growth
                      • Facebook internal tool
                      ✓ Use Flume

Batching Layer
                      • Puma batches updates
                        ‣ 1 sec, staggered
                      • Flush batch when the previous one is done
                      • Duration limited by key distribution
                      • Facebook internal tool
                      ✓ Use Coprocessors (0.92.0)
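
The batching idea can be sketched as follows: increments are summed in memory per key and written out as one update per key on each flush. Puma is a Facebook-internal tool, so this is not its implementation; the class `BatchingCounter` and its `sink` callback are hypothetical, and the one-second timer is left to the caller.

```python
from collections import Counter


class BatchingCounter:
    """Collapse many small increments into one write per key per flush."""

    def __init__(self, sink):
        self._sink = sink          # callable that receives {key: delta}
        self._pending = Counter()  # increments gathered since the last flush

    def record(self, key, delta=1):
        self._pending[key] += delta

    def flush(self):
        """Hand the current batch to the sink; return the number of writes."""
        if not self._pending:
            return 0
        batch, self._pending = dict(self._pending), Counter()
        self._sink(batch)
        return len(batch)


store = Counter()                       # stand-in for the counter table
batcher = BatchingCounter(store.update)
for key in ["com.cloudera.www", "com.cloudera.www", "com.cloudera.blog"]:
    batcher.record(key)
writes = batcher.flush()                # three events become two writes
print(writes, dict(store))
```

The fewer distinct keys a batch touches, the more it compresses, which is one reading of "duration limited by key distribution" above.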

Counters
                      •   Store counters per Domain and per URL
                          ‣ Leverage HBase increment (atomic read-modify-write)
                            feature
                      •   Each row is one specific Domain or URL
                      •   The columns are the counters for specific metrics
                      •   Column families are used to group counters by time
                          range
                          ‣ Set time-to-live on CF level to auto-expire counters
                            by age to save space, e.g., 2 weeks on “Daily
                            Counters” family
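
The counter layout above (one row per domain or URL, one column per metric, column families per time range) can be modelled in memory. This is a sketch of the semantics only, not the HBase client API; the class `CounterTable`, the family names, and the qualifier format are assumptions for illustration, and time-to-live expiry is not modelled.

```python
import threading
from collections import defaultdict


class CounterTable:
    """In-memory stand-in for a counter table; not the HBase client API."""

    def __init__(self):
        self._lock = threading.Lock()
        # row key -> (column family, qualifier) -> counter value
        self._rows = defaultdict(lambda: defaultdict(int))

    def increment(self, row, family, qualifier, amount=1):
        """Atomic read-modify-write: add `amount` and return the new value."""
        with self._lock:
            cell = (family, qualifier)
            self._rows[row][cell] += amount
            return self._rows[row][cell]

    def get(self, row, family, qualifier):
        with self._lock:
            return self._rows.get(row, {}).get((family, qualifier), 0)


table = CounterTable()
row = "com.cloudera.www"
# One event updates the same metric in every time-range family.
for family, qualifier in [("hourly", "6pm:total"), ("daily", "1/1:total"), ("lifetime", "total")]:
    table.increment(row, family, qualifier)
print(table.get(row, "hourly", "6pm:total"))
```

Because the increment happens under a lock and returns the new value, concurrent writers never need a separate read followed by a write, which is the property the slide relies on.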

Key Design
                      • Reversed Domains, e.g., “com.cloudera.www”, “com.cloudera.blog”
                        ‣ Helps keep the pages of a site close together, as HBase efficiently
                          scans blocks of sorted keys
                      • Domain Row Key =
                        MD5(Reversed Domain) + Reversed Domain
                        ‣ Leading MD5 hash spreads keys randomly across all regions for
                          load balancing
                        ‣ Hashing only the domain keeps rows grouped per site (and per
                          subdomain if needed)
                      • URL Row Key =
                        MD5(Reversed Domain) + Reversed Domain + URL ID
                        ‣ A unique ID per URL is already available, so make use of it
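
The two row key formulas can be sketched with the standard library. The slide does not say how the hash or the URL ID is encoded, so the raw 16-byte MD5 digest and the 8-byte big-endian URL ID below are assumptions, as are the function names.

```python
import hashlib


def reverse_domain(domain):
    """'www.cloudera.com' -> 'com.cloudera.www'"""
    return ".".join(reversed(domain.lower().split(".")))


def domain_row_key(domain):
    """MD5(Reversed Domain) + Reversed Domain, as bytes."""
    reversed_domain = reverse_domain(domain).encode("utf-8")
    # Assumption: the raw 16-byte digest is used as the prefix.
    return hashlib.md5(reversed_domain).digest() + reversed_domain


def url_row_key(domain, url_id):
    """MD5(Reversed Domain) + Reversed Domain + URL ID, as bytes."""
    # Assumption: the URL ID is a non-negative integer stored in 8 bytes.
    return domain_row_key(domain) + url_id.to_bytes(8, "big")


domain_key = domain_row_key("www.cloudera.com")
url_key = url_row_key("www.cloudera.com", 42)
print(reverse_domain("www.cloudera.com"), len(domain_key), url_key.startswith(domain_key))
```

Since every URL key starts with its domain key, the domain row and all URL rows of one site sort next to each other, while the hash prefix scatters different sites across regions.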

Insights Schema
  Row Key: Domain Row Key
  Columns:
    Hourly Counters CF:    6pm Total = 100 | 6pm Male = 50 | 6pm US = 92 | 7pm ... = 45 | ...
    Daily Counters CF:     1/1 Total = 1000 | 1/1 Male = 320 | 1/1 US = 670 | 2/1 ... = 990 | ...
    Lifetime Counters CF:  Total = 10000 | Male = 6780 | Female = 3220 | US = 9900 | ...

  Row Key: URL Row Key
  Columns:
    Hourly Counters CF:    6pm Total = 10 | 6pm Male = 5 | 6pm US = 9 | 7pm ... = 4 | ...
    Daily Counters CF:     1/1 Total = 100 | 1/1 Male = 20 | 1/1 US = 70 | 2/1 ... = 99 | ...
    Lifetime Counters CF:  Total = 100 | Male = 8 | Female = 92 | US = 100 | ...




Summary
                      • Design for Use-Case
                       ‣ Read, Write, or Both?
                      • Avoid Hotspotting
                       ‣ Region and Table
                      • Manage Automatism at Scale
                       ‣ For now!

Questions?


                      lars@cloudera.com
                      http://cloudera.com





Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 

Último (20)

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 

Realtime Analytics with Hadoop and HBase

  • 1. Realtime Analytics using Hadoop & HBase Lars George, Solutions Architect @ Cloudera lars@cloudera.com Monday, July 25, 11
  • 2. About Me
      • Solutions Architect @ Cloudera
      • Apache HBase & Whirr Committer
      • Working with Hadoop & HBase since 2007
      • Author of O’Reilly’s “HBase - The Definitive Guide”
  • 3. The Application Stack
      • Solve Business Goals
      • Rely on Proven Building Blocks
      • Rapid Prototyping
        ‣ Templates, MVC, Reference Implementations
      • Evolutionary Innovation Cycles
      “Let there be light!”
  • 5. L: Linux, A: Apache, M: MySQL, P: PHP/Perl
  • 6. L: Linux, A: Apache, M: MySQL, M: Memcache, P: PHP/Perl
  • 7. The Dawn of Big Data
      • Industry verticals produce a staggering amount of data
      • Not only web properties, but also “brick and mortar” businesses
        ‣ Smart Grid, Bio Informatics, Financial, Telco
      • Scalable computation frameworks allow analysis of all the data
        ‣ No sampling anymore
      • Suitable algorithms derive even more data
        ‣ Machine learning
      • “The Unreasonable Effectiveness of Data”
        ‣ More data is better than smart algorithms
  • 8. Hadoop
      • HDFS + MapReduce
      • Based on Google Papers
      • Distributed Storage and Computation Framework
      • Affordable Hardware, Free Software
      • Significant Adoption
  • 9. HDFS
      • Reliably store petabytes of replicated data across thousands of nodes
        ‣ Data divided into 64MB blocks, each block replicated three times
      • Master/Slave Architecture
        ‣ Master NameNode contains metadata
        ‣ Slave DataNode manages blocks on the local file system
      • Built on “commodity” hardware
        ‣ No 15k RPM disks or RAID required (nor wanted!)
        ‣ Commodity Server Hardware
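The block and replication figures on the HDFS slide translate into simple arithmetic. The following is a minimal sketch, assuming the 64MB block size and replication factor of three stated on the slide; the function name `hdfs_footprint` is hypothetical and not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 64   # block size stated on the slide
REPLICATION = 3      # replication factor stated on the slide


def hdfs_footprint(file_size_mb):
    """Return (blocks, block_replicas, raw_storage_mb) for a single file.

    Hypothetical helper for illustration only.
    """
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    block_replicas = blocks * REPLICATION
    # The last block of a file only occupies the bytes it actually holds,
    # so raw storage is the file size times the replication factor.
    raw_storage_mb = file_size_mb * REPLICATION
    return blocks, block_replicas, raw_storage_mb


if __name__ == "__main__":
    blocks, replicas, raw_mb = hdfs_footprint(1024)
    print(f"1 GB file: {blocks} blocks, {replicas} block replicas, {raw_mb} MB raw")
```

Under these assumptions a 1 GB file is split into 16 blocks, which the DataNodes store as 48 block replicas.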
  • 10. MapReduce
      • Distributed programming model to reliably process petabytes of data
      • Locality of data to processing is vital
        ‣ Run code where data resides
      • Inspired by map and reduce functions in functional programming
      Input ➜ Map() ➜ Copy/Sort ➜ Reduce() ➜ Output
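The Input ➜ Map() ➜ Copy/Sort ➜ Reduce() ➜ Output flow can be imitated on a single machine. This is a toy sketch in plain Python, not the Hadoop API: the word-count job and the names `map_fn`, `reduce_fn`, and `run_job` are assumptions chosen for illustration, and nothing here is distributed.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(line):
    """Map(): emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce(): sum all counts that were emitted for one word."""
    return word, sum(counts)


def run_job(lines):
    """Run the map, copy/sort, and reduce phases in sequence."""
    mapped = [pair for line in lines for pair in map_fn(line)]
    # Copy/Sort: bring all pairs with the same key next to each other.
    mapped.sort(key=itemgetter(0))
    return [
        reduce_fn(key, (count for _, count in group))
        for key, group in groupby(mapped, key=itemgetter(0))
    ]


if __name__ == "__main__":
    print(run_job(["hadoop hbase hadoop", "hbase flume"]))
```

In a real cluster the map calls run on the nodes that hold the input blocks, and the sort step involves copying data between nodes; the sketch only preserves the order of the phases.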
  • 11. From Short to Long Term
      [Diagram: Internet, LAM(M)P, Hadoop]
      • LAM(M)P
        ‣ Serves the Client
        ‣ Stores Intermediate Data
      • Hadoop
        ‣ Background Batch Processing
        ‣ Stores Long-Term Data
  • 12. Batch Processing
      • Scale is Unlimited
        ‣ Bound only by Hardware
      • Harness the Power of the Cluster
        ‣ CPUs, Disks, Memory
      • Disks extend Memory
        ‣ Spills represent Swapping
      • Trade Size Limitations for Time
        ‣ Jobs run for a few minutes to hours, days
  • 13. From Batch to Realtime
      • “Time is Money”
      • Bridging the gap between batch and “now”
      • Realtime often means “faster than batch”
      • 80/20 Rule
        ‣ Hadoop solves the 80% easily
        ‣ The remaining 20% is taking 80% of the effort
      • Go as close as possible, don’t overdo it!
  • 14. Stop Gap Solutions
      • In Memory
        ‣ Memcached
        ‣ MemBase
        ‣ GigaSpaces
      • Relational Databases
        ‣ MySQL
        ‣ PostgreSQL
      • NoSQL
        ‣ Cassandra
        ‣ HBase
  • 21. Fold, Store, and Shift
  • 22. Complemental Design #1
      [Diagram: Internet, LAM(M)P, Hadoop, HBase]
      • Keep Backup in HDFS
      • MapReduce over HDFS
      • Synchronize HBase
        ‣ Batch Puts
        ‣ Bulk Import
  • 23. Complemental Design #2
      [Diagram: Internet, LAM(M)P, Flume, Hadoop, HBase]
      • Add Log Support
      • Synchronize HBase
        ‣ Batch Puts
        ‣ Bulk Import
  • 24. Mitigation Planning
      • Reliable storage has top priority
      • Disaster Recovery
      • HBase Backups
        ‣ Export - but what if HBase is “down”?
        ‣ CopyTable - same issue
        ‣ Snapshots - not available
  • 25. Complemental Design #3
      [Diagram: Internet, LAM(M)P, Flume, Log Proc, Hadoop, HBase]
      • Add Log Processing
      • Remove Direct Connection
      • Synchronize HBase
        ‣ Batch Puts
        ‣ Bulk Import
  • 26. Facebook Insights
      • > 20B Events per Day
      • 1M Counter Updates per Second
        ‣ 100 Nodes Cluster
        ‣ 10K OPS per Node
      Web ➜ Scribe ➜ Ptail ➜ Puma ➜ HBase
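The throughput figures on the Facebook Insights slide can be cross-checked with a few lines of arithmetic. This is a minimal sketch that uses only the numbers from the slide; the function names are hypothetical, and 20B events per day is treated as a lower bound because the slide says "> 20B".

```python
SECONDS_PER_DAY = 24 * 60 * 60


def ops_per_node(total_ops_per_second, nodes):
    """Counter updates each node must absorb if load is spread evenly."""
    return total_ops_per_second / nodes


def average_events_per_second(events_per_day):
    """Average event rate over a day; real traffic will have peaks above this."""
    return events_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{ops_per_node(1_000_000, 100):,.0f} counter updates per node per second")
    print(f"{average_events_per_second(20_000_000_000):,.0f} events per second on average")
```

The first figure reproduces the 10K OPS per node stated on the slide. It holds only if the row keys distribute writes evenly across the cluster.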
  • 27. Collection Layer
      • “Like” button triggers AJAX request
      • Event written to log file using Scribe
        ‣ Handles aggregation, delivery, file roll over, etc.
        ‣ Uses HDFS to store files
      ✓ Use Flume or Scribe
  • 28. Filter Layer
      • Ptail “follows” logs written by Scribe
      • Aggregates from multiple logs
      • Separates into event types
        ‣ Sharding for future growth
      • Facebook internal tool
      ✓ Use Flume
  • 29. Batching Layer
      • Puma batches updates
        ‣ 1 sec, staggered
      • Flush batch, when last is done
      • Duration limited by key distribution
      • Facebook internal tool
      ✓ Use Coprocessors (0.92.0)
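The idea behind the batching layer, collecting counter updates for about a second and writing them out as one combined update per key, can be sketched as follows. Puma is a Facebook internal tool and its design is not described on the slide beyond the bullets, so everything here is an assumption for illustration: the class name `CounterBatcher`, the `flush_fn` callback, and the injectable clock are all hypothetical.

```python
import time
from collections import Counter


class CounterBatcher:
    """Accumulate increments in memory and flush them as one batch per interval.

    Hypothetical sketch; not Puma and not an HBase API.
    """

    def __init__(self, flush_fn, interval_s=1.0, clock=time.monotonic):
        self._flush_fn = flush_fn      # receives a dict of key -> summed increment
        self._interval_s = interval_s  # batching window, 1 second as on the slide
        self._clock = clock            # injectable so the window can be tested
        self._pending = Counter()
        self._last_flush = clock()

    def add(self, key, amount=1):
        """Record one increment; flush if the batching window has elapsed."""
        self._pending[key] += amount
        if self._clock() - self._last_flush >= self._interval_s:
            self.flush()

    def flush(self):
        """Hand the combined increments to the sink and start a new window."""
        if self._pending:
            batch = dict(self._pending)
            self._pending.clear()
            self._flush_fn(batch)
        self._last_flush = self._clock()


if __name__ == "__main__":
    batcher = CounterBatcher(print, interval_s=1.0)
    for _ in range(3):
        batcher.add("com.cloudera.www")
    batcher.flush()
```

Folding many increments for the same key into one write reduces the number of operations the store has to handle. The sketch flushes synchronously and omits the staggering mentioned on the slide.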
  • 30. Counters
      • Store counters per Domain and per URL
        ‣ Leverage HBase increment (atomic read-modify-write) feature
      • Each row is one specific Domain or URL
      • The columns are the counters for specific metrics
      • Column families are used to group counters by time range
        ‣ Set time-to-live on CF level to auto-expire counters by age to save space, e.g., 2 weeks on “Daily Counters” family
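The two mechanisms on the Counters slide, atomic increments and a time-to-live per column family, can be modelled in memory to show how they interact. This is a toy stand-in, not the HBase client API: the class `CounterTable`, its methods, and the family names are hypothetical, and the expiry rule (a cell expires once its last write is older than the family's TTL) is a simplification.

```python
import threading

TWO_WEEKS_S = 14 * 24 * 60 * 60


class CounterTable:
    """Toy in-memory counter table with a TTL per column family."""

    def __init__(self, family_ttl_s):
        self._ttl = dict(family_ttl_s)  # family -> TTL in seconds, None = keep forever
        self._cells = {}                # (row, family, qualifier) -> (value, last_write)
        self._lock = threading.Lock()   # makes read-modify-write atomic in this model

    def _expired(self, family, last_write, now):
        ttl = self._ttl[family]
        return ttl is not None and now - last_write > ttl

    def increment(self, row, family, qualifier, amount=1, now=0.0):
        """Atomically add `amount` to a counter and return the new value."""
        key = (row, family, qualifier)
        with self._lock:
            value, last_write = self._cells.get(key, (0, now))
            if self._expired(family, last_write, now):
                value = 0
            value += amount
            self._cells[key] = (value, now)
            return value

    def get(self, row, family, qualifier, now=0.0):
        """Return the counter value, or None if it is absent or expired."""
        with self._lock:
            cell = self._cells.get((row, family, qualifier))
        if cell is None or self._expired(family, cell[1], now):
            return None
        return cell[0]


if __name__ == "__main__":
    table = CounterTable({"daily": TWO_WEEKS_S, "lifetime": None})
    table.increment("com.cloudera.www", "daily", "1/1:total", now=0)
    table.increment("com.cloudera.www", "lifetime", "total", now=0)
    later = TWO_WEEKS_S + 1
    print(table.get("com.cloudera.www", "daily", "1/1:total", now=later))
    print(table.get("com.cloudera.www", "lifetime", "total", now=later))
```

Because each daily column stops receiving increments once its day is over, a two-week TTL on that family removes old days without any explicit cleanup job.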
  • 31. Key Design
      • Reversed Domains, e.g., “com.cloudera.www”, “com.cloudera.blog”
        ‣ Helps keep pages per site close together, as HBase efficiently scans blocks of sorted keys
      • Domain Row Key = MD5(Reversed Domain) + Reversed Domain
        ‣ Leading MD5 hash spreads keys randomly across all regions for load balancing
        ‣ Hashing only the domain keeps rows grouped per site (and per subdomain if needed)
      • URL Row Key = MD5(Reversed Domain) + Reversed Domain + URL ID
        ‣ Unique ID per URL already available, make use of it
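The two row key formulas can be written out directly. This is a minimal sketch: the slide does not specify how the parts are encoded, so the raw 16-byte MD5 digest, UTF-8 for the domain, and an 8-byte big-endian URL ID are assumptions, and the function names are hypothetical.

```python
import hashlib


def reverse_domain(domain):
    """Turn 'www.cloudera.com' into 'com.cloudera.www'."""
    return ".".join(reversed(domain.lower().split(".")))


def domain_row_key(domain):
    """Domain Row Key = MD5(Reversed Domain) + Reversed Domain."""
    reversed_bytes = reverse_domain(domain).encode("utf-8")
    # Assumption: the raw 16-byte digest is used as the prefix.
    return hashlib.md5(reversed_bytes).digest() + reversed_bytes


def url_row_key(domain, url_id):
    """URL Row Key = MD5(Reversed Domain) + Reversed Domain + URL ID."""
    # Assumption: the URL ID is a non-negative integer stored as 8 bytes.
    return domain_row_key(domain) + url_id.to_bytes(8, "big")


if __name__ == "__main__":
    print(reverse_domain("www.cloudera.com"))
    print(domain_row_key("www.cloudera.com").hex())
    print(url_row_key("www.cloudera.com", 42).hex())
```

Since every URL key for a domain starts with that domain's row key, all URLs of one domain sort directly after the domain row and can be read with a single prefix scan, while the hash prefix places different domains at unrelated positions in the key space.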
  • 32. Insights Schema
      • Row Key: Domain Row Key
        ‣ Hourly Counters CF: 6pm Total = 100, 6pm Male = 50, 6pm US = 92, 7pm ... = 45
        ‣ Daily Counters CF: 1/1 Total = 1000, 1/1 Male = 320, 1/1 US = 670, 2/1 ... = 990
        ‣ Lifetime Counters CF: Total = 10000, Male = 6780, Female = 3220, US = 9900
      • Row Key: URL Row Key
        ‣ Hourly Counters CF: 6pm Total = 10, 6pm Male = 5, 6pm US = 9, 7pm ... = 4
        ‣ Daily Counters CF: 1/1 Total = 100, 1/1 Male = 20, 1/1 US = 70, 2/1 ... = 99
        ‣ Lifetime Counters CF: Total = 100, Male = 8, Female = 92, US = 100
  • 33. Summary
      • Design for Use-Case
        ‣ Read, Write, or Both?
      • Avoid Hotspotting
        ‣ Region and Table
      • Manage Automatism at Scale
        ‣ For now!
  • 36. Questions? lars@cloudera.com http://cloudera.com