30 Billion Events a Day with Hadoop




Michael Brown, CTO, comScore, Inc.
May 10th, 2012
comScore is a Global Leader in Measuring the Digital World

                                                  NASDAQ            SCOR
                                                  Clients           1860+ worldwide
                                                  Employees         1000+
                                                  Headquarters      Reston, VA
                                                  Global Coverage   170+ countries under measurement; 43 markets reported
                                                  Local Presence    32 locations in 23 countries




                © comScore, Inc.   Proprietary.             2                                      V1011
Some of our Clients
 Media   Agencies   Telecom/Mobile            Financial   Retail   Travel   CPG   Pharma   Technology




The Trusted Source for Digital Intelligence Across Vertical Markets


       9 out of the top 10      INVESTMENT BANKS
       9 out of the top 10      AUTO INSURERS
       4 out of the top 4       WIRELESS CARRIERS
       11 out of the top 12     INTERNET SERVICE PROVIDERS
       47 out of the top 50     ONLINE PROPERTIES
       14 out of the top 15     PHARMACEUTICAL COMPANIES
       45 out of the top 50     ADVERTISING AGENCIES
       11 out of the top 12     CONSUMER FINANCE COMPANIES
       9 out of the top 10      MAJOR MEDIA COMPANIES
       8 out of the top 10      CPG COMPANIES


Unified Digital Measurement™ (UDM) Establishes Platform for Panel + Census Data Integration

§  PANEL: Global PERSON Measurement
§  CENSUS: Global DEVICE Measurement
§  Unified Digital Measurement (UDM)
  –  Patent-Pending Methodology
  –  Adopted by 90% of Top 100 U.S. Media Properties


Beacon Heat Map




Worldwide Tags per Month

[Chart: Monthly Records Collection. Y axis: # of records, 0 to 1,000,000,000,000. X axis: months, Jul 2009 through May 2012. Series: Panel Records, Beacon Records.]
Our Event Volume in Perspective

       Property           Page Views (MM)
       FACEBOOK.COM       472,814
       Google Sites       302,802
       Yahoo! Sites        90,448
       Total              866,064




Source: comScore MediaMetrix Worldwide April 2012




Growth Slides
[Chart: fitted growth trend line, R² = 0.9335. Y axis: 0 to 1,600,000,000,000.]




The Project:
Census Web Agg




The Problem Statement

§  Calculate the number of events and unique cookies for each key
§  Key takeaways
  –  Input data is sessionized daily
  –  Need to process all data for a month
  –  Need to calculate values for Total Internet and for each site under measurement




Counting Uniques from a Time Ordered Log File



Example log order: A, D, B, C, B, A, A

§  Major Downsides:
  –  Need to keep all key elements in memory.
  –  Constrained to one machine for final aggregation.
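
A minimal Python sketch of counting from a time-ordered log, assuming each record is a (key, cookie) pair. The record layout, the cookie values, and the function name are illustrative and not taken from the slides. Because records for a key can arrive at any point in the log, a set of cookies has to be held for every key until the input ends, which is the memory downside listed above.

```python
from collections import defaultdict


def count_time_ordered(records):
    """Count events and unique cookies per key from an unsorted log."""
    events = defaultdict(int)
    cookies = defaultdict(set)  # every cookie seen for every key stays in memory
    for key, cookie in records:
        events[key] += 1
        cookies[key].add(cookie)
    return {key: (events[key], len(cookies[key])) for key in events}


# Hypothetical log in arrival order, mirroring the A, D, B, C, B, A, A example.
log = [("A", "c1"), ("D", "c2"), ("B", "c1"), ("C", "c3"),
       ("B", "c1"), ("A", "c2"), ("A", "c1")]
result = count_time_ordered(log)
```

Each value in `result` is an (events, unique cookies) pair for one key.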


Counting Uniques from a Key Ordered Log File



Example log order: A, A, A, B, B, C, D

§  Major Downsides:
  –  Need to sort data in advance.
  –  The sort time increases as volume grows.
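
A minimal Python sketch of the key-ordered variant, assuming records are (key, cookie) pairs sorted by key and then by cookie (the secondary sort on cookie is an assumption, as are the names and sample data). With sorted input, only the previous cookie has to be remembered, so memory stays constant; the cost moves into the up-front sort.

```python
from itertools import groupby
from operator import itemgetter


def count_key_ordered(sorted_records):
    """Yield (key, events, unique_cookies) from records sorted by (key, cookie)."""
    for key, group in groupby(sorted_records, key=itemgetter(0)):
        events = 0
        uniques = 0
        previous = object()  # sentinel that never equals a real cookie
        for _, cookie in group:
            events += 1
            if cookie != previous:  # in sorted input, equal cookies are adjacent
                uniques += 1
                previous = cookie
        yield key, events, uniques


# Hypothetical log; sorted() stands in for the advance sort the slide mentions.
log = [("A", "c1"), ("D", "c2"), ("B", "c1"), ("C", "c3"),
       ("B", "c1"), ("A", "c2"), ("A", "c1")]
result = list(count_key_ordered(sorted(log)))
```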


Scaling Issue

§  As our volume has grown we have the following stats:
  –  Over 900 billion events per month
  –  Over 150 billion sessions per month
  –  Over 5,000 reportable sites
  –  Over 50 countries
  –  We see 15 billion distinct cookies in a month
  –  5 sites have over 1 billion cookies in a month
  –  The sum of all distinct cookies is 377 billion
  –  We only need to output 15 million rows
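
The figures above can be related to each other with some quick arithmetic. This is a rough sketch: the 30-day month is an assumption, and reading "the sum of all distinct cookies" as the per-key distinct counts added together is an interpretation, not something the slide states.

```python
# Figures quoted on the slide (each is a lower bound, "over ...").
EVENTS_PER_MONTH = 900_000_000_000
SESSIONS_PER_MONTH = 150_000_000_000
DISTINCT_COOKIES = 15_000_000_000
SUM_OF_DISTINCT_COOKIES = 377_000_000_000
OUTPUT_ROWS = 15_000_000

DAYS_PER_MONTH = 30  # assumption, for a round daily figure

events_per_day = EVENTS_PER_MONTH // DAYS_PER_MONTH            # 30 billion
events_per_session = EVENTS_PER_MONTH // SESSIONS_PER_MONTH    # 6
# Interpretation: on average, how many keys each distinct cookie is counted under.
keys_per_cookie = SUM_OF_DISTINCT_COOKIES / DISTINCT_COOKIES   # about 25
sessions_per_output_row = SESSIONS_PER_MONTH // OUTPUT_ROWS    # 10,000

print(f"{events_per_day:,} events/day")
print(f"{events_per_session} events/session")
print(f"{keys_per_cookie:.1f} keys per cookie")
print(f"{sessions_per_output_row:,} session rows per output row")
```

Under these assumptions the monthly volume works out to the 30 billion events a day in the talk title, and the job reduces roughly ten thousand input rows to each output row.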




Counting Uniques from a Key Ordered Log File




Windows v1 (Single Server)

§  Time to process data for first few months
       Month        Wall Time (hours)
       Jul 2009      8
       Aug 2009     10
       Sep 2009     11
       Oct 2009     16
       Nov 2009     37




§  V1 processed sessions at roughly 250K rows/sec


§  Problems with this version:
  –  Slow
  –  Not Scalable
  –  Dedicated Server
  –  Bottleneck for delivering production


Counting Uniques from Sharded Key Ordered Log Files




Windows v2

§  Features of this version
  –  Distributed (32 servers)
  –  Multithreaded
  –  Data Localization
  –  Very low network data transfer
  –  Handling the data growth

§  The V2 code processed data at over 8 million rows/sec
  –  1 hour for Dec 2009; 5 hours for April 2012

§  Issues
  –  Data is distributed by ID into 64 parts
  –  Skew in the distribution key is possible, which hurts performance and causes high disk usage on a node
  –  All data replication is manual, as is recovery
  –  Results cannot be calculated if any node is down
  –  Adding new servers or changing the number of parts takes a lot of effort
  –  Overhead to maintain framework to run distributed jobs
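
A minimal Python sketch of distributing data by ID into 64 parts, assuming the ID is the cookie ID and using `zlib.crc32` as a stable hash. Both are assumptions, and all function names are illustrative; the slides do not say how V2 assigned IDs to parts. Since each cookie lands in exactly one part, per-part unique counts can simply be added. The `skew` helper shows one way to measure the imbalance the issues list warns about.

```python
import zlib
from collections import Counter

NUM_PARTS = 64  # from the slide: data is distributed by ID into 64 parts


def part_for(cookie_id):
    """Map an ID to a part with a stable hash (crc32 is an assumption)."""
    return zlib.crc32(cookie_id.encode("utf-8")) % NUM_PARTS


def unique_cookies_sharded(records):
    """records: iterable of (site, cookie_id). Returns {site: unique cookies}."""
    shards = [set() for _ in range(NUM_PARTS)]
    for site, cookie_id in records:
        shards[part_for(cookie_id)].add((site, cookie_id))
    totals = Counter()
    for shard in shards:  # each shard could be processed on its own server
        for site, _ in shard:
            totals[site] += 1  # a cookie lives in exactly one shard, so counts add
    return dict(totals)


def skew(records):
    """Largest part's row count divided by the mean; 1.0 is perfectly even."""
    sizes = Counter(part_for(cookie_id) for _, cookie_id in records)
    mean = sum(sizes.values()) / NUM_PARTS
    return max(sizes.values()) / mean


# Hypothetical data: 1000 distinct cookies on one site plus two repeat visits.
records = [("a.com", f"c{i}") for i in range(1000)]
records += [("a.com", "c1"), ("b.com", "c1")]
uniques = unique_cookies_sharded(records)
imbalance = skew(records)
```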




Enter the Elephant

§  Why Hadoop?
 –  Scalable
 –  Low risk of data loss, thanks to replication
 –  Run on a shared production cluster
 –  No overhead to maintain framework
 –  Easy job submission and management




Basic Approach

§  Leverage Pig for POC
  –  Pig Latin is easy for developers and data analysts to learn
  –  Rapid application development vs. M/R applications (e.g., 1 line of Pig Latin = 20 lines of Java Map/Reduce)
  –  Extensible via UDFs




Performance of Basic Approach on Various Samples

[Chart: Aggregation Performance. Y axis: Time (minutes), 0 to 80. X axis: Input data size, at 372 GB (3%), 744 GB (6%), and 1116 GB (9%).]

Note: Target data size is over 10 TB
M/R Data Flow


       Input:      B C   |   A B C A
       Map:        Mapper   |   Mapper   |   Mapper
       Shuffle:    A A   |   B B   |   C C
       Reduce:     A     |   B     |   C
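
The flow in this diagram can be sketched as a small in-memory simulation. This is plain Python standing in for Hadoop, not Hadoop API code, and the splits, cookie values, and function names are hypothetical. Note that every input record crosses the shuffle before any reducer can count uniques.

```python
from collections import defaultdict


def map_phase(split):
    """Emit (site, cookie) for every record in one input split."""
    for site, cookie in split:
        yield site, cookie


def shuffle(mapped):
    """Group all mapper output by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Return (key, events, unique cookies) for one key."""
    return key, len(values), len(set(values))


# Hypothetical input splits; keys arrive in no particular order.
splits = [
    [("B", "c1"), ("C", "c2")],
    [("A", "c1"), ("B", "c3")],
    [("C", "c2"), ("A", "c4")],
]
mapped = [pair for split in splits for pair in map_phase(split)]
shuffled = shuffle(mapped)
results = sorted(reduce_phase(key, values) for key, values in shuffled.items())
shuffled_records = sum(len(values) for values in shuffled.values())
```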




Basic Approach Retrospective

§  Processing speed is not scaling to our needs on a sample of the input data
§  Diagnosis
  –  Most aggregations could not take significant advantage of combiners. Not a Pig issue.
  –  Large shuffles caused poor job performance. In some cases large aggregations ran slower on the Hadoop cluster than on the current architecture


§  Conclusion
  –  A new approach is required to reduce the shuffle
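
One way to see why combiners help little for unique counts (an illustration with made-up cookie values, not something shown in the slides): event counts from different mappers can be added, but distinct counts cannot, because the same cookie may appear in more than one mapper's input.

```python
def distinct(cookies):
    """Number of distinct cookies in a list."""
    return len(set(cookies))


# Hypothetical cookies for one site, as seen by two different mappers.
mapper_1 = ["c1", "c2", "c3"]
mapper_2 = ["c2", "c3", "c4"]

# Event counts are additive, so a combiner can pre-sum them.
event_total = len(mapper_1) + len(mapper_2)

# Adding per-mapper distinct counts double-counts c2 and c3.
naive_uniques = distinct(mapper_1) + distinct(mapper_2)

# The correct answer needs the cookies themselves, so they must be shuffled.
true_uniques = distinct(mapper_1 + mapper_2)
```

Here the naive sum gives 6 while the true count is 4, so a combiner can at best deduplicate within one mapper's output and still has to pass the cookie values on.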




Solution to reduce the shuffle

§  The Problem:
  –  Most aggregations within comScore cannot take advantage of combiners, leading to large shuffles and job performance issues

§  The Idea:
  –  Partition and sort data on a daily basis
  –  Create a custom input format to merge daily partitions for monthly aggregations
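
A minimal Python sketch of the idea, under stated assumptions: records are (site, cookie) pairs, the partition is chosen by a hash of the cookie, and each partition is sorted by (site, cookie). The slides do not give the partition key, the sort key, or the partition count, so all three are guesses, and `heapq.merge` merely stands in for what the custom Hadoop input format would do.

```python
import heapq
import zlib

NUM_PARTITIONS = 4  # hypothetical; the slides do not state the partition count


def partition_of(cookie):
    """Pick a partition from a stable hash of the cookie (an assumption)."""
    return zlib.crc32(cookie.encode("utf-8")) % NUM_PARTITIONS


def partition_and_sort_day(records):
    """Split one day's (site, cookie) records by cookie hash and sort each part."""
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for site, cookie in records:
        parts[partition_of(cookie)].append((site, cookie))
    return [sorted(part) for part in parts]


def merge_month(daily_parts, partition):
    """Merge the same partition from every day into one sorted stream."""
    return heapq.merge(*(day[partition] for day in daily_parts))


# Two hypothetical days of data.
day1 = partition_and_sort_day([("b.com", "c2"), ("a.com", "c1"), ("a.com", "c3")])
day2 = partition_and_sort_day([("a.com", "c1"), ("b.com", "c4"), ("a.com", "c2")])
merged = [list(merge_month([day1, day2], p)) for p in range(NUM_PARTITIONS)]
```

Because the daily work is done once per day, the monthly job only pays for a linear merge of already-sorted partitions rather than a full sort.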




Custom Input Format with Map Side Aggregation


       Input:      B C   |   A B C A
       Map:        A Mapper   |   B Mapper   |   C Mapper
       Combine:    A          |   B          |   C
       Reduce:     A          |   B          |   C
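
A sketch of map-side aggregation in plain Python, assuming each mapper receives one partition as a stream of (site, cookie) pairs sorted by site and cookie, and that no cookie appears in more than one partition. These are assumptions about the design, and the names and data are hypothetical. Under them, each mapper can emit one small partial per site, and the reducer only has to add partials.

```python
from collections import Counter
from itertools import groupby


def map_side_aggregate(sorted_stream):
    """Collapse one partition's sorted (site, cookie) stream to per-site partials."""
    partials = []
    for site, group in groupby(sorted_stream, key=lambda record: record[0]):
        events = 0
        uniques = 0
        previous = object()  # sentinel that never equals a real cookie
        for _, cookie in group:
            events += 1
            if cookie != previous:
                uniques += 1
                previous = cookie
        partials.append((site, events, uniques))
    return partials


def reduce_partials(all_partials):
    """Sum per-partition partials; valid because partitions share no cookies."""
    events, uniques = Counter(), Counter()
    for site, event_count, unique_count in all_partials:
        events[site] += event_count
        uniques[site] += unique_count
    return {site: (events[site], uniques[site]) for site in events}


# Two hypothetical partitions with disjoint cookies.
partition_0 = [("a.com", "c1"), ("a.com", "c1"), ("b.com", "c1")]
partition_1 = [("a.com", "c2"), ("a.com", "c3"), ("b.com", "c2"), ("b.com", "c2")]
partials = map_side_aggregate(partition_0) + map_side_aggregate(partition_1)
result = reduce_partials(partials)
```

In this toy run, seven input records shrink to four partials before anything would need to be shuffled.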

Performance of v2 on Various Samples

[Chart: Aggregation Performance. Y axis: Time (minutes), 0 to 120. X axis: Input data size, at 372 GB (3%), 744 GB (6%), 1116 GB (9%), and 10304 GB (100%). Series: Pig, Custom Input Format.]



Partitioning Summary

§  Benefits:
  –  A large portion of the aggregation can be completed in the map phase
  –  Applications can now take advantage of combiners
  –  Shuffle sizes are minimal

§  Risks:
  –  Data locality loss
  –  Map failures might result in long run times, depending on the size of the partitions




Full Sample Performance

§  Full set of data analysis
  –  10 TB of input data
  –  150 billion session rows


§  Total Time
  –  1 hour, 45 minutes
  –  Over 23,000,000 rows/sec
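
The throughput figure follows directly from the row count and wall time. The sketch below also compares it with the rates quoted earlier for Windows V1 and V2. That comparison is only rough, since the earlier rates are approximate ("roughly", "over") and were measured on other months' data.

```python
ROWS = 150_000_000_000            # 150 billion session rows
WALL_SECONDS = 1 * 3600 + 45 * 60  # 1 hour, 45 minutes

rows_per_sec = ROWS / WALL_SECONDS

V1_ROWS_PER_SEC = 250_000     # Windows V1: "roughly 250K rows/sec"
V2_ROWS_PER_SEC = 8_000_000   # Windows V2: "over 8 million rows/sec"

speedup_vs_v1 = rows_per_sec / V1_ROWS_PER_SEC
speedup_vs_v2 = rows_per_sec / V2_ROWS_PER_SEC

print(f"{rows_per_sec:,.0f} rows/sec")
print(f"about {speedup_vs_v1:.0f}x V1, about {speedup_vs_v2:.1f}x V2")
```

The result is about 23.8 million rows per second, consistent with the "over 23,000,000 rows/sec" on the slide.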




Future Ideas

§  HBase
  –  Unique cookie calculations are free as data is more organized
  –  How will data loading fare?


§  Data Locality
  –  Ideally, we could give Hadoop additional hints about where to store the data
  –  Not sure whether this capability will be included in Hadoop


§  Connection to an MPP DB
  –  We also use Greenplum DB; we could connect to each sharded instance




Hadoop Cluster

§  Production Hadoop Cluster
  –  80 nodes: Mix of Dell R710 and R510
  –  Each R510 has 12 x 2 TB drives, 64 GB RAM, and 24 cores
  –  1768 total CPUs
  –  4.7TB total memory
  –  1200TB total disk space
  –  Our distro is MapR M5 1.2.7




Useful Factoids
  Colorful, bite-sized graphical representations of the best discoveries we unearth.




    Visit www.comscoredatamine.com or follow @datagems for the latest gems.


Thank You!


 Michael Brown
 CTO
 comScore, Inc.


 mbrown@comscore.com




             © comScore, Inc.   Proprietary.   32

Mais conteúdo relacionado

Mais procurados

The Rise and Rise of Mobile: a Guardian Case Study
The Rise and Rise of Mobile: a Guardian Case StudyThe Rise and Rise of Mobile: a Guardian Case Study
The Rise and Rise of Mobile: a Guardian Case StudyWeb Managers Group
 
Pultry industry in north america
Pultry industry in north americaPultry industry in north america
Pultry industry in north americaUsapeec
 
Marketing Sustainability to Businesses: Strategies & Tactics for Influencing ...
Marketing Sustainability to Businesses: Strategies & Tactics for Influencing ...Marketing Sustainability to Businesses: Strategies & Tactics for Influencing ...
Marketing Sustainability to Businesses: Strategies & Tactics for Influencing ...Mike Walker
 
Cyrela - Company Presentation - 3th Quarter 2008
Cyrela - Company Presentation - 3th Quarter 2008Cyrela - Company Presentation - 3th Quarter 2008
Cyrela - Company Presentation - 3th Quarter 2008Cyrela
 
10 years of open access at BioMed Central
10 years of open access at BioMed Central10 years of open access at BioMed Central
10 years of open access at BioMed CentralBioMedCentral
 
Commentary: Hunger Reduction with Agricultural R&D and Policy Change
Commentary: Hunger Reduction with  Agricultural R&D and Policy  ChangeCommentary: Hunger Reduction with  Agricultural R&D and Policy  Change
Commentary: Hunger Reduction with Agricultural R&D and Policy ChangeJoachim von Braun
 

Mais procurados (7)

The Rise and Rise of Mobile: a Guardian Case Study
The Rise and Rise of Mobile: a Guardian Case StudyThe Rise and Rise of Mobile: a Guardian Case Study
The Rise and Rise of Mobile: a Guardian Case Study
 
Mba applications report
Mba applications reportMba applications report
Mba applications report
 
Pultry industry in north america
Pultry industry in north americaPultry industry in north america
Pultry industry in north america
 
Marketing Sustainability to Businesses: Strategies & Tactics for Influencing ...
Marketing Sustainability to Businesses: Strategies & Tactics for Influencing ...Marketing Sustainability to Businesses: Strategies & Tactics for Influencing ...
Marketing Sustainability to Businesses: Strategies & Tactics for Influencing ...
 
Cyrela - Company Presentation - 3th Quarter 2008
Cyrela - Company Presentation - 3th Quarter 2008Cyrela - Company Presentation - 3th Quarter 2008
Cyrela - Company Presentation - 3th Quarter 2008
 
10 years of open access at BioMed Central
10 years of open access at BioMed Central10 years of open access at BioMed Central
10 years of open access at BioMed Central
 
Commentary: Hunger Reduction with Agricultural R&D and Policy Change
Commentary: Hunger Reduction with  Agricultural R&D and Policy  ChangeCommentary: Hunger Reduction with  Agricultural R&D and Policy  Change
Commentary: Hunger Reduction with Agricultural R&D and Policy Change
 

Semelhante a 30B events a day with hadoop

Semelhante a 30B events a day with hadoop (7)

NWA Collection
NWA CollectionNWA Collection
NWA Collection
 
Consumer Snapshot January 2013
Consumer Snapshot January 2013Consumer Snapshot January 2013
Consumer Snapshot January 2013
 
Amárach Economic Recovery Index February 2013
Amárach Economic Recovery Index February 2013Amárach Economic Recovery Index February 2013
Amárach Economic Recovery Index February 2013
 
Amárach Economic Recovery Index March 2013
Amárach Economic Recovery Index March 2013Amárach Economic Recovery Index March 2013
Amárach Economic Recovery Index March 2013
 
Pp slides
Pp slidesPp slides
Pp slides
 
Office property market overivew 3Q 2011-India
Office property market overivew  3Q 2011-IndiaOffice property market overivew  3Q 2011-India
Office property market overivew 3Q 2011-India
 
Pink pantehrs
Pink pantehrsPink pantehrs
Pink pantehrs
 

Mais de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mais de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber

30B Events a Day with Hadoop

  • 1. 30 Billion Events a Day with Hadoop. Michael Brown, CTO, comScore, Inc. May 10th, 2012.
  • 2. comScore is a Global Leader in Measuring the Digital World
      – NASDAQ: SCOR
      – Clients: 1860+ worldwide
      – Employees: 1000+
      – Headquarters: Reston, VA
      – Global Coverage: 170+ countries under measurement; 43 markets reported
      – Local Presence: 32 locations in 23 countries
      © comScore, Inc. Proprietary. V1011
  • 3. Some of our Clients: Media, Agencies, Telecom/Mobile, Financial, Retail, Travel, CPG, Pharma, Technology
  • 4. The Trusted Source for Digital Intelligence Across Vertical Markets
      – 9 out of the top 10 investment banks
      – 9 out of the top 10 auto insurers
      – 4 out of the top 4 wireless carriers
      – 11 out of the top 12 Internet service providers
      – 47 out of the top 50 online properties
      – 14 out of the top 15 pharmaceutical companies
      – 45 out of the top 50 advertising agencies
      – 11 out of the top 12 consumer finance companies
      – 9 out of the top 10 major media companies
      – 8 out of the top 10 CPG companies
  • 5. Unified Digital Measurement™ (UDM) Establishes Platform for Panel + Census Data Integration
      – PANEL: global PERSON measurement
      – CENSUS: global DEVICE measurement
      – Unified Digital Measurement (UDM): patent-pending methodology, adopted by 90% of Top 100 U.S. media properties
  • 6. Beacon Heat Map
  • 7. Worldwide Tags per Month. [Chart: Monthly Records Collection; y-axis: # of records, 0 to 1,000,000,000,000; x-axis: months, 2009 to 2012; series: Panel Records, Beacon Records]
  • 8. Our Event Volume in Perspective
      Property        Page Views (MM)
      FACEBOOK.COM    472,814
      Google Sites    302,802
      Yahoo! Sites    90,448
      Total           866,064
      Source: comScore MediaMetrix Worldwide, April 2012
  • 9. Growth Slides. [Chart: growth trend line, R² = 0.9335; y-axis: 0 to 1,600,000,000,000]
  • 10. The Project: Census Web Agg
  • 11. The Problem Statement
      – Calculate the number of events and unique cookies for each key
      – Key takeaways
          – Data on input will be sessionized daily
          – Need to process all data for a month
          – Need to calculate values for Total Internet and for each site under measurement
  • 12. Counting Uniques from a Time Ordered Log File
      [Diagram: log entries in time order: A, D, B, C, B, A, A]
      – Major downsides
          – Need to keep all key elements in memory
          – Constrained to one machine for final aggregation
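
The downside named on slide 12 can be seen in a minimal Python sketch. It assumes each log entry is a (key, cookie) pair; the cookie IDs and the function name `count_time_ordered` are made up for illustration and do not come from the talk.

```python
from collections import defaultdict


def count_time_ordered(records):
    """Count events and unique cookies per key from an unsorted stream.

    Because a cookie can reappear at any point in a time-ordered log,
    the set of cookies already seen for every key must stay in memory.
    """
    events = defaultdict(int)
    cookies = defaultdict(set)  # grows with the number of distinct (key, cookie) pairs
    for key, cookie in records:
        events[key] += 1
        cookies[key].add(cookie)
    return {key: (events[key], len(cookies[key])) for key in events}


# Keys follow the order shown on the slide (A, D, B, C, B, A, A);
# the cookie IDs are hypothetical.
log = [("A", "c1"), ("D", "c2"), ("B", "c1"), ("C", "c3"),
       ("B", "c4"), ("A", "c1"), ("A", "c5")]
result = count_time_ordered(log)
```

Memory use here is proportional to the number of distinct (key, cookie) pairs, which is why this approach ties the final aggregation to what one machine can hold.
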
  • 13. Counting Uniques from a Key Ordered Log File
      [Diagram: log entries in key order: A, A, A, B, B, C, D]
      – Major downsides
          – Need to sort data in advance
          – The sort time increases as volume grows
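
As a rough illustration of slide 13, the sketch below counts events and uniques in one streaming pass over input that is already sorted by (key, cookie). The record layout, the cookie IDs, and the name `count_key_ordered` are assumptions for the example.

```python
from itertools import groupby
from operator import itemgetter


def count_key_ordered(sorted_records):
    """Yield (key, events, uniques) from records sorted by (key, cookie).

    Only the previous cookie is remembered, so memory use stays constant
    no matter how many distinct cookies a key has.
    """
    for key, group in groupby(sorted_records, key=itemgetter(0)):
        events = 0
        uniques = 0
        previous = object()  # sentinel that never equals a real cookie
        for _, cookie in group:
            events += 1
            if cookie != previous:
                uniques += 1
                previous = cookie
        yield key, events, uniques


# The sort is the up-front cost the slide warns about.
log = [("A", "c1"), ("D", "c2"), ("B", "c1"), ("C", "c3"),
       ("B", "c4"), ("A", "c1"), ("A", "c5")]
counts = list(count_key_ordered(sorted(log)))
```

The trade-off is the one stated on the slide: aggregation becomes cheap, but the sort has to happen first and grows with volume.
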
  • 14. Scaling Issue
      – As our volume has grown we have the following stats:
          – Over 900 billion events per month
          – Over 150 billion sessions per month
          – Over 5,000 reportable sites
          – Over 50 countries
          – We see 15 billion distinct cookies in a month
          – 5 sites have over 1 billion cookies in a month
          – The sum of all distinct cookies is 377 billion
          – We only need to output 15 million rows
  • 15. Counting Uniques from a Key Ordered Log File
  • 16. Windows v1 (Single Server)
      – Time to process data for first few months
          Month       Wall Time (hours)
          Jul 2009    8
          Aug 2009    10
          Sep 2009    11
          Oct 2009    16
          Nov 2009    37
      – V1 processed sessions at roughly 250K rows/sec
      – Problems with this version:
          – Slow
          – Not scalable
          – Dedicated server
          – Bottleneck for delivering production
  • 17. Counting Uniques from Sharded Key Ordered Log Files
  • 18. Windows v2
      – Features of this version
          – Distributed (32 servers)
          – Multithreaded
          – Data localization
          – Very low network data transfer
          – Handles the data growth
      – The V2 code processed data at over 8 million rows/sec
          – 1 hour for Dec 2009; 5 hours for April 2012
      – Issues
          – Data is distributed by ID into 64 parts
          – Possible skew in the distribution key, which hurts performance and causes high disk usage on a node
          – All data replication is manual, along with recovery
          – Results cannot be calculated if any node is down
          – Adding new servers or changing the number of parts takes a great deal of effort
          – Overhead to maintain a framework to run distributed jobs
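
The sharding idea on slides 17 and 18 can be sketched as follows. The talk says only that data is distributed by ID into 64 parts; this example assumes the ID is the cookie ID and uses a CRC32 hash to pick the part, both of which are illustrative choices, as are all function names.

```python
import zlib
from collections import defaultdict

NUM_PARTS = 64  # "64 parts" from the slide


def part_for(cookie_id):
    """Stable part number for an ID (CRC32 is an assumed, illustrative hash)."""
    return zlib.crc32(cookie_id.encode("utf-8")) % NUM_PARTS


def shard(records):
    """Route every (key, cookie) record to the part that owns its cookie."""
    parts = defaultdict(list)
    for key, cookie in records:
        parts[part_for(cookie)].append((key, cookie))
    return parts


def uniques_in_part(records):
    """Distinct cookies per key within one part."""
    seen = defaultdict(set)
    for key, cookie in records:
        seen[key].add(cookie)
    return {key: len(cookies) for key, cookies in seen.items()}


def merge_counts(per_part_counts):
    """A cookie lives in exactly one part, so per-part uniques simply add up."""
    totals = defaultdict(int)
    for counts in per_part_counts:
        for key, n in counts.items():
            totals[key] += n
    return dict(totals)


# Hypothetical data: 3 sites, each visited by the same 10 cookies.
records = [(f"site{i % 3}", f"cookie{i % 10}") for i in range(100)]
parts = shard(records)
totals = merge_counts(uniques_in_part(p) for p in parts.values())
```

Because each part can be counted on a different server, only the small per-part totals cross the network. The skew issue listed on the slide shows up when a few IDs account for a large share of the records, which overloads the parts that own them.
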
  • 19. Enter the Elephant
      – Why Hadoop?
          – Scalable
          – Low risk of losing data, due to replication
          – Runs on a shared production cluster
          – No overhead to maintain a framework
          – Easy job submission and management
  • 20. Basic Approach
      – Leverage Pig for POC
          – Pig Latin is easy for developers and data analysts to learn
          – Rapid application development vs. M/R applications (i.e. 1 line of Pig Latin = 20 lines in Java Map/Reduce)
          – Extendable via UDFs
  • 21. Performance of Basic Approach on Various Samples. [Chart: Aggregation Performance; y-axis: Time (minutes), 0 to 80; x-axis: Input data size: 372 GB (3%), 744 GB (6%), 1116 GB (9%)] Note: Target data size is over 10 TB
  • 22. M/R Data Flow. [Diagram: unsorted input records (B, C, A, B, C, A) are read by three mappers; the shuffle groups the map output by key (A A, B B, C C) onto three reducers, which output A, B, and C]
  • 23. Basic Approach Retrospective
      – Processing speed is not scaling to our needs on a sample of the input data
      – Diagnosis
          – Most aggregations could not take significant advantage of combiners. Not a Pig issue.
          – Large shuffles caused poor job performance. In some cases large aggregations ran slower on the Hadoop cluster compared to the current architecture
      – Diagnosis
          – A new approach is required to reduce the shuffle
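
One way to read the diagnosis on slide 23: a combiner can only remove duplicates that land in the same map split. The toy Python model below is not Hadoop code; the split layouts, cookie IDs, and the helper `records_shuffled` are assumptions chosen to make the effect visible.

```python
def records_shuffled(splits):
    """Records leaving the map side when a combiner dedups (site, cookie) pairs per split."""
    return sum(len(set(split)) for split in splits)


cookies = ["c1", "c2", "c3", "c4"]
days = range(3)

# Time-ordered input: one split per day, every cookie visits site "A" each day.
by_day = [[("A", c) for c in cookies] for _ in days]

# Input partitioned by cookie: two splits, each holding all days for its cookies.
by_cookie = [
    [("A", c) for _ in days for c in cookies[:2]],
    [("A", c) for _ in days for c in cookies[2:]],
]

time_ordered_shuffle = records_shuffled(by_day)
partitioned_shuffle = records_shuffled(by_cookie)
```

In this toy case the time-ordered layout shuffles every record, while the partitioned layout shuffles one record per distinct cookie.
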
  • 24. Solution to reduce the shuffle
      – The Problem:
          – Most aggregations within comScore cannot take advantage of combiners, leading to large shuffles and job performance issues
      – The Idea:
          – Partition and sort data on a daily basis
          – Create a custom input format to merge daily partitions for monthly aggregations
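
The idea on slide 24 can be sketched in Python as a k-way merge over daily partitions that are already sorted. The talk does not describe the internals of the custom input format, which in Hadoop would be Java; this sketch only stands in for the merge step, and the record layout and the name `monthly_counts` are assumptions.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def monthly_counts(daily_partitions):
    """Yield (key, events, uniques) for the month from sorted daily partitions.

    heapq.merge lazily combines the pre-sorted inputs, so no monthly
    re-sort is needed and only one record per day is held at a time.
    """
    merged = heapq.merge(*daily_partitions)
    for key, group in groupby(merged, key=itemgetter(0)):
        events = 0
        uniques = 0
        previous = object()  # sentinel that never equals a real cookie
        for _, cookie in group:
            events += 1
            if cookie != previous:
                uniques += 1
                previous = cookie
        yield key, events, uniques


# Hypothetical daily partitions for the same key range, each sorted by (key, cookie).
day1 = [("A", "c1"), ("A", "c2"), ("B", "c1")]
day2 = [("A", "c1"), ("B", "c3"), ("C", "c2")]
day3 = [("A", "c2"), ("C", "c2")]
month = list(monthly_counts([day1, day2, day3]))
```

Since the sort cost is paid once per day, the monthly job only merges, which is what lets most of the aggregation happen on the map side.
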
  • 25. Custom Input Format with Map Side Aggregation. [Diagram: input records (B, C, A, B, C, A) are partitioned by key so that each of three mappers reads a single key (A, B, or C); a combiner follows each mapper, and three reducers output A, B, and C]
  • 26. Performance of v2 on Various Samples. [Chart: Aggregation Performance; y-axis: Time (minutes), 0 to 120; x-axis: Input data size: 372 GB (3%), 744 GB (6%), 1116 GB (9%), 10304 GB (100%); series: Pig, Custom Input Format]
  • 27. Partitioning Summary
      – Benefits:
          – A large portion of the aggregation can be completed in the map phase
          – Applications can now take advantage of combiners
          – Shuffle sizes are minimal
      – Risks:
          – Data locality loss
          – Map failures might result in long run times. This is dependent on the size of the partitions
  • 28. Full Sample Performance
      – Full set of data analysis
          – 10 TB of input data
          – 150 billion session rows
      – Total Time
          – 1 hour, 45 minutes
          – Over 23,000,000 rows/sec
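
The throughput on slide 28 follows from the row count and wall time, as this short calculation shows. The v1 and v2 comparisons assume the same 150 billion rows as the workload. That is an assumption made here for comparison, based on the "over 150 billion sessions per month" figure on slide 14.

```python
SESSION_ROWS = 150_000_000_000            # slide 28: 150 billion session rows

# Hadoop run: 1 hour, 45 minutes
HADOOP_SECONDS = 1 * 3600 + 45 * 60
hadoop_rate = SESSION_ROWS / HADOOP_SECONDS   # rows per second

# Windows v1 at roughly 250K rows/sec (slide 16), same workload assumed
V1_RATE = 250_000
v1_hours = SESSION_ROWS / V1_RATE / 3600

# Windows v2: 5 hours for April 2012 (slide 18), same workload assumed
V2_SECONDS = 5 * 3600
v2_rate = SESSION_ROWS / V2_SECONDS

print(f"Hadoop: {hadoop_rate:,.0f} rows/sec")
print(f"v1 would need about {v1_hours:,.1f} hours")
print(f"v2: {v2_rate:,.0f} rows/sec")
```

Under these assumptions the Hadoop run works out to about 23.8 million rows/sec. The v2 figure comes to about 8.3 million rows/sec, which agrees with the "over 8 million rows/sec" stated for v2.
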
  • 29. Future Ideas
      – HBase
          – Unique cookie calculations are free as data is more organized
          – How will data loading fare?
      – Data Locality
          – Ideally it would be great to provide additional clues to the storage of the data
          – Not sure if it will be included in Hadoop
      – Connection to an MPP DB
          – We also leverage Greenplum DB; we could connect to each sharded instance
  • 30. Hadoop Cluster
      – Production Hadoop Cluster
          – 80 nodes: mix of Dell R710 and R510
          – Each R510 has 12x2TB drives, 64GB RAM, 24 cores
          – 1768 total CPUs
          – 4.7TB total memory
          – 1200TB total disk space
          – Our distro is MapR M5 1.2.7
  • 31. Useful Factoids: Colorful, bite-sized graphical representations of the best discoveries we unearth. Visit www.comscoredatamine.com or follow @datagems for the latest gems.
  • 32. Thank You! Michael Brown, CTO, comScore, Inc., mbrown@comscore.com