30B Events a Day with Hadoop
DataWorks Summit
1.
30 Billion Events a Day with Hadoop
Michael Brown, CTO, comScore, Inc.
May 10th, 2012
2.
comScore is a Global Leader in Measuring the Digital World
- NASDAQ: SCOR
- Clients: 1860+ worldwide
- Employees: 1000+
- Headquarters: Reston, VA
- Global Coverage: 170+ countries under measurement; 43 markets reported
- Local Presence: 32 locations in 23 countries
© comScore, Inc. Proprietary.
3.
Some of our Clients
Media, Agencies, Telecom/Mobile, Financial, Retail, Travel, CPG, Pharma, Technology
4.
The Trusted Source for Digital Intelligence Across Vertical Markets
- 9 out of the top 10 investment banks
- 9 out of the top 10 auto insurers
- 4 out of the top 4 wireless carriers
- 11 out of the top 12 Internet service providers
- 47 out of the top 50 online properties
- 14 out of the top 15 pharmaceutical companies
- 45 out of the top 50 advertising agencies
- 11 out of the top 12 consumer finance companies
- 9 out of the top 10 major media companies
- 8 out of the top 10 CPG companies
5.
Unified Digital Measurement™ (UDM) Establishes Platform For Panel + Census Data Integration
- PANEL: Global PERSON Measurement
- CENSUS: Global DEVICE Measurement
- Unified Digital Measurement (UDM): Patent-Pending Methodology
- Adopted by 90% of Top 100 U.S. Media Properties
6.
Beacon Heat Map
[Figure: beacon heat map]
7.
Worldwide Tags per Month
[Figure: "Monthly Records Collection" chart. Y-axis: # of records, 0 to 1,000,000,000,000. X-axis: month, starting July 2009 and running into 2012. Series: Panel Records, Beacon Records.]
8.
Our Event Volume in Perspective
    Property        Page Views (MM)
    FACEBOOK.COM    472,814
    Google Sites    302,802
    Yahoo! Sites     90,448
    Total           866,064
Source: comScore MediaMetrix Worldwide, April 2012
9.
Growth Slides
[Figure: growth chart with a fitted trend line, R² = 0.9335. Y-axis: 0 to 1,600,000,000,000.]
10.
The Project: Census Web Agg
11.
The Problem Statement
- Calculate the number of events and unique cookies for each key
- Key takeaways
  - Data on input will be sessionized daily
  - Need to process all data for a month
  - Need to calculate values for Total Internet and for each site under measurement
12.
Counting Uniques from a Time Ordered Log File
[Diagram: log records arrive in the order A, D, B, C, B, A, A]
Major Downsides:
- Need to keep all key elements in memory.
- Constrained to one machine for final aggregation.
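A minimal sketch of the time-ordered case, assuming each log record is a (key, cookie) pair. The function name and the cookie values are hypothetical; only the key order A, D, B, C, B, A, A comes from the slide. It illustrates the first downside: every distinct cookie for every key has to be held in memory until the end of the input.

```python
from collections import defaultdict


def count_uniques_unordered(events):
    """Count events and distinct cookies per key from records in arbitrary order."""
    event_counts = defaultdict(int)
    cookies_seen = defaultdict(set)  # grows with every distinct (key, cookie) pair
    for key, cookie in events:
        event_counts[key] += 1
        cookies_seen[key].add(cookie)
    return {key: (event_counts[key], len(cookies_seen[key])) for key in event_counts}


# Keys follow the slide's order; the cookie IDs are made up for illustration.
log = [("A", "c1"), ("D", "c2"), ("B", "c1"), ("C", "c3"),
       ("B", "c1"), ("A", "c2"), ("A", "c1")]
result = count_uniques_unordered(log)
print(result)
```

Because the sets only shrink once the whole input has been read, the memory footprint is tied to the number of distinct cookies rather than to the number of output rows.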
13.
Counting Uniques from a Key Ordered Log File
[Diagram: log records arrive in the order A, A, A, B, B, C, D]
Major Downsides:
- Need to sort data in advance.
- The sort time increases as volume grows.
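A minimal sketch of the key-ordered case, assuming records are (key, cookie) pairs sorted by key and then by cookie. The function name and sample data are hypothetical. Once the input is sorted, a distinct cookie can be detected by comparing it with the previous record, so only one key's running counters are held at a time.

```python
from itertools import groupby


def count_uniques_sorted(sorted_events):
    """Stream over (key, cookie) records sorted by key, then cookie."""
    for key, group in groupby(sorted_events, key=lambda record: record[0]):
        events = 0
        uniques = 0
        previous_cookie = object()  # sentinel that never equals a real cookie
        for _, cookie in group:
            events += 1
            if cookie != previous_cookie:
                uniques += 1
                previous_cookie = cookie
        yield key, events, uniques


# Hypothetical sample; sorting here stands in for the "sort in advance" step.
log = [("A", "c1"), ("D", "c2"), ("B", "c1"), ("C", "c3"),
       ("B", "c1"), ("A", "c2"), ("A", "c1")]
result = list(count_uniques_sorted(sorted(log)))
print(result)
```

The memory cost moves into the sort, which is the downside the slide names.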
14.
Scaling Issue
- As our volume has grown we have the following stats:
  - Over 900 billion events per month
  - Over 150 billion sessions per month
  - Over 5,000 reportable sites
  - Over 50 countries
  - We see 15 billion distinct cookies in a month
  - 5 sites have over 1 billion cookies in a month
  - The sum of all distinct cookies is 377 billion
  - We only need to output 15 million rows
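A quick arithmetic check of how these figures relate to the "30 billion events a day" in the title. The 30-day month is an assumption, and the 377 billion figure is read here as the sum over sites of each site's distinct cookies; the slide does not spell out either point.

```python
# Figures quoted on the slide (each is stated as "over ...", so treat as lower bounds).
events_per_month = 900_000_000_000
sessions_per_month = 150_000_000_000
site_cookie_pairs = 377_000_000_000  # assumed meaning: per-site distinct cookies, summed
output_rows = 15_000_000

DAYS_PER_MONTH = 30  # assumption: a 30-day month

events_per_day = events_per_month // DAYS_PER_MONTH
events_per_session = events_per_month / sessions_per_month
pairs_per_output_row = site_cookie_pairs / output_rows

print(f"{events_per_day:,} events/day")
print(f"{events_per_session:.1f} events/session")
print(f"{pairs_per_output_row:,.0f} distinct cookie entries per output row")
```

The last ratio shows why the job is aggregation-heavy: tens of thousands of intermediate distinct-cookie entries collapse into each output row.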
15.
Counting Uniques from a Key Ordered Log File
16.
Windows v1 (Single Server)
- Time to process data for first few months
    Month       Wall Time (hours)
    Jul 2009     8
    Aug 2009    10
    Sep 2009    11
    Oct 2009    16
    Nov 2009    37
- V1 processed sessions at roughly 250K rows/sec
- Problems with this version:
  - Slow
  - Not scalable
  - Dedicated server
  - Bottleneck for delivering production
17.
Counting Uniques from Sharded Key Ordered Log Files
18.
Windows v2
- Features of this version
  - Distributed (32 servers)
  - Multithreaded
  - Data localization
  - Very low network data transfer
  - Handles the data growth
- The V2 code processed data at over 8 million rows/sec
  - 1 hour for Dec 2009; 5 hours for April 2012
- Issues
  - Data is distributed by ID into 64 parts
  - Possible skew in the distribution key, which hurts performance and causes high disk usage on a node
  - All data replication is manual, along with recovery
  - Results cannot be calculated if any node is down
  - Adding new servers or changing the number of parts is a ton of effort
  - Overhead to maintain the framework that runs distributed jobs
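A minimal sketch of distributing records by ID into 64 parts and measuring skew. The slide does not say which ID or which hash function was used, so the CRC32 hash, the function names, and the sample IDs are all assumptions made for illustration.

```python
import zlib
from collections import Counter

NUM_PARTS = 64  # the slide's part count


def part_for(record_id: str) -> int:
    """Map an ID to one of NUM_PARTS parts (CRC32 is an assumed hash choice)."""
    return zlib.crc32(record_id.encode("utf-8")) % NUM_PARTS


def part_sizes(ids):
    """Count how many records land in each part."""
    return Counter(part_for(record_id) for record_id in ids)


def skew(sizes):
    """Ratio of the largest part to the mean part size; 1.0 means perfectly even."""
    mean = sum(sizes.values()) / NUM_PARTS
    return max(sizes.values()) / mean


# Hypothetical data: evenly spread IDs, then the same IDs plus one very hot ID.
uniform_ids = [f"id-{i}" for i in range(64_000)]
hot_ids = uniform_ids + ["hot-id"] * 64_000

print(f"uniform skew: {skew(part_sizes(uniform_ids)):.2f}")
print(f"hot-key skew: {skew(part_sizes(hot_ids)):.2f}")
```

A single hot ID sends all of its records to one part, so one node ends up with far more data and work than the rest, which is the skew issue listed above.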
19.
Enter the Elephant
- Why Hadoop?
  - Scalable
  - Low risk of data loss, thanks to replication
  - Runs on a shared production cluster
  - No overhead to maintain a framework
  - Easy job submission and management
20.
Basic Approach
- Leverage Pig for POC
  - Pig Latin is easy for developers and data analysts to learn
  - Rapid application development vs. M/R applications (i.e. 1 line of Pig Latin = 20 lines in Java Map/Reduce)
  - Extendable via UDFs
21.
Performance of Basic Approach on Various Samples
[Figure: "Aggregation Performance" chart. Y-axis: Time (minutes), 0 to 80. X-axis: Input data size, at 372 GB (3%), 744 GB (6%), and 1116 GB (9%).]
Note: Target data size is over 10 TB
22.
M/R Data Flow
[Diagram: unsorted input records (B, C, A, B, C, A) are read by three mappers; the shuffle groups the records by key (A A, B B, C C); three reducers each emit one result, for A, B, and C.]
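A toy in-process simulation of this flow, not the Hadoop API. The function names, the split contents, and the cookie values are hypothetical. It counts how many records cross the shuffle: without a combiner, every input record does.

```python
from collections import defaultdict


def map_phase(split):
    """Emit (site, cookie) pairs unchanged; a real mapper would parse log lines."""
    for site, cookie in split:
        yield site, cookie


def shuffle(mapped_splits):
    """Group all mapper output by key and count the records that were moved."""
    grouped = defaultdict(list)
    shuffled_records = 0
    for mapped in mapped_splits:
        for key, value in mapped:
            grouped[key].append(value)
            shuffled_records += 1
    return grouped, shuffled_records


def reduce_phase(key, values):
    """Return (key, event count, distinct cookie count)."""
    return key, len(values), len(set(values))


# Two hypothetical input splits, with keys in the slide's B, C, A order.
splits = [
    [("B", "c1"), ("C", "c3"), ("A", "c1")],
    [("B", "c1"), ("C", "c4"), ("A", "c2")],
]
grouped, shuffled = shuffle(map_phase(split) for split in splits)
results = sorted(reduce_phase(key, values) for key, values in grouped.items())
print(results, shuffled)
```

Six records go in and six records are shuffled. Distinct counting cannot be finished on the map side while a key's records are spread across several mappers, so the shuffle grows with the input.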
23.
Basic Approach Retrospective
- Processing speed is not scaling to our needs on a sample of the input data
- Diagnosis
  - Most aggregations could not take significant advantage of combiners. Not a Pig issue.
  - Large shuffles caused poor job performance. In some cases large aggregations ran slower on the Hadoop cluster compared to the current architecture
- Diagnosis
  - A new approach is required to reduce the shuffle
24.
Solution to reduce the shuffle
- The Problem:
  - Most aggregations within comScore cannot take advantage of combiners, leading to large shuffles and job performance issues
- The Idea:
  - Partition and sort data on a daily basis
  - Create a custom input format to merge daily partitions for monthly aggregations
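A minimal sketch of the merge idea in plain Python, not the actual Hadoop input format. It assumes that each day's data for one key partition is a list of (key, cookie) records already sorted by key and then cookie, and that a single mapper receives the same partition from every day. The function name and sample data are hypothetical.

```python
import heapq
from itertools import groupby


def aggregate_month(daily_partitions):
    """Merge pre-sorted daily partitions and aggregate in one streaming pass."""
    merged = heapq.merge(*daily_partitions)  # k-way merge, no full re-sort
    for key, group in groupby(merged, key=lambda record: record[0]):
        events = 0
        uniques = 0
        previous_cookie = object()  # sentinel that never equals a real cookie
        for _, cookie in group:
            events += 1
            if cookie != previous_cookie:
                uniques += 1
                previous_cookie = cookie
        yield key, events, uniques


# Three hypothetical days of the same key partition, each sorted in advance.
day1 = [("A", "c1"), ("A", "c2"), ("B", "c1")]
day2 = [("A", "c1"), ("C", "c3")]
day3 = [("B", "c1"), ("B", "c4"), ("C", "c3")]
monthly = list(aggregate_month([day1, day2, day3]))
print(monthly)
```

The sort cost is paid once per day on a day's worth of data, and the monthly job only merges. Each key's records arrive together at one mapper, so most of the aggregation can finish before the shuffle.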
25.
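Custom Input Format with Map Side Aggregation
[Diagram: the custom input format delivers the input records (B, C, A, B, C, A) so that each of three mappers receives a single key (A, B, or C); a combiner after each mapper passes one record per key to three reducers, which output A, B, and C.]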
26.
Performance of v2 on Various Samples
[Figure: "Aggregation Performance" chart. Y-axis: Time (minutes), 0 to 120. X-axis: Input data size, at 372 GB (3%), 744 GB (6%), 1116 GB (9%), and 10304 GB (100%). Series: Pig, Custom Input Format.]
27.
Partitioning Summary
- Benefits:
  - A large portion of the aggregation can be completed in the map phase
  - Applications can now take advantage of combiners
  - Shuffle sizes are minimal
- Risks:
  - Data locality loss
  - Map failures might result in long run times. This is dependent on the size of the partitions
28.
Full Sample Performance
- Full set of data analysis
  - 10 TB of input data
  - 150 billion session rows
- Total Time
  - 1 hour, 45 minutes
  - Over 23,000,000 rows/sec
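The throughput figure can be checked from the other two numbers on this slide. The comparison with the Windows v1 and v2 rates uses the figures quoted on their slides (roughly 250K and over 8 million rows/sec); those runs used different hardware and data volumes, so the ratios are only a rough indication.

```python
session_rows = 150_000_000_000
elapsed_seconds = 1 * 3600 + 45 * 60  # 1 hour, 45 minutes

rows_per_second = session_rows / elapsed_seconds

# Rates quoted earlier in the deck for the Windows-based versions.
v1_rows_per_second = 250_000
v2_rows_per_second = 8_000_000

speedup_vs_v1 = rows_per_second / v1_rows_per_second
speedup_vs_v2 = rows_per_second / v2_rows_per_second

print(f"{rows_per_second:,.0f} rows/sec")
print(f"{speedup_vs_v1:.1f}x the v1 rate, {speedup_vs_v2:.1f}x the v2 rate")
```

The result is about 23.8 million rows/sec, consistent with the "over 23,000,000 rows/sec" stated above.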
29.
Future Ideas
- HBase
  - Unique cookie calculations are free as data is more organized
  - How will data loading fare?
- Data Locality
  - Ideally it would be great to provide additional clues to the storage of the data
  - Not sure if it will be included in Hadoop
- Connection to an MPP DB
  - We also leverage Greenplum DB; we could connect to each sharded instance
30.
Hadoop Cluster
- Production Hadoop Cluster
  - 80 nodes: Mix of Dell R710 and R510
  - Each R510 has 12x2TB drives, 64GB RAM, and 24 cores
  - 1768 total CPUs
  - 4.7TB total memory
  - 1200TB total disk space
  - Our distro is MapR M5 1.2.7
31.
Useful Factoids
Colorful, bite-sized graphical representations of the best discoveries we unearth. Visit www.comscoredatamine.com or follow @datagems for the latest gems.
32.
Thank You!
Michael Brown, CTO, comScore, Inc.
mbrown@comscore.com