© comScore, Inc. Proprietary.
Using Hadoop to Process a
Trillion+ Events
Michael Brown, CTO | February 28th, 2013
comScore is a leading internet technology company that
provides Analytics for a Digital World™
NASDAQ: SCOR
Clients: 2,100+ Worldwide
Employees: 1,000+
Headquarters: Reston, Virginia, USA
Global Coverage: Measurement from 172 Countries; 44 Markets Reported
Local Presence: 32 Locations in 23 Countries
Big Data: Over 1.5 Trillion Digital Interactions Captured Monthly
Vocabulary for Measuring Information
If a Grain of Sand were One Byte of Information . . .
1 Megabyte = 1 million bytes: a tablespoon of sand
1 Gigabyte = 1 billion bytes: a patch of sand, 9” square, 1’ deep
1 Terabyte = 1 trillion bytes: a sandbox, 24’ square, 1’ deep
1 Petabyte = 1,000 terabytes: a mile-long beach, 100’ wide, 1’ deep
1 Exabyte = 1,000 petabytes: the same beach, from Maine to North Carolina
1 Zettabyte = 1,000 exabytes: the same beach, along the entire US coast
1 Yottabyte = 1,000 zettabytes (24 zeroes): enough info to bury the entire US under 296 feet of sand
Panel Heat Map
Unified Digital Measurement™ (UDM) Establishes Platform For Panel + Census Data Integration
PANEL: Global PERSON Measurement
CENSUS: Global DEVICE Measurement
Unified Digital Measurement (UDM)
Patent-Pending Methodology
Adopted by 90% of Top 100 U.S. Media Properties
Worldwide Tags per Month
[Chart: number of records per month, July 2009 through January 2013, split into Panel Records and Beacon Records; y-axis runs from 0 to 1,600,000,000,000.]
Beacon Heat Map
Our Event Volume in Perspective
Source: comScore MediaMetrix Worldwide December 2012
[Chart: Top 65 WW Properties, Cumulative Page Views; y-axis runs from 0 to 1,600,000.]
Worldwide UDM™ Penetration
December 2012 Penetration Data
Europe
Austria 87%
Belgium 93%
Switzerland 89%
Germany 92%
Denmark 88%
Spain 95%
Finland 93%
France 92%
Ireland 90%
Italy 90%
Netherlands 93%
Norway 91%
Portugal 92%
Sweden 90%
United Kingdom 92%
Asia Pacific
Australia 90%
Hong Kong 95%
India 92%
Japan 82%
Malaysia 93%
New Zealand 91%
Singapore 92%
North America
Canada 94%
United States 91%
Latin America
Argentina 95%
Brazil 96%
Chile 94%
Colombia 95%
Mexico 93%
Puerto Rico 92%
Middle East & Africa
Israel 92%
South Africa 78%
Percentage of Machines Included in UDM Measurement
High Level Data Flow
Panel
Census
ETL
Delivery
Our Cluster
Production Hadoop Cluster
120 nodes: Mix of Dell R720xd, R710 and R510 servers
Each R510 has 12 x 2TB drives, 64GB RAM, and 24 cores
3000+ total CPUs
6.0TB total memory
2PB total disk space
Our distro is MapR M5 2.1.0
The Project:
vCE – Validated Campaign Essentials
comScore - vCE
The Problem Statement
Calculate the number of events and unique cookies for each reportable campaign element
Key takeaways
Input data is aggregated daily
Need to process all data for 3 months
Need to calculate values for every day in the 92-day period, spanning all reportable campaign elements
Structure of the Required Output
Client  Campaign   Population  Location  Cookie Ct  Period
1234    160873284  840         1           863,185  1
1234    160873284  840         1         1,719,738  2
1234    160873284  840         1         2,631,624  3
1234    160873284  840         1         3,572,163  4
1234    160873284  840         1         4,445,508  5
1234    160873284  840         1         5,308,532  6
1234    160873284  840         1         6,032,073  7
1234    160873284  840         1         6,710,645  8
1234    160873284  840         1         7,421,258  9
1234    160873284  840         1         8,154,543  10
Counting Uniques from a Time Ordered Log File
Example log keys, in time order: A, B, C, D, B, A, A
Major Downsides:
Need to keep all key elements in memory.
Constrained to one machine for final aggregation.
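The in-memory approach described on this slide can be sketched as follows. This is a minimal illustration, not comScore's code; the function name `count_uniques_unordered` is made up for the example. It shows why memory grows with the number of distinct keys: every key seen so far has to be remembered until the end of the stream.

```python
def count_uniques_unordered(keys):
    """Count distinct keys in a stream that arrives in arbitrary (e.g. time) order."""
    seen = set()  # grows with the number of distinct keys
    for key in keys:
        seen.add(key)
    return len(seen)


log = ["A", "B", "C", "D", "B", "A", "A"]  # example keys from the slide
print(count_uniques_unordered(log))  # 4
```

With billions of distinct cookies, the `seen` set no longer fits comfortably on one machine, which is the downside the slide points out.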
First Version
Java Map-Reduce application that processes pre-aggregated data from 92 days
Map reads the data and emits each cookie as the key of the key-value pair
All 130B records go through the shuffle
Each Reducer gets all the data for a particular campaign, sorted by cookie
Reducer aggregates the data by grouping key (Client / Campaign / Population) and calculates unique cookies for periods 1-92
Volume grew rapidly, to the point where the daily processing took more than a day
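The shape of this first version can be imitated in a few lines of Python. The original was a Java Map-Reduce job whose code is not shown in the deck, so everything here is an assumption made for illustration: the record layout `(client, campaign, population, day, cookie)`, the function names, and the reading of "period N" as "days 1 through N". The point is that the map step emits one pair per input record, so every record crosses the shuffle.

```python
from collections import defaultdict


def map_phase(records):
    """Emit ((group, cookie), day) for every input record; nothing is aggregated here."""
    for client, campaign, population, day, cookie in records:
        yield ((client, campaign, population), cookie), day


def shuffle(pairs):
    """Group values by key and sort by key, as the framework's shuffle would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(grouped, num_periods):
    """Per group, count cookies first seen on or before each period (assumed meaning)."""
    counts = defaultdict(lambda: [0] * num_periods)
    for (group, _cookie), days in grouped:
        first_day = min(days)
        for period in range(first_day, num_periods + 1):
            counts[group][period - 1] += 1
    return dict(counts)


records = [
    (1234, 160873284, 840, 1, "A"),
    (1234, 160873284, 840, 1, "B"),
    (1234, 160873284, 840, 2, "A"),
    (1234, 160873284, 840, 2, "C"),
    (1234, 160873284, 840, 3, "D"),
]
result = reduce_phase(shuffle(map_phase(records)), num_periods=3)
print(result[(1234, 160873284, 840)])  # [2, 3, 4]
```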
M/R Data Flow
[Diagram: three map tasks read mixed keys (C B, B A, A C); the shuffle groups each key (A A, B B, C C) onto one of three reduce tasks, which output A, B and C.]
Scaling Issue
As our volume has grown, we have the following stats:
Over 500 billion events per month
Daily aggregate: 1.5 billion records
130 billion aggregate records for 92 days
70K Campaigns
Over 50 countries
We see 15 billion distinct cookies in a month
We only need to output 25 million rows
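A quick back-of-the-envelope check of the figures above, using only numbers quoted on this slide (the variable names are illustrative). It shows that 92 days at the quoted daily rate is close to the 130 billion figure, and how heavily the job has to reduce its input.

```python
daily_aggregate_records = 1_500_000_000   # "Daily aggregate: 1.5 billion"
window_days = 92
quoted_window_records = 130_000_000_000   # "130 billion aggregate records for 92 days"
output_rows = 25_000_000                  # "We only need to output 25 million rows"

estimated_window_records = daily_aggregate_records * window_days
reduction_factor = quoted_window_records // output_rows

print(estimated_window_records)  # 138000000000
print(reduction_factor)          # 5200
```

Roughly 5,200 input records collapse into each output row, yet in the first version all of them travel through the shuffle before any collapsing happens.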
Basic Approach Retrospective
Processing speed is not scaling to our needs on a sample of the input data
Diagnosis
Most aggregations could not take significant advantage of combiners.
Large shuffles caused poor job performance. In some cases, large aggregations ran slower on the Hadoop cluster because of the shuffle and key skew in the data.
Conclusion
A new approach is required to reduce the shuffle
Counting Uniques from a Key Ordered Log File
[Diagram: the example log keys (three A, two B, one C, one D) arranged in key order.]
Major Downsides:
Need to sort data in advance.
The sort time increases as volume grows.
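When the input is already sorted by key, counting uniques only needs the previous key, not a set of all keys. The sketch below assumes the sort has been done beforehand; the function name is made up for the example.

```python
def count_uniques_sorted(sorted_keys):
    """Count distinct keys in a stream already sorted by key, using constant memory."""
    count = 0
    previous = object()  # sentinel that never equals a real key
    for key in sorted_keys:
        if key != previous:
            count += 1
            previous = key
    return count


print(count_uniques_sorted(["A", "A", "A", "B", "B", "C", "D"]))  # 4
```

The trade-off is the one the slide lists: memory use is flat, but the cost moves into sorting the data first.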
Counting Uniques from Sharded Key Ordered Log Files
Solution to reduce the shuffle
The Problem:
Most aggregations within comScore cannot take advantage of combiners, leading to large shuffles and job performance issues
The Idea:
Partition and sort the data by cookie on a daily basis
Create a custom InputFormat to merge daily partitions for monthly aggregations
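The merge idea can be illustrated outside Hadoop with a k-way merge. This is only a sketch of the principle: the deck does not show the custom InputFormat, so plain Python lists stand in for the daily cookie-sorted partitions, and `count_uniques_across_days` is a hypothetical name. `heapq.merge` reads the inputs lazily and holds one pending item per input, so no full re-sort of the month is needed.

```python
import heapq
from itertools import groupby


def count_uniques_across_days(daily_sorted_partitions):
    """Merge per-day partitions (each already sorted by cookie) and count distinct cookies."""
    merged = heapq.merge(*daily_sorted_partitions)  # lazy k-way merge of sorted inputs
    return sum(1 for _cookie, _run in groupby(merged))  # one run per distinct cookie


day1 = ["A", "B", "D"]
day2 = ["A", "C"]
day3 = ["B", "D", "E"]
print(count_uniques_across_days([day1, day2, day3]))  # 5
```

Each day is sorted once when it arrives, and a monthly aggregation only pays for the merge.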
Custom Input Format with Map Side Aggregation
[Diagram: three map tasks, each followed by a combiner, pre-aggregate keys A, B and C before three reduce tasks produce the final A, B and C outputs.]
Risks for Partitioning
Data locality
Custom InputFormat requires reading blocks of the partitioned data over the network
This was solved using a feature of the MapR file system. We created volumes and set the chunk size to zero, which guarantees that data written to a volume stays on one node
Map failures might result in long run times
Size of the map inputs is no longer set by block size
This was solved by creating a large number (10K) of volumes to limit the size of data processed by each mapper
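One way to spread cookies over a fixed number of volumes is a stable hash. The deck does not say how cookies were assigned to volumes, so the CRC32 hash and the function names below are assumptions for illustration only. What matters is that the same cookie lands in the same volume every day, so the daily partitions for a volume can later be merged by one mapper.

```python
import zlib

NUM_VOLUMES = 10_000  # the deck mentions roughly 10K volumes


def volume_for(cookie, num_volumes=NUM_VOLUMES):
    """Map a cookie to a volume index with a hash that is stable across runs and machines."""
    return zlib.crc32(cookie.encode("utf-8")) % num_volumes


def partition_day(cookies, num_volumes=NUM_VOLUMES):
    """Split one day's cookies into per-volume lists, each sorted by cookie."""
    volumes = {}
    for cookie in cookies:
        volumes.setdefault(volume_for(cookie, num_volumes), []).append(cookie)
    return {index: sorted(items) for index, items in volumes.items()}
```

Python's built-in `hash()` is avoided here because string hashing is randomized per process by default, which would send the same cookie to different volumes on different days.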
Partitioning Summary
Benefits:
A large portion of the aggregation can be completed in the map phase
Applications can now take advantage of combiners
Shuffle sizes are minimal
Results:
Took a job from 35 hours to 3 hours with no hardware changes
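Why partitioned, cookie-sorted input shrinks the shuffle can be seen in a toy comparison. This is a sketch under the assumption that one map input holds every record for the cookies it contains; the function names are invented for the example. With that property, runs of the same cookie can be collapsed before anything leaves the map task.

```python
from itertools import groupby


def map_without_combiner(split):
    """Emit one (cookie, 1) pair per input record."""
    return [(cookie, 1) for cookie in split]


def map_with_combiner(sorted_split):
    """Collapse each run of the same cookie into a single (cookie, count) pair."""
    return [(cookie, sum(1 for _ in run)) for cookie, run in groupby(sorted_split)]


split = ["A", "A", "A", "B", "B", "C", "D"]  # one cookie-sorted map input
print(len(map_without_combiner(split)))  # 7 pairs would be shuffled
print(len(map_with_combiner(split)))     # 4 pairs would be shuffled
```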
Useful Factoids
Visit www.comscoredatamine.com or follow @datagems for the latest gems.
Colorful, bite-sized graphical representations of the best discoveries we unearth.
Thank You!
Michael Brown
CTO
comScore, Inc.
mbrown@comscore.com
© comScore, Inc. Proprietary. 30
Diagram

Mais conteúdo relacionado

Mais procurados

Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareMapR Technologies
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action MapR Technologies
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsMapR Technologies
 
Distributed graph mining
Distributed graph miningDistributed graph mining
Distributed graph miningSayeed Mahmud
 
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor DataState of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor DataMathieu Dumoulin
 
Big data processing with PubSub, Dataflow, and BigQuery
Big data processing with PubSub, Dataflow, and BigQueryBig data processing with PubSub, Dataflow, and BigQuery
Big data processing with PubSub, Dataflow, and BigQueryThuyen Ho
 
Modern real-time streaming architectures
Modern real-time streaming architecturesModern real-time streaming architectures
Modern real-time streaming architecturesArun Kejariwal
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMapR Technologies
 
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...Mathieu Dumoulin
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageMapR Technologies
 
CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016Mathieu Dumoulin
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data AnalyticsMapR Technologies
 
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...MapR Technologies
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureMapR Technologies
 
Innovating to Create a Brighter Future for AI, HPC, and Big Data
Innovating to Create a Brighter Future for AI, HPC, and Big DataInnovating to Create a Brighter Future for AI, HPC, and Big Data
Innovating to Create a Brighter Future for AI, HPC, and Big Datainside-BigData.com
 
Costing your Bug Data Operations
Costing your Bug Data OperationsCosting your Bug Data Operations
Costing your Bug Data OperationsDataWorks Summit
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionMapR Technologies
 

Mais procurados (18)

Penny Pinching at Scale
Penny Pinching at ScalePenny Pinching at Scale
Penny Pinching at Scale
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
 
Distributed graph mining
Distributed graph miningDistributed graph mining
Distributed graph mining
 
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor DataState of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
 
Big data processing with PubSub, Dataflow, and BigQuery
Big data processing with PubSub, Dataflow, and BigQueryBig data processing with PubSub, Dataflow, and BigQuery
Big data processing with PubSub, Dataflow, and BigQuery
 
Modern real-time streaming architectures
Modern real-time streaming architecturesModern real-time streaming architectures
Modern real-time streaming architectures
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
 
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
 
CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
 
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Innovating to Create a Brighter Future for AI, HPC, and Big Data
Innovating to Create a Brighter Future for AI, HPC, and Big DataInnovating to Create a Brighter Future for AI, HPC, and Big Data
Innovating to Create a Brighter Future for AI, HPC, and Big Data
 
Costing your Bug Data Operations
Costing your Bug Data OperationsCosting your Bug Data Operations
Costing your Bug Data Operations
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
 

Destaque

Enterprise Mobility Management
Enterprise Mobility ManagementEnterprise Mobility Management
Enterprise Mobility Managementeaiti
 
Middleware 2002
Middleware 2002Middleware 2002
Middleware 2002eaiti
 
Stateof cto career_2002
Stateof cto career_2002Stateof cto career_2002
Stateof cto career_2002eaiti
 
Thads globalsoa web2presentation2_2006
Thads globalsoa web2presentation2_2006Thads globalsoa web2presentation2_2006
Thads globalsoa web2presentation2_2006eaiti
 
Presentasi april mei cantik
Presentasi april mei cantikPresentasi april mei cantik
Presentasi april mei cantikwakafquran
 
[하종욱 설명서] IN 기아자동차
[하종욱 설명서] IN 기아자동차[하종욱 설명서] IN 기아자동차
[하종욱 설명서] IN 기아자동차Jong Uk Ha
 
PROMIS Tempus Project
PROMIS Tempus ProjectPROMIS Tempus Project
PROMIS Tempus ProjectPROMISproject
 
Tempus PROMIS Work Packages
Tempus PROMIS Work PackagesTempus PROMIS Work Packages
Tempus PROMIS Work PackagesPROMISproject
 
[FT-7][snowmantw] How to make a new functional language and make the world be...
[FT-7][snowmantw] How to make a new functional language and make the world be...[FT-7][snowmantw] How to make a new functional language and make the world be...
[FT-7][snowmantw] How to make a new functional language and make the world be...Functional Thursday
 
Nitesh Tiwari resume
Nitesh Tiwari resumeNitesh Tiwari resume
Nitesh Tiwari resumeNitesh Tiwari
 
It outsourcing 2005
It outsourcing 2005It outsourcing 2005
It outsourcing 2005eaiti
 
แบบสรุปข้อมูลปรองดองอำเภอแม่ใจ 2557
แบบสรุปข้อมูลปรองดองอำเภอแม่ใจ 2557แบบสรุปข้อมูลปรองดองอำเภอแม่ใจ 2557
แบบสรุปข้อมูลปรองดองอำเภอแม่ใจ 2557LeKy KT
 
How To: Mobile "Hello World" With Xamarin and Visual Studio 2013
How To: Mobile "Hello World" With Xamarin and Visual Studio 2013How To: Mobile "Hello World" With Xamarin and Visual Studio 2013
How To: Mobile "Hello World" With Xamarin and Visual Studio 2013IndyMobileNetDev
 
Dc roundtablesmall webservices_2002
Dc roundtablesmall webservices_2002Dc roundtablesmall webservices_2002
Dc roundtablesmall webservices_2002eaiti
 
[FT-11][suhorng] “Poor Man's” Undergraduate Compilers
[FT-11][suhorng] “Poor Man's” Undergraduate Compilers[FT-11][suhorng] “Poor Man's” Undergraduate Compilers
[FT-11][suhorng] “Poor Man's” Undergraduate CompilersFunctional Thursday
 
Cloud mz cto_roundtable
Cloud mz cto_roundtableCloud mz cto_roundtable
Cloud mz cto_roundtableeaiti
 

Destaque (20)

Vaibhav
VaibhavVaibhav
Vaibhav
 
Enterprise Mobility Management
Enterprise Mobility ManagementEnterprise Mobility Management
Enterprise Mobility Management
 
Middleware 2002
Middleware 2002Middleware 2002
Middleware 2002
 
Stateof cto career_2002
Stateof cto career_2002Stateof cto career_2002
Stateof cto career_2002
 
Thads globalsoa web2presentation2_2006
Thads globalsoa web2presentation2_2006Thads globalsoa web2presentation2_2006
Thads globalsoa web2presentation2_2006
 
Påske - Krim
Påske - KrimPåske - Krim
Påske - Krim
 
Presentasi april mei cantik
Presentasi april mei cantikPresentasi april mei cantik
Presentasi april mei cantik
 
[하종욱 설명서] IN 기아자동차
[하종욱 설명서] IN 기아자동차[하종욱 설명서] IN 기아자동차
[하종욱 설명서] IN 기아자동차
 
PROMIS Tempus Project
PROMIS Tempus ProjectPROMIS Tempus Project
PROMIS Tempus Project
 
Ford
FordFord
Ford
 
Tempus PROMIS Work Packages
Tempus PROMIS Work PackagesTempus PROMIS Work Packages
Tempus PROMIS Work Packages
 
Prashant Kumar
Prashant KumarPrashant Kumar
Prashant Kumar
 
[FT-7][snowmantw] How to make a new functional language and make the world be...
[FT-7][snowmantw] How to make a new functional language and make the world be...[FT-7][snowmantw] How to make a new functional language and make the world be...
[FT-7][snowmantw] How to make a new functional language and make the world be...
 
Nitesh Tiwari resume
Nitesh Tiwari resumeNitesh Tiwari resume
Nitesh Tiwari resume
 
It outsourcing 2005
It outsourcing 2005It outsourcing 2005
It outsourcing 2005
 
แบบสรุปข้อมูลปรองดองอำเภอแม่ใจ 2557
แบบสรุปข้อมูลปรองดองอำเภอแม่ใจ 2557แบบสรุปข้อมูลปรองดองอำเภอแม่ใจ 2557
แบบสรุปข้อมูลปรองดองอำเภอแม่ใจ 2557
 
How To: Mobile "Hello World" With Xamarin and Visual Studio 2013
How To: Mobile "Hello World" With Xamarin and Visual Studio 2013How To: Mobile "Hello World" With Xamarin and Visual Studio 2013
How To: Mobile "Hello World" With Xamarin and Visual Studio 2013
 
Dc roundtablesmall webservices_2002
Dc roundtablesmall webservices_2002Dc roundtablesmall webservices_2002
Dc roundtablesmall webservices_2002
 
[FT-11][suhorng] “Poor Man's” Undergraduate Compilers
[FT-11][suhorng] “Poor Man's” Undergraduate Compilers[FT-11][suhorng] “Poor Man's” Undergraduate Compilers
[FT-11][suhorng] “Poor Man's” Undergraduate Compilers
 
Cloud mz cto_roundtable
Cloud mz cto_roundtableCloud mz cto_roundtable
Cloud mz cto_roundtable
 

Semelhante a Using Hadoop

How to Suceed in Hadoop
How to Suceed in HadoopHow to Suceed in Hadoop
How to Suceed in HadoopPrecisely
 
Concept to production Nationwide Insurance BigInsights Journey with Telematics
Concept to production Nationwide Insurance BigInsights Journey with TelematicsConcept to production Nationwide Insurance BigInsights Journey with Telematics
Concept to production Nationwide Insurance BigInsights Journey with TelematicsSeeling Cheung
 
Lars George - Unaccept the Status Quo
Lars George - Unaccept the Status Quo Lars George - Unaccept the Status Quo
Lars George - Unaccept the Status Quo WeAreEsynergy
 
New Technologies For The Sustainable Enterprise; keynote @Wharton
New Technologies For The Sustainable Enterprise; keynote @WhartonNew Technologies For The Sustainable Enterprise; keynote @Wharton
New Technologies For The Sustainable Enterprise; keynote @WhartonPaul Hofmann
 
Control m customers using big data
Control m customers using big dataControl m customers using big data
Control m customers using big dataJuliette Smit
 
AWS Summit Berlin 2013 - Big Data Analytics
AWS Summit Berlin 2013 - Big Data AnalyticsAWS Summit Berlin 2013 - Big Data Analytics
AWS Summit Berlin 2013 - Big Data AnalyticsAWS Germany
 
Why You Need to Move Your Website to the Cloud
Why You Need to Move Your Website to the CloudWhy You Need to Move Your Website to the Cloud
Why You Need to Move Your Website to the CloudEktron
 
DimenXional Cloud Technologies (slideshare)
DimenXional Cloud Technologies (slideshare)DimenXional Cloud Technologies (slideshare)
DimenXional Cloud Technologies (slideshare)Rick Goldstein
 
Uncover the Root Cause of Kafka Performance Anomalies, Daniel Kim & Antón Rod...
Uncover the Root Cause of Kafka Performance Anomalies, Daniel Kim & Antón Rod...Uncover the Root Cause of Kafka Performance Anomalies, Daniel Kim & Antón Rod...
Uncover the Root Cause of Kafka Performance Anomalies, Daniel Kim & Antón Rod...HostedbyConfluent
 
In memory computing principles by Mac Moore of GridGain
In memory computing principles by Mac Moore of GridGainIn memory computing principles by Mac Moore of GridGain
In memory computing principles by Mac Moore of GridGainData Con LA
 
Apache Druid Design and Future prospect
Apache Druid Design and Future prospectApache Druid Design and Future prospect
Apache Druid Design and Future prospectc-bslim
 
Utilizing Aster nCluster to support processing in excess of 100 Billion rows ...
Utilizing Aster nCluster to support processing in excess of 100 Billion rows ...Utilizing Aster nCluster to support processing in excess of 100 Billion rows ...
Utilizing Aster nCluster to support processing in excess of 100 Billion rows ...Teradata Aster
 
Getting Started with Amazon DynamoDB
Getting Started with Amazon DynamoDBGetting Started with Amazon DynamoDB
Getting Started with Amazon DynamoDBAmazon Web Services
 
AquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks PresentationAquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks PresentationAquaQ Analytics
 
A Framework to Measure and Maximize Cloud ROI
A Framework to Measure and Maximize Cloud ROIA Framework to Measure and Maximize Cloud ROI
A Framework to Measure and Maximize Cloud ROIRightScale
 
AWS Cost Optimization
AWS Cost OptimizationAWS Cost Optimization
AWS Cost OptimizationMiles Ward
 
Everything You Need to Know About Sharding
Everything You Need to Know About ShardingEverything You Need to Know About Sharding
Everything You Need to Know About ShardingMongoDB
 
Building a real-time, scalable and intelligent programmatic ad buying platform
Building a real-time, scalable and intelligent programmatic ad buying platformBuilding a real-time, scalable and intelligent programmatic ad buying platform
Building a real-time, scalable and intelligent programmatic ad buying platformJampp
 
Data Boulders from Space: How DigitalGlobe Uses AWS to Manage Data
Data Boulders from Space: How DigitalGlobe Uses AWS to Manage DataData Boulders from Space: How DigitalGlobe Uses AWS to Manage Data
Data Boulders from Space: How DigitalGlobe Uses AWS to Manage DataAmazon Web Services
 

Semelhante a Using Hadoop (20)

How to Suceed in Hadoop
How to Suceed in HadoopHow to Suceed in Hadoop
How to Suceed in Hadoop
 
Concept to production Nationwide Insurance BigInsights Journey with Telematics
Concept to production Nationwide Insurance BigInsights Journey with TelematicsConcept to production Nationwide Insurance BigInsights Journey with Telematics
Concept to production Nationwide Insurance BigInsights Journey with Telematics
 
Lars George - Unaccept the Status Quo
Lars George - Unaccept the Status Quo Lars George - Unaccept the Status Quo
Lars George - Unaccept the Status Quo
 
New Technologies For The Sustainable Enterprise; keynote @Wharton
New Technologies For The Sustainable Enterprise; keynote @WhartonNew Technologies For The Sustainable Enterprise; keynote @Wharton
New Technologies For The Sustainable Enterprise; keynote @Wharton
 
Control m customers using big data
Control m customers using big dataControl m customers using big data
Control m customers using big data
 
AWS Summit Berlin 2013 - Big Data Analytics
AWS Summit Berlin 2013 - Big Data AnalyticsAWS Summit Berlin 2013 - Big Data Analytics
AWS Summit Berlin 2013 - Big Data Analytics
 
Why You Need to Move Your Website to the Cloud
Why You Need to Move Your Website to the CloudWhy You Need to Move Your Website to the Cloud
Why You Need to Move Your Website to the Cloud
 
DimenXional Cloud Technologies (slideshare)
DimenXional Cloud Technologies (slideshare)DimenXional Cloud Technologies (slideshare)
DimenXional Cloud Technologies (slideshare)
 
Uncover the Root Cause of Kafka Performance Anomalies, Daniel Kim & Antón Rod...
Uncover the Root Cause of Kafka Performance Anomalies, Daniel Kim & Antón Rod...Uncover the Root Cause of Kafka Performance Anomalies, Daniel Kim & Antón Rod...
Uncover the Root Cause of Kafka Performance Anomalies, Daniel Kim & Antón Rod...
 
In memory computing principles by Mac Moore of GridGain
In memory computing principles by Mac Moore of GridGainIn memory computing principles by Mac Moore of GridGain
In memory computing principles by Mac Moore of GridGain
 
Apache Druid Design and Future prospect
Apache Druid Design and Future prospectApache Druid Design and Future prospect
Apache Druid Design and Future prospect
 
comScore
comScorecomScore
comScore
 
Utilizing Aster nCluster to support processing in excess of 100 Billion rows ...
Utilizing Aster nCluster to support processing in excess of 100 Billion rows ...Utilizing Aster nCluster to support processing in excess of 100 Billion rows ...
Utilizing Aster nCluster to support processing in excess of 100 Billion rows ...
 
Getting Started with Amazon DynamoDB
Getting Started with Amazon DynamoDBGetting Started with Amazon DynamoDB
Getting Started with Amazon DynamoDB
 
AquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks PresentationAquaQ Analytics Kx Event - Data Direct Networks Presentation
AquaQ Analytics Kx Event - Data Direct Networks Presentation
 
A Framework to Measure and Maximize Cloud ROI
A Framework to Measure and Maximize Cloud ROIA Framework to Measure and Maximize Cloud ROI
A Framework to Measure and Maximize Cloud ROI
 
AWS Cost Optimization
AWS Cost OptimizationAWS Cost Optimization
AWS Cost Optimization
 
Everything You Need to Know About Sharding
Everything You Need to Know About ShardingEverything You Need to Know About Sharding
Everything You Need to Know About Sharding
 
Building a real-time, scalable and intelligent programmatic ad buying platform
Building a real-time, scalable and intelligent programmatic ad buying platformBuilding a real-time, scalable and intelligent programmatic ad buying platform
Building a real-time, scalable and intelligent programmatic ad buying platform
 
Data Boulders from Space: How DigitalGlobe Uses AWS to Manage Data
Data Boulders from Space: How DigitalGlobe Uses AWS to Manage DataData Boulders from Space: How DigitalGlobe Uses AWS to Manage Data
Data Boulders from Space: How DigitalGlobe Uses AWS to Manage Data
 

Mais de eaiti

Handheld device med_care_2001
Handheld device med_care_2001Handheld device med_care_2001
Handheld device med_care_2001eaiti
 
Ctolinux 2001
Ctolinux 2001Ctolinux 2001
Ctolinux 2001eaiti
 
J2ee 2000
J2ee 2000J2ee 2000
J2ee 2000eaiti
 
Xp presentation 2003
Xp presentation 2003Xp presentation 2003
Xp presentation 2003eaiti
 
Push to pull
Push to pullPush to pull
Push to pulleaiti
 
Intrusion detection 2001
Intrusion detection 2001Intrusion detection 2001
Intrusion detection 2001eaiti
 
Cto forum nirav_kapadia_2006_03_31_2006
Cto forum nirav_kapadia_2006_03_31_2006Cto forum nirav_kapadia_2006_03_31_2006
Cto forum nirav_kapadia_2006_03_31_2006eaiti
 
Mobile 2000
Mobile 2000Mobile 2000
Mobile 2000eaiti
 
Dions globalsoa web2presentation1_2006
Dions globalsoa web2presentation1_2006Dions globalsoa web2presentation1_2006
Dions globalsoa web2presentation1_2006eaiti
 
Ping solutions overview_111904
Ping solutions overview_111904Ping solutions overview_111904
Ping solutions overview_111904eaiti
 
Social apps 3_1_2008
Social apps 3_1_2008Social apps 3_1_2008
Social apps 3_1_2008eaiti
 
Washdc cto-0905-2003
Washdc cto-0905-2003Washdc cto-0905-2003
Washdc cto-0905-2003eaiti
 
Broadband tech 2005
Broadband tech 2005Broadband tech 2005
Broadband tech 2005eaiti
 
Quantum technology
Quantum technologyQuantum technology
Quantum technologyeaiti
 
Hemispheres of Data
Hemispheres of DataHemispheres of Data
Hemispheres of Dataeaiti
 
Greenplum: Driving the future of Data Warehousing and Analytics
Greenplum: Driving the future of Data Warehousing and AnalyticsGreenplum: Driving the future of Data Warehousing and Analytics
Greenplum: Driving the future of Data Warehousing and Analyticseaiti
 

Using Hadoop

  • 1. © comScore, Inc. Proprietary. Using Hadoop to Process a Trillion+ Events Michael Brown, CTO | February 28th, 2013
  • 2. comScore is a leading internet technology company that provides Analytics for a Digital World™. NASDAQ: SCOR. Clients: 2,100+ worldwide. Employees: 1,000+. Headquarters: Reston, Virginia, USA. Global coverage: measurement from 172 countries; 44 markets reported. Local presence: 32 locations in 23 countries. Big data: over 1.5 trillion digital interactions captured monthly.
  • 3. Vocabulary for Measuring Information. If a grain of sand were one byte of information: 1 megabyte = 1 million bytes, a tablespoon of sand; 1 gigabyte = 1 billion bytes, a patch of sand 9” square, 1’ deep; 1 terabyte = 1 trillion bytes, a sandbox 24’ square, 1’ deep; 1 petabyte = 1,000 terabytes, a mile-long beach 100’ wide, 1’ deep; 1 exabyte = 1,000 petabytes, the same beach from Maine to North Carolina; 1 zettabyte = 1,000 exabytes, the same beach along the entire US coast; 1 yottabyte = 1,000 zettabytes (24 zeroes), enough info to bury the entire US under 296 feet of sand.
  • 4. Panel Heat Map
  • 5. Unified Digital Measurement™ (UDM) Establishes Platform for Panel + Census Data Integration. UDM: patent-pending methodology, adopted by 90% of top 100 U.S. media properties. Diagram labels: CENSUS, PANEL, Global PERSON Measurement, Global DEVICE Measurement.
  • 6. Worldwide Tags per Month (chart: number of records per month, July 2009 to January 2013; y-axis from 0 to 1,600,000,000,000; series: Panel Records, Beacon Records)
  • 7. Beacon Heat Map
  • 8. Our Event Volume in Perspective (chart: Top 65 WW Properties, Cumulative Page Views; axis from 0 to 1,600,000). Source: comScore MediaMetrix Worldwide, December 2012.
  • 9. Worldwide UDM™ Penetration, December 2012 penetration data (percentage of machines included in UDM measurement)
      Europe: Austria 87%, Belgium 93%, Switzerland 89%, Germany 92%, Denmark 88%, Spain 95%, Finland 93%, France 92%, Ireland 90%, Italy 90%, Netherlands 93%, Norway 91%, Portugal 92%, Sweden 90%, United Kingdom 92%
      Asia Pacific: Australia 90%, Hong Kong 95%, India 92%, Japan 82%, Malaysia 93%, New Zealand 91%, Singapore 92%
      North America: Canada 94%, United States 91%
      Latin America: Argentina 95%, Brazil 96%, Chile 94%, Colombia 95%, Mexico 93%, Puerto Rico 92%
      Middle East & Africa: Israel 92%, South Africa 78%
  • 10. High Level Data Flow (diagram labels: Panel, Census, ETL, Delivery)
  • 11. Our Cluster. Production Hadoop cluster: 120 nodes, a mix of Dell 720xd, R710 and R510 servers. Each R510 has 12 x 2TB drives, 64GB RAM and 24 cores. Totals: 3000+ CPUs, 6.0TB memory, 2PB disk space. Our distro is MapR M5 2.1.0.
  • 12. The Project: vCE (Validated Campaign Essentials)
  • 13. comScore - vCE
  • 14. The Problem Statement: calculate the number of events and unique cookies for each reportable campaign element. Key takeaways: input data is aggregated daily; all data for 3 months must be processed; values must be calculated for every day in the 92-day period, spanning all reportable campaign elements.
  • 15. Structure of the Required Output
      Client  Campaign   Population  Location  Cookie Ct  Period
      1234    160873284  840         1         863,185    1
      1234    160873284  840         1         1,719,738  2
      1234    160873284  840         1         2,631,624  3
      1234    160873284  840         1         3,572,163  4
      1234    160873284  840         1         4,445,508  5
      1234    160873284  840         1         5,308,532  6
      1234    160873284  840         1         6,032,073  7
      1234    160873284  840         1         6,710,645  8
      1234    160873284  840         1         7,421,258  9
      1234    160873284  840         1         8,154,543  10
  • 16. Counting Uniques from a Time Ordered Log File. Example log: A B C D B A A. Major downsides: all key elements must be kept in memory; final aggregation is constrained to one machine.
  • 17. First Version. A Java Map-Reduce application processes pre-aggregated data from 92 days. Map reads the data and emits each cookie as the key of the key-value pair. All 130B records go through the shuffle. Each reducer gets all the data for a particular campaign, sorted by cookie. The reducer aggregates the data by grouping key (Client / Campaign / Population) and calculates unique cookies for periods 1-92. Volume grew rapidly, to the point that the daily processing took more than a day.
  • 18. M/R Data Flow (diagram labels: Map, Reduce; key groups CB, BA, AC, then AA, BB, CC, then A, B, C)
  • 19. Scaling Issue. As our volume has grown we have the following stats: over 500 billion events per month; daily aggregate of 1.5 billion; 130 billion aggregate records for 92 days; 70K campaigns; over 50 countries; 15 billion distinct cookies seen in a month. We only need to output 25 million rows.
  • 20. Basic Approach Retrospective. Processing speed is not scaling to our needs on a sample of the input data. Diagnosis: most aggregations could not take significant advantage of combiners; large shuffles caused poor job performance; in some cases large aggregations ran slower on the Hadoop cluster due to shuffle and skew in data for keys. Diagnosis: a new approach is required to reduce the shuffle.
  • 21. Counting Uniques from a Key Ordered Log File. Diagram keys: A D B C B A A. Major downsides: data must be sorted in advance; the sort time increases as volume grows.
  • 22. Counting Uniques from a Key Ordered Log File
  • 23. Counting Uniques from Sharded Key Ordered Log Files
  • 24. Solution to Reduce the Shuffle. The problem: most aggregations within comScore cannot take advantage of combiners, leading to large shuffles and job performance issues. The idea: partition and sort the data by cookie on a daily basis; create a custom InputFormat to merge daily partitions for monthly aggregations.
  • 25. Custom Input Format with Map Side Aggregation (diagram labels: Map, Combiner, Reduce; key groups CB, BA, AC, then A, B, C)
  • 26. Risks for Partitioning. Data locality: the custom InputFormat requires reading blocks of the partitioned data over the network. This was solved using a feature of the MapR file system: we created volumes and set the chunk size to zero, which guarantees that the data written to a volume will stay on one node. Map failures might result in long run times: the size of the map inputs is no longer set by block size. This was solved by creating a large number (10K) of volumes to limit the size of data processed by each mapper.
  • 27. Partitioning Summary. Benefits: a large portion of the aggregation can be completed in the map phase; applications can now take advantage of combiners; shuffle sizes are minimal. Results: took a job from 35 hours to 3 hours with no hardware changes.
  • 28. Useful Factoids. Visit www.comscoredatamine.com or follow @datagems for the latest gems: colorful, bite-sized graphical representations of the best discoveries we unearth.
  • 29. Thank You! Michael Brown, CTO, comScore, Inc. mbrown@comscore.com
  • 30. Diagram
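
The counting strategies on slides 16, 21 and 24 can be sketched in a few lines. This is a minimal Python illustration, not comScore's Java Map-Reduce code or its custom InputFormat. The function names, the in-memory lists standing in for log files and daily partitions, and the reading of the Period column on slide 15 as a cumulative unique-cookie count are all assumptions made for the example.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def count_uniques_time_ordered(events):
    """Time-ordered input (slide 16): every key seen so far must stay in memory."""
    seen = set()
    for cookie in events:
        seen.add(cookie)
    return len(seen)


def count_uniques_key_ordered(sorted_events):
    """Key-ordered input (slide 21): only the previous key has to be remembered."""
    count = 0
    previous = object()  # sentinel that compares unequal to every real key
    for cookie in sorted_events:
        if cookie != previous:
            count += 1
            previous = cookie
    return count


def _tag(cookies, day):
    """Attach a day number to each cookie of one daily partition (hypothetical helper)."""
    for cookie in cookies:
        yield cookie, day


def cumulative_uniques(daily_partitions):
    """Merge per-day partitions, each already sorted by cookie (the idea on slide 24).

    daily_partitions[d] holds the sorted cookies for day d + 1. The result lists,
    for each day n, the number of distinct cookies seen on days 1..n.
    """
    streams = [_tag(cookies, day)
               for day, cookies in enumerate(daily_partitions, start=1)]
    first_seen = [0] * len(daily_partitions)
    # heapq.merge streams the partitions in (cookie, day) order without a full re-sort.
    for _, group in groupby(heapq.merge(*streams), key=itemgetter(0)):
        _, day = next(group)  # earliest day for this cookie, since ties order by day
        first_seen[day - 1] += 1
    totals, running = [], 0
    for new_cookies in first_seen:
        running += new_cookies
        totals.append(running)
    return totals


if __name__ == "__main__":
    log = list("ABCDBAA")  # the example keys from slide 16
    print(count_uniques_time_ordered(log))
    print(count_uniques_key_ordered(sorted(log)))
    print(cumulative_uniques([["A", "B", "C"], ["A", "D"], ["B"]]))
```

In this sketch the merge keeps only one pending record per daily partition in memory, which is the property that lets most of the aggregation move to the map side. A real job would apply the same merge independently to each cookie shard rather than to in-memory lists.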