Mais conteúdo relacionado Semelhante a Utilizing Aster nCluster to support processing in excess of 100 Billion rows per month (20) Mais de Teradata Aster (20) Utilizing Aster nCluster to support processing in excess of 100 Billion rows per month1. A Plan for Large Scale Data Analytics
Utilizing Aster nCluster to support processing in excess of 100 Billion
rows per month
Will Duckworth, Director Software Engineering (wduckworth@comscore.com)
2. Agenda
comScore
– Introduction and Technology
MM360 Initiative
The Challenge
Our Analysis
Plans
© comScore, Inc. Proprietary and Confidential. 2
3. comScore, Inc.
Founded in 1999
Publically traded on NASDAQ (SCOR)
Acquired MediaMetrix in 2002, M:Metrics in 2007, Certifica in 2009, and ARSgroup
in 2010
Corporate headquarters: Reston, VA
– Offices in Chicago, NYC, San Francisco, Seattle, Toronto, London, Tokyo, and Paris
– 500+ full-time employees
Experienced senior leadership team with a unique record of innovation in the
market research industry
More than 1200 clients across many industries
© comScore, Inc. Proprietary and Confidential. 3
4. Advising Hundreds of Leading Businesses
(partial list)
Internet Agencies Telecom Financial Retail Travel CPG Pharma Technology
© comScore, Inc. Proprietary and Confidential. 4
5. Powerful Platform: Massive Database and Cost
Effective Technology Infrastructure
Continuous Massive
Operation Operational Scale
■ 24/7 Largest Windows Data
■ 99.99% Uptime Warehouse in the World
■ 1,000 TB of
Patents storage
■ 1,100 Servers
■ 3 Issued Database and ■ 30 TB per month
■ 24 Pending Computational Infrastructure
Cost Effective Highly Scalable,
Proprietary
Distributed Processing
Technology with
System Capex Architecture
Strong IP Protection
< $7M/Year
Sophisticated Technology to Keep Up With Internet Advancements
© comScore, Inc. Proprietary and Confidential. 5
6. Even for us this is getting big…
New Rows per Day (panel vs. non-panel)
12,000
Millions
10,000
8,000
6,000
4,000
2,000
0
6/24/2009 7/24/2009 8/24/2009 9/24/2009 10/24/2009 11/24/2009 12/24/2009 1/24/2010 2/24/2010 3/24/2010
beacon panel
© comScore, Inc. Proprietary and Confidential. 6
7. Where we come from …
Our skill set came from a need to measure Win32
We chose technologies and built a core team around our
mandate to have accurate consumer Internet
measurement
– All Intel Based
– 2/3 Microsoft OS, 1/3 Linux OS
– C++
Now very much a “best tool for the job” organization
© comScore, Inc. Proprietary and Confidential. 7
9. Internet = “The Most Measurable Medium”
100% Accurate count of
server requests, but…
How many real users?
What kind of users are they?
Which request is a valid
Page View?
How long did the users spend
on my site?
© comScore, Inc. Proprietary and Confidential. 9
10. Basic Problem with Servers: No Unique User ID
Web Analytics Approximation
Unique User = Cookie ID (if Cookies can be set)
or IP Address + User Agent
Sounds Simple, But Major Problems:
Cookies are deleted frequently, and the same person can be counted
multiple times
IP Addresses change frequently causing inflation of user counts
In any case, servers identify a machine (or a browser), which can
represent multiple persons or a fraction of the usage of a single person
© comScore, Inc. Proprietary and Confidential. 10
11. Media Metrix 360: Key Benefits for Participating Sites
Comprehensive coverage: 100% of activity
– New “Universe Report” covers mobile and public machines
– Census-adjusted metrics in current Media Metrix reports (Home and
Work)
– Coverage Calculation for beaconing sites
Improved coverage of At-Work population
Harmonization / Reconciliation of panel vs. server
More granularity
More timely reporting
Transparency
© comScore, Inc. Proprietary and Confidential. 11
12. The Challenge
© comScore, Inc. Proprietary and Confidential. 12
13. Goals
Be able to scale to support an initial monthly volume of 160 Billion
records
– Store 3 months of data online
Be able to add incrementally to the environment to support growth
Support advanced analytics
– 150 analysts
Support end user access to record level data, preferably through a
SQL interface
Support the storage of row level data
Have yesterdays data available today
© comScore, Inc. Proprietary and Confidential. 13
14. Existing Internal Systems
NGUA
– Ability to run specific queries for a given time period very quickly because all processing is
parallelized
– Currently holding 560+ days of data; 800B+ rows.
– All traffic for a machine for a month – 1 minute run time (140k records)
– All traffic for pizzahut.com for a month – 4 minutes run time (1.9 million records)
– All traffic from google.com where toys is in the URL – 1 hour 15 minutes (400k records)
■ Fusion
– Primary System used for processing and providing the data behind the majority of
comScore’s products and analysis
– Runs on 32 servers
– For one month we read over 8TB of compressed log files with over 40B rows
– Produces 1.3 B rows and 120 GB of output for load into a DW
– Can turn around the processing in less than 8 hours
Both systems leverage the same core concepts of locality to data and distributed
processing
© comScore, Inc. Proprietary and Confidential. 14
15. Aster Data nCluster
Current Aster environments
– Dev: 1 Queen; 3 Workers; 650+GB total storage
– Prod: 1 Queen; 4 Loaders; 10 Workers; 32TB total storage
Plans
– Building new Prod environment 1 Queen, 70 workers and 10
Loaders / Staging servers
– 350TB total storage
– 432 Cores
© comScore, Inc. Proprietary and Confidential. 15
16. Aster Data nCluster
Table design is key with data of this size
– What is the end user going to do 80% of the time?
On the web, no matter how clean you think your data set
is there are still going to be issues
– 6 Sigma on 10 billion records a day is still nearly 35,000
“bad rows”
– Staging Servers
Looking at using Aster-Hadoop Data Connector for
integration with in-house Hadoop environment
– Aster Data for the analysts
– Hadoop for the developers
© comScore, Inc. Proprietary and Confidential. 16
17. Critical Cost Drivers to factor into the Analysis
Data Centers
– Power is the big issue at data centers today. All allocations
for power and space are based on the number of circuits and
the cost per circuit are all expected to rise
Servers
– Even high end servers have reached relative commodity
prices if you stay to the 2U footprint and standard
components
© comScore, Inc. Proprietary and Confidential. 17