Our need for better scalability in processing weblogs is illustrated by the change in requirements: processing 250 million web events a day versus 1 billion a day (and growing). The Data Warehouse group at CBSi has spent two years transitioning core processes to re-architected Hadoop processes. We will cover strategies used to successfully transition core ETL processes to big data capabilities, and present a how-to guide for re-architecting a mission-critical Data Warehouse environment while it's running.
3. About me
» Worked at AT&T for 16 years building large-scale financial systems: billers, GL, Data Warehouse, cost modeling systems for Network Services, Business and Consumer business units
» Agency.com: several key clients – USSB, DIRECTV, GMACCM, Keyspan
» CNET/CBSi – Director of Data Warehouse ETL systems. Have worked at CNET/CBSi since 2005 re-architecting Data Warehouse systems.
5. Top 20 Web Properties
     Global Unique User Ranking (000)        US Unique User Ranking (000)
 1   Google Sites       1,066,695        1   Google Sites       184,582
 2   Microsoft Sites      914,237        2   Microsoft Sites    178,014
 3   Facebook             769,655        3   Yahoo! Sites       177,123
 4   Yahoo! Sites         701,378        4   Facebook           163,021
 5   Wikimedia Sites      454,529        5   AOL, Inc.          105,861
 6   Amazon Sites         319,548        6   Amazon Sites       103,709
 7   Apple Inc.           264,537        7   Ask Network         91,994
 8   Tencent Inc.         245,220        8   Turner Digital      89,981
 9   CBS Interactive      242,571        9   Glam Media          88,303
10   Ask Network          240,805       10   Wikimedia Sites     83,836
                                        11   CBS Interactive     83,463
Source: Global ranking based on comScore Worldwide MediaMetrix for the month of September 2011. US ranking based on comScore US MediaMetrix for the month of September 2011.
6. Data Warehouse at CBSi
[Diagram: the Data Warehouse at the center, fed by external data sources, web events/click-stream, and internal systems/content management.]
7. Intro – Business functions we support
» Web site/media metrics (BI)
» Website re-design A/B testing
» Financial billers (download, clicks, partners, ads)
» Ad event tracking
» Data feeds for sites
» External reporting
» Custom event tracking servers
– clicks, page views, downloads, streaming video events, ad events, etc.
9. Intro - Data Warehouse Back-End
» Some interesting facts:
- Run over 800 jobs a day
- Peak days are over 500 million events per day with current processing; next quarter it will be 1 billion per day
- Events can spike at 30,000 per second
- Build/maintain over 150 dimensions
- Build 10 fact tables
- Make detailed data available for >24 months retention, up from 2 months previously
- Integrated core DW plus 15 data marts
- Build/maintain > 600 database tables
  - Facts: 10 main tables, 175 fields
  - Dims: 150 tables, 755 fields
  - Summary: 200 tables, 3,432 fields
11. Intro – Problem Domain
» Growth curve, data size delta over time
– Database: from 3 to 300 TB in 3 years
– Cluster: from 1 TB to ~1 PB in 3 years
– Events: from 50 to 150 billion per year
» Special events that cause us angst:
– Tiger Woods, iPhone launches, March Madness, football season, Cyber Monday, ISP/broadband slowdowns (video QOS), Kate's dress, Osama, E3, Comdex, Tom Brady injured, etc.
» Old systems were bleeding; it was really difficult to support new volumes, requirements, and uses
12. Intro – other logistical problems
» Re-architecture is too big for a waterfall approach; it must be phased
» Other surprise/evolved goals/intermediate objectives:
– Colo moves
– New business functions (tracking all streaming video for CBS)
– Swapping in a new database for the Data Warehouse, etc.
» Oh yeah, don't plan on taking any downtime
13. Re-Architecture Goals
» Fix I/O-bound processing
» Get more CPU horsepower
» Move away from proprietary systems (inaccessibility)
» Position for more agile change
» Adapt to a changing organization
» Deal with legacy code
» Do all of the above economically
17. Re-Architecture Strategy
» General Tactics
– Code/re-write
– Divide and conquer
– Moving parts (more or less)
– Paint the ship while it’s moving
– Do the hard stuff first
19. ETL Tactics
» ETL or ELT ?
» System Functions
1. Parsers – most complex/time consuming
2. History file creation/DB loads - reliable
3. Lookups – shared memory
4. Big dimensions – type 2 dimension > 20 Billion rows
5. Sessionize – complex reducer
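To make the "complex reducer" concrete, below is a minimal sketch of a sessionize step written as a Hadoop Streaming reducer in Python. It is illustrative only, not the Lumberjack code: the tab-delimited layout, the 30-minute timeout, and the assumption that input arrives partitioned by visitor id and secondarily sorted by timestamp are all assumptions.

#!/usr/bin/env python
# Sketch of a streaming sessionize reducer (illustrative, not Lumberjack).
# Assumes mapper output is "visitor_id <tab> epoch_seconds <tab> payload",
# partitioned by visitor_id and secondarily sorted by timestamp.
import sys

SESSION_TIMEOUT = 30 * 60  # assumed: 30 minutes of inactivity ends a session

current_visitor = None
last_ts = None
session_num = 0

for line in sys.stdin:
    visitor, ts, payload = line.rstrip("\n").split("\t", 2)
    ts = int(ts)
    if visitor != current_visitor:
        # New visitor: reset session state.
        current_visitor, session_num, last_ts = visitor, 1, None
    elif last_ts is not None and ts - last_ts > SESSION_TIMEOUT:
        # Same visitor, idle past the timeout: start a new session.
        session_num += 1
    last_ts = ts
    # Tag each event with its session number for downstream fact builds.
    print("%s\t%d\t%d\t%s" % (visitor, session_num, ts, payload))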
20. Business/Other tactics
» Disaggregate system to allow re-architecting pieces
» Build bridges
» Begin with easiest SLA
» Start with most challenging data
» Plan for live soft launches in parallel
» Go for high-resource (CPU/IO) elements first
22. Release Tactics
» 16 releases in 24 months
» Allow parallel operation/soft launch capability
» Put bridges in place
23. Hadoop Skunk Works
» We do planning, purchasing, setup, admin, and control until stable
» Plan for turnover to central admin
» Plan for multi-tenancy
25. Hadoop Ecosystem
» M/R Streaming
» HDFS
» CDH2 GA
» Hive
» CDH3
» Other groups: Pig, HBase, ZooKeeper
» SCM
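Streaming is what lets existing pipeline code run as mappers and reducers over stdin/stdout. As a minimal sketch (not CBSi's actual parser), a parser-style mapper might look like this; the field count and counter names are assumptions:

#!/usr/bin/env python
# Sketch of a parser mapper for Hadoop Streaming (illustrative only).
# Reads raw tracking-server log lines, drops malformed ones, and emits
# tab-delimited records keyed by visitor id for the downstream steps.
import sys

EXPECTED_FIELDS = 12  # assumed width of a well-formed raw event

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != EXPECTED_FIELDS:
        # Increment a Hadoop job counter instead of failing the job.
        sys.stderr.write("reporter:counter:parse,malformed,1\n")
        continue
    visitor_id, timestamp = fields[0], fields[1]
    print("\t".join([visitor_id, timestamp] + fields[2:]))

A mapper like this is wired in with the streaming jar's -mapper/-reducer options; the reporter:counter lines on stderr update job counters so bad input is counted rather than fatal.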
26. Partner Strategy
» Internal System Administration
» Internal Platform Infrastructure Group
» External – Cloudera
» Internal Hadoop Admin – turnover control
» Management Support/$
27. Re-Architecture Strategy
» Mitigating risk
– Try not to bleed
– Try not to do it all ourselves
– Get a good application/job management tool
– Get/build a good test framework
– Do lots of testing
– Get support when needed
28. Re-Architecture Strategy
» Dealing with the real world
– New business requirements (e.g. tracking streaming video)
– Colo moves
30. CBSi Hadoop Cluster
[Diagram: web tracking servers, external data sources, and internal BU data sources feed a 1.0 PB Hadoop cluster (CDH2/3); the cluster connects to other internal Hadoop systems, the ETL client, and the 250 TB Data Warehouse/reporting database.]
31. Goals Achieved
» Meeting SLAs more reliably – run-time reduction
– Cut 8 hours from the nightly batch; can process magnitudes more volume
» Relative cost reduction
» Fault tolerance
» Easily scalable/upgradeable
» Economically scalable/upgradeable
» Reliable components
» More manageable/maintainable
» Less reliant on proprietary systems
32. In Summary
» Hadoop is robust for mission critical processing
» Fault tolerance is a reality
» We’ve had excellent experience with stability of the
architecture
» Scalability is practically automatic
» We've learned to plan ahead with scaling to avoid running at too high a percentage of space utilization
33. In Summary
» The Team
– Jim Haas, Dan Lescohier, Michael Sun, Ron Mahoney, Batu Ulug, Slavomir Krysiak, Richard Zhang
» Management Support
– Steph Lone, Guy Bayes
» The ETL package: Lumberjack
– lumberjack@cbsinteractive.com
Editor's Notes
CBSi has over 300 web sites. Notables: Gamespot, CNET, CBSSports, ZOL, PCHome, TV.com
CBS is a top premium content company on the web
50,000-foot view of the DW at CBSi. Simply, the CBSi DW is the sum of all CBSi clickstream/event data, almost all internal systems (in functional/organizational data marts), plus external data (mobile, geo-location data, etc.). Cover high-level DW functions: collect, cleanse, categorize, transform, store, feed.
In summary, CBSi needs the DW for: metrics for informed design, data for sites, and data for billers/deals. These are not mutually exclusive; they can overlap.
We focus on operations/where the rubber meets the road. Some interesting facts: run over 800 jobs a day; peak days are over 500 million events per day with current processing; events can spike at 30,000 per second; build/maintain over 150 dimensions; build 10 fact tables; make detailed data available for >24 months retention, up from 2 months previously; integrated core DW plus 15 data marts; build/maintain >600 database tables (Facts: 10 main tables, 175 fields; Dims: 150 tables, 755 fields; Summary: 200 tables, 3,432 fields).
2008: CNET/CBS merger. 2009/2010: video tracking. 2011: only 3 quarters. 2012: ad events will be processed.
From a cluster/framework perspective, the goal is to get a framework/infrastructure that deals with as many of these as possible. Some of the goals obviously cannot be solved by the cluster framework alone.
Make a distinction between ETL and framework. We have a predilection for building; CNET has a history of building and open-sourcing solutions. But we have purchased some technology for obvious reasons: databases, reporting, job mgt., etc. We had already begun building our own ETL. We ruled out using a service such as Amazon EC2 for a few strategic reasons: we wanted data inside our walls, we wanted control over performance, and we perceived it as more cost effective.
We've done several POCs in the last 4 years (job mgt., database). Due to conditions, we modified our approach after doing paper evaluations of available cluster solutions. We decided to only do a POC of Hadoop, skunk-works style, focused on what we really needed: parallelism, HDFS, scalability, extensibility, rationalizing the costs/benefits, stability and frameworks.
Re-architecture also involved recoding. Section off architectural blocks of the system and attack them in a meaningful way. Good engineering says fewer pieces – so we decided to make more parts, but made them simpler and modular; we disaggregated processes. No down time, so it had to be easy to swap in pieces. Start with a lower-SLA/less-critical set of processes.
Better, faster, cheaper: we concentrated on writing code that was better; mostly we relied on the framework for faster and cheaper.
We had previously decided on ETL; we did not like, nor have much luck with, ELT. This was the order of changing system pieces over in general – THE GENERAL ORDER OF ATTACK:
Parsers – were resource intensive, probably most broken.
History/DB load – we wanted to begin storing history on Hadoop to facilitate/lessen the retention of data in the database, i.e. keep the longer tail of history out of the DB where it's cheaper; this allowed us to eliminate some of the larger DB backup processes.
Lookups – we needed lookups in Hadoop so that we could eliminate passing data between the old system and the new (a common pattern is sketched below).
Big dimensions – URL and title dimensions were quite technically challenging, super-sized dims.
Sessionize – the heart of the system; we could eliminate most data passing between the old cluster and the new once this piece was in. However, this piece was the most complex, risky, and significant, so we built up to it.
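One common way to realize "lookups in shared memory" with streaming jobs is to ship a small dimension extract alongside the job (for example via the streaming -files option) and load it into an in-process dict once per task. The sketch below follows that pattern; the file name and column layout are invented for illustration.

#!/usr/bin/env python
# Sketch of a mapper-side lookup (one common pattern; the file name and
# layout are invented). The dimension file is shipped with the job and
# read once per task, then every event is enriched from the dict.
import sys

def load_lookup(path):
    lookup = {}
    with open(path) as f:
        for line in f:
            key, value = line.rstrip("\n").split("\t", 1)
            lookup[key] = value
    return lookup

geo_by_ip = load_lookup("geo_dim.tsv")  # assumed local name of shipped file

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    # Enrich the event; unknown keys fall back to a default surrogate.
    fields.append(geo_by_ip.get(fields[0], "UNKNOWN"))
    print("\t".join(fields))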
Disaggregate: use an application mgt. system; we abstracted all operational control that was feasible to this top level. Bridges – old cluster to new, offload database to cluster, etc. SLA – not as important at that time that it be done quickly. Tried to build/deploy all pieces as soft launch/parallel wherever possible, then flip a switch to the re-architected version once we see everything flows/works well in the live environment. Started with China: significant volumes, challenging data (Unicode), simpler application flow. Go for high-resource-usage processes that suffered from lengthy processing.
Lots of data checking: since we have ~400 fact fields in 10 fact tables and peaks over 500M recs/day, we needed to really check data thoroughly. The control layer was not so hard; it was easy since we focused on control abstraction. Wrote a couple of tools to do mass data checking: resultant database compares, sampling to the extreme, in-cluster data compares. We could do brute force old vs. new with summaries (a toy version follows). There was lots of data archaeology to determine good versus bad differences; 99% of differences during testing were good, aka the result of better code.
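As a toy version of the brute-force old-vs-new compare, the sketch below rolls up one measure from extracts of both pipelines and reports per-key differences. The file names, field positions, and tolerance are invented for illustration.

#!/usr/bin/env python
# Toy old-vs-new summary compare (illustrative; names and positions are
# invented). Aggregates the same measure from both pipelines' extracts
# and prints keys whose totals differ beyond a tolerance.
from collections import defaultdict

def summarize(path, key_col=0, measure_col=1):
    totals = defaultdict(float)
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            totals[fields[key_col]] += float(fields[measure_col])
    return totals

old, new = summarize("old_fact.tsv"), summarize("new_fact.tsv")
for key in sorted(set(old) | set(new)):
    diff = new.get(key, 0.0) - old.get(key, 0.0)
    if abs(diff) > 0.5:  # assumed tolerance; tune per measure
        print("%s\told=%.1f\tnew=%.1f\tdiff=%.1f"
              % (key, old.get(key, 0.0), new.get(key, 0.0), diff))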
Chunk it up, but not too much. Play it safe: try and see that everything works at scale before it's active. Release and deploy trial bridges/interfaces to the new system before they need to be used.
For speed we did lots ourselves from the beginning; we planned on turning infrastructure/admin over to a central group if all went well. Talked about sharing, but did not concentrate on it or begin acting on it initially; then likewise began in earnest once our 'experiment' was deemed successful.
Get our data pipelines to run as mappers. Upgrade our harvesting to a simpler yet more reliable and faster model. Go for some more difficult but very beneficial pieces; once done, we could really drop flow through the old cluster. Once we are adept at Hadoop and confident in it, move the heart of the system. Expand our Hadoop frameworks: Hive, Pig (other groups), ZooKeeper, etc. Also, perhaps it's a bit of wanting not to be on the bleeding edge and finding stability.
General order of adoption: use M/R and HDFS beginning concurrently, in other words the basics. Decided to go to CDH2 and stick with what's GA for stability/reliability reasons. Went to Hive once we felt the cluster could handle it; significant data stores there to use.
Partnering in the loose sense, both internal and external. Needed sys admins' help to spec, purchase, install, and admin Linux boxes. Needed our PI group to do our custom builds of CDH/packages for compatibility and software infrastructure management. Really Cloudera since we started: using resources, builds, online training, consulting, etc. Need internal Hadoop admins so we can go full force and get on with building apps/systems. Needed mgt. support for obvious reasons.