Our need for better scalability in processing weblogs is illustrated by the change in requirements: processing 250 million web events a day versus 1 billion a day (and growing). The Data Warehouse group at CBSi has spent two years transitioning core processes to re-architected Hadoop processes. We will cover strategies used to successfully transition core ETL processes to big data capabilities, and present a how-to guide for re-architecting a mission-critical Data Warehouse environment while it's running.
3. About me
» Worked at AT&T for 16 years building large-scale financial systems: billers, GL, Data Warehouse, cost modeling systems for Network Services, Business and Consumer business units
» Agency.com: several key clients – USSB, DIRECTV, GMACCM, Keyspan
» CNET/CBSi – Director of Data Warehouse ETL systems. Have worked at CNET/CBSi since 2005 re-architecting Data Warehouse systems.
5. Top 20 Web Properties
     Global Unique User Ranking (000)        US Unique User Ranking (000)
 1   Google Sites       1,066,695        1   Google Sites       184,582
 2   Microsoft Sites      914,237        2   Microsoft Sites    178,014
 3   Facebook             769,655        3   Yahoo! Sites       177,123
 4   Yahoo! Sites         701,378        4   Facebook           163,021
 5   Wikimedia Sites      454,529        5   AOL, Inc.          105,861
 6   Amazon Sites         319,548        6   Amazon Sites       103,709
 7   Apple Inc.           264,537        7   Ask Network         91,994
 8   Tencent Inc.         245,220        8   Turner Digital      89,981
 9   CBS Interactive      242,571        9   Glam Media          88,303
10   Ask Network          240,805       10   Wikimedia Sites     83,836
                                        11   CBS Interactive     83,463
Source: Global ranking based on comScore Worldwide MediaMetrix for the month of September 2011. US ranking based on comScore US MediaMetrix for the month of September 2011.
6. Data Warehouse at CBSi
[Diagram: the Data Warehouse at the center, fed by external data sources, web events/click-stream, and internal systems/content management.]
7. Intro – Business functions we support
» Web site/media metrics (BI)
» Website re-design A/B testing
» Financial billers (download, clicks, partners, ads)
» Ad event tracking
» Data feeds for sites
» External reporting
» Custom event tracking servers
– clicks, page views, downloads, streaming video events, ad events, etc.
9. Intro - Data Warehouse Back-End
» Some interesting facts:
- Run over 800 jobs a day
- Peak days are over 500 million events per day with current processing; next quarter it will be 1 billion per day
- Events can spike at 30,000 per second
- Build/maintain over 150 dimensions
- Build 10 fact tables
- Make detailed data available for >24 months retention, up from 2 months previously
- Integrated core DW plus 15 data marts
- Build/maintain > 600 database tables
  - Facts: 10 main tables, 175 fields
  - Dims: 150 tables, 755 fields
  - Summary: 200 tables, 3,432 fields
11. Intro – Problem Domain
» Growth curve, data size delta over time
– Database: from 3 to 300 TB in 3 years
– Cluster: from 1 TB to ~1 PB in 3 years
– Events: from 50 to 150 billion per year
» Special events that cause us angst:
– Tiger Woods, iPhone launches, March Madness, football season, Cyber Monday, ISP/broadband slowdowns (video QOS), Kate's dress, Osama, E3, Comdex, Tom Brady injured, etc.
» Old systems were bleeding; it was really difficult to support new volumes, requirements, and uses
12. Intro – other logistical problems
» Re-architecture is too big for a waterfall approach; it must be phased
» Other surprise/evolved goals/intermediate objectives:
– Colo moves
– New business functions (tracking all streaming video for CBS)
– Swapping in a new database for the Data Warehouse, etc.
» Oh yeah, don't plan on taking any downtime
13. Re-Architecture Goals
» Fix I/O-bound processing
» Get more CPU horsepower
» Move away from proprietary systems (inaccessibility)
» Position for more agile change
» Adapt to a changing organization
» Deal with legacy code
» Do all of the above economically
17. Re-Architecture Strategy
» General Tactics
– Code/re-write
– Divide and conquer
– Moving parts (more or less)
– Paint the ship while it’s moving
– Do the hard stuff first
19. ETL Tactics
» ETL or ELT ?
» System Functions
1. Parsers – most complex/time consuming
2. History file creation/DB loads - reliable
3. Lookups – shared memory
4. Big dimensions – type 2 dimension > 20 Billion rows
5. Sessionize – complex reducer
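To make the "complex reducer" concrete, below is a minimal sketch of a sessionize step written as a Hadoop Streaming reducer in Python. It is illustrative only, not the Lumberjack code: the tab-delimited layout, the 30-minute timeout, and the assumption that input arrives partitioned by visitor id and secondarily sorted by timestamp are all assumptions.

#!/usr/bin/env python
# Sketch of a streaming sessionize reducer (illustrative, not Lumberjack).
# Assumes mapper output is "visitor_id <tab> epoch_seconds <tab> payload",
# partitioned by visitor_id and secondarily sorted by timestamp.
import sys

SESSION_TIMEOUT = 30 * 60  # assumed: 30 minutes of inactivity ends a session

current_visitor = None
last_ts = None
session_num = 0

for line in sys.stdin:
    visitor, ts, payload = line.rstrip("\n").split("\t", 2)
    ts = int(ts)
    if visitor != current_visitor:
        # New visitor: reset session state.
        current_visitor, session_num, last_ts = visitor, 1, None
    elif last_ts is not None and ts - last_ts > SESSION_TIMEOUT:
        # Same visitor, idle past the timeout: start a new session.
        session_num += 1
    last_ts = ts
    # Tag each event with its session number for downstream fact builds.
    print("%s\t%d\t%d\t%s" % (visitor, session_num, ts, payload))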
20. Business/Other tactics
» Disaggregate system to allow re-architecting pieces
» Build bridges
» Begin with easiest SLA
» Start with most challenging data
» Plan for live soft launches in parallel
» Go for high-resource (CPU/IO) elements first
22. Release Tactics
» 16 releases in 24 months
» Allow parallel operation/soft launch capability
» Put bridges in place
23. Hadoop Skunk Works
» We do planning, purchasing, setup, admin, and control until stable
» Plan for turnover to central admin
» Plan for multi-tenancy
25. Hadoop Ecosystem
» M/R Streaming
» HDFS
» CDH2 GA
» Hive
» CDH3
» Other groups: Pig, HBase, ZooKeeper
» SCM
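Streaming is what lets existing pipeline code run as mappers and reducers over stdin/stdout. As a minimal sketch (not CBSi's actual parser), a parser-style mapper might look like this; the field count and counter names are assumptions:

#!/usr/bin/env python
# Sketch of a parser mapper for Hadoop Streaming (illustrative only).
# Reads raw tracking-server log lines, drops malformed ones, and emits
# tab-delimited records keyed by visitor id for the downstream steps.
import sys

EXPECTED_FIELDS = 12  # assumed width of a well-formed raw event

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != EXPECTED_FIELDS:
        # Increment a Hadoop job counter instead of failing the job.
        sys.stderr.write("reporter:counter:parse,malformed,1\n")
        continue
    visitor_id, timestamp = fields[0], fields[1]
    print("\t".join([visitor_id, timestamp] + fields[2:]))

A mapper like this is wired in with the streaming jar's -mapper/-reducer options; the reporter:counter lines on stderr update job counters so bad input is counted rather than fatal.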
26. Partner Strategy
» Internal System Administration
» Internal Platform Infrastructure Group
» External – Cloudera
» Internal Hadoop Admin – turnover control
» Management Support/$
27. Re-Architecture Strategy
» Mitigating risk
– Try not to bleed
– Try not to do it all ourselves
– Get a good application/job management tool
– Get/build a good test framework
– Do lots of testing
– Get support when needed
28. Re-Architecture Strategy
» Dealing with the real world
– New business requirements (e.g. tracking streaming video)
– Colo moves
30. CBSi Hadoop Cluster
[Diagram: web tracking servers, external data sources, and internal BU data sources feed a 1.0 PB Hadoop cluster (CDH2/3); the cluster connects to other internal Hadoop systems, the ETL client, and the 250 TB Data Warehouse/reporting database.]
31. Goals Achieved
» Meeting SLAs more reliably – run-time reduction
– Cut 8 hours from the nightly batch; can process magnitudes more volume
» Relative cost reduction
» Fault tolerance
» Easily scalable/upgradeable
» Economically scalable/upgradeable
» Reliable components
» More manageable/maintainable
» Less reliant on proprietary systems
32. In Summary
» Hadoop is robust for mission critical processing
» Fault tolerance is a reality
» We’ve had excellent experience with stability of the
architecture
» Scalability is practically automatic
» We've learned to plan ahead with scaling to avoid running at too high a percentage of space utilization
33. In Summary
» The Team
– Jim Haas, Dan Lescohier, Michael Sun, Ron Mahoney, Batu Ulug, Slavomir Krysiak, Richard Zhang
» Management Support
– Steph Lone, Guy Bayes
» The ETL package: Lumberjack
– lumberjack@cbsinteractive.com
Editor's Notes
CBSi has over 300 web sites. Notables: Gamespot, CNET, CBSSports, ZOL, PCHome, TV.com
CBS is a top premium content company on the web
50,000-foot view of the DW at CBSi. Simply, the CBSi DW is the sum of all CBSi clickstream/event data, almost all internal systems (in functional/organizational data marts), plus external data (mobile, geo-location data, etc.). Cover high-level DW functions: collect, cleanse, categorize, transform, store, feed.
In summary, CBSi needs the DW for: metrics for informed design, data for sites, and data for billers/deals. These are not mutually exclusive; they can overlap.
We focus on operations/where the rubber meets the road. Some interesting facts: run over 800 jobs a day; peak days are over 500 million events per day with current processing; events can spike at 30,000 per second; build/maintain over 150 dimensions; build 10 fact tables; make detailed data available for >24 months retention, up from 2 months previously; integrated core DW plus 15 data marts; build/maintain >600 database tables (Facts: 10 main tables, 175 fields; Dims: 150 tables, 755 fields; Summary: 200 tables, 3,432 fields).
2008: CNET/CBS merger. 2009/2010: video tracking. 2011: only 3 quarters. 2012: ad events will be processed.
From a cluster/framework perspective, the goal is to get a framework/infrastructure that deals with as many of these as possible. Some of the goals obviously cannot be solved by the cluster framework alone.
Make a distinction between ETL and framework. We have a predilection for building; CNET has a history of building and open-sourcing solutions. But we have purchased some technology for obvious reasons: databases, reporting, job mgt., etc. We had already begun building our own ETL. We ruled out using a service such as Amazon EC2 for a few strategic reasons: we wanted data inside our walls, we wanted control over performance, and we perceived it as more cost effective.
We've done several POCs in the last 4 years (job mgt., database). Due to conditions, we modified our approach after doing paper evaluations of available cluster solutions. We decided to only do a POC of Hadoop, skunk-works style, focused on what we really needed: parallelism, HDFS, scalability, extensibility, rationalizing the costs/benefits, stability and frameworks.
Re-architecture also involved recoding. Section off architectural blocks of the system and attack them in a meaningful way. Good engineering says fewer pieces – so we decided to make more parts, but made them simpler and modular; we disaggregated processes. No down time, so it had to be easy to swap in pieces. Start with a lower-SLA/less-critical set of processes.
Better, faster, cheaper: we concentrated on writing code that was better; mostly we relied on the framework for faster and cheaper.
We had previously decided on ETL; we did not like, nor have much luck with, ELT. This was the order of changing system pieces over in general – THE GENERAL ORDER OF ATTACK:
Parsers – were resource intensive, probably most broken.
History/DB load – we wanted to begin storing history on Hadoop to facilitate/lessen the retention of data in the database, i.e. keep the longer tail of history out of the DB where it's cheaper; this allowed us to eliminate some of the larger DB backup processes.
Lookups – we needed lookups in Hadoop so that we could eliminate passing data between the old system and the new (a common pattern is sketched below).
Big dimensions – URL and title dimensions were quite technically challenging, super-sized dims.
Sessionize – the heart of the system; we could eliminate most data passing between the old cluster and the new once this piece was in. However, this piece was the most complex, risky, and significant, so we built up to it.
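One common way to realize "lookups in shared memory" with streaming jobs is to ship a small dimension extract alongside the job (for example via the streaming -files option) and load it into an in-process dict once per task. The sketch below follows that pattern; the file name and column layout are invented for illustration.

#!/usr/bin/env python
# Sketch of a mapper-side lookup (one common pattern; the file name and
# layout are invented). The dimension file is shipped with the job and
# read once per task, then every event is enriched from the dict.
import sys

def load_lookup(path):
    lookup = {}
    with open(path) as f:
        for line in f:
            key, value = line.rstrip("\n").split("\t", 1)
            lookup[key] = value
    return lookup

geo_by_ip = load_lookup("geo_dim.tsv")  # assumed local name of shipped file

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    # Enrich the event; unknown keys fall back to a default surrogate.
    fields.append(geo_by_ip.get(fields[0], "UNKNOWN"))
    print("\t".join(fields))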
Disaggregate: use an application mgt. system; we abstracted all operational control that was feasible to this top level. Bridges – old cluster to new, offload database to cluster, etc. SLA – not as important at that time that it be done quickly. Tried to build/deploy all pieces as soft launch/parallel wherever possible, then flip a switch to the re-architected version once we see everything flows/works well in the live environment. Started with China: significant volumes, challenging data (Unicode), simpler application flow. Go for high-resource-usage processes that suffered from lengthy processing.
Lots of data checking: since we have ~400 fact fields in 10 fact tables and peaks over 500M recs/day, we needed to really check data thoroughly. The control layer was not so hard; it was easy since we focused on control abstraction. Wrote a couple of tools to do mass data checking: resultant database compares, sampling to the extreme, in-cluster data compares. We could do brute force old vs. new with summaries (a toy version follows). There was lots of data archaeology to determine good versus bad differences; 99% of differences during testing were good, aka the result of better code.
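As a toy version of the brute-force old-vs-new compare, the sketch below rolls up one measure from extracts of both pipelines and reports per-key differences. The file names, field positions, and tolerance are invented for illustration.

#!/usr/bin/env python
# Toy old-vs-new summary compare (illustrative; names and positions are
# invented). Aggregates the same measure from both pipelines' extracts
# and prints keys whose totals differ beyond a tolerance.
from collections import defaultdict

def summarize(path, key_col=0, measure_col=1):
    totals = defaultdict(float)
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            totals[fields[key_col]] += float(fields[measure_col])
    return totals

old, new = summarize("old_fact.tsv"), summarize("new_fact.tsv")
for key in sorted(set(old) | set(new)):
    diff = new.get(key, 0.0) - old.get(key, 0.0)
    if abs(diff) > 0.5:  # assumed tolerance; tune per measure
        print("%s\told=%.1f\tnew=%.1f\tdiff=%.1f"
              % (key, old.get(key, 0.0), new.get(key, 0.0), diff))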
Chunk it up, but not too much. Play it safe: try and see that everything works at scale before it's active. Release and deploy trial bridges/interfaces to the new system before they need to be used.
For speed we did lots ourselves from the beginning; we planned on turning infrastructure/admin over to a central group if all went well. Talked about sharing, but did not concentrate on it or begin acting on it initially; then likewise began in earnest once our 'experiment' was deemed successful.
Get our data pipelines to run as mappers. Upgrade our harvesting to a simpler yet more reliable and faster model. Go for some more difficult but very beneficial pieces; once done, we could really drop flow through the old cluster. Once we are adept at Hadoop and confident in it, move the heart of the system. Expand our Hadoop frameworks: Hive, Pig (other groups), ZooKeeper, etc. Also, perhaps it's a bit of wanting not to be on the bleeding edge and finding stability.
General order of adoption: use M/R and HDFS beginning concurrently, in other words the basics. Decided to go to CDH2 and stick with what's GA for stability/reliability reasons. Went to Hive once we felt the cluster could handle it; significant data stores there to use.
Partnering in the loose sense, both internal and external. Needed sys admins' help to spec, purchase, install, and admin Linux boxes. Needed our PI group to do our custom builds of CDH/packages for compatibility and software infrastructure management. Really Cloudera since we started: using resources, builds, online training, consulting, etc. Need internal Hadoop admins so we can go full force and get on with building apps/systems. Needed mgt. support for obvious reasons.