Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interactive - Michael Sun - CBSi
1. Building Web Analytics on Hadoop at CBS Interactive Michael Sun [email_address] Hadoop World November 8 2011
2. Deep Thoughts. $1: What is cloud computing? Convenient, on-demand, scalable, network-accessible service of the shared pool of computing resources. $2: What is fog computing? Local cloud computing. $5: What is vapor computing? Vaporware.
4. Brands and Websites of CBS Interactive, Samples: GAMES & MOVIES; TECH, BIZ & NEWS; SPORTS; ENTERTAINMENT; MUSIC
10. Web Analytics Hadoop (diagram labels): External data sources; HDFS; Python-ETL; MapReduce; Hive; DW Database; Sites; Apache Logs; Distribute log by Fido; Web metrics; Billers; Data mining; CMS Systems
18. Input to Sessionize: Takes the output of the parsing step for the following event types: page impression, click-payable, click-nonpayable, video tracking, and optimization.
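A minimal sketch of how parsed events of these types might be grouped into sessions. This is not the code from the talk: the `(user_id, timestamp, event_type)` record shape, the 30-minute inactivity timeout, and the `sessionize` function name are all assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Assumed inactivity gap that closes a session (not stated in the slides).
SESSION_TIMEOUT_SECONDS = 30 * 60

# Event types named on the slide as input to sessionization.
EVENT_TYPES = {
    "page_impression",
    "click_payable",
    "click_nonpayable",
    "video_tracking",
    "optimization",
}


def sessionize(events, timeout=SESSION_TIMEOUT_SECONDS):
    """Group (user_id, timestamp, event_type) tuples into sessions.

    Returns a list of (user_id, session_index, events) tuples. A new session
    starts whenever the gap between two consecutive events of the same user
    exceeds `timeout` seconds.
    """
    # Drop unknown event types, then sort by user and time so groupby works.
    known = sorted(
        (e for e in events if e[2] in EVENT_TYPES),
        key=itemgetter(0, 1),
    )
    sessions = []
    for user_id, user_events in groupby(known, key=itemgetter(0)):
        current, index, last_ts = [], 0, None
        for event in user_events:
            ts = event[1]
            if last_ts is not None and ts - last_ts > timeout:
                # Gap too large: close the current session, start a new one.
                sessions.append((user_id, index, current))
                current, index = [], index + 1
            current.append(event)
            last_ts = ts
        if current:
            sessions.append((user_id, index, current))
    return sessions
```

In a MapReduce setting the sort-then-group step would typically come from the shuffle phase, so a reducer would only need the inner loop.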
23. Web Analytics Hadoop (diagram labels): External data sources; HDFS; Python-ETL; MapReduce; Hive; DW Database; Sites; Apache Logs; Distribute log by Fido; Web metrics; Billers; Data mining; CMS Systems
27. The Team (alphabetical order): Batu Ulug; Dan Lescohier; Jim Haas (presenting “Hadoop in Mission-critical Environment”); Michael Sun; Richard Zhang; Ron Mahoney; Slawomir Krysiak
29. Abstract: CBS Interactive successfully adopted Hadoop as its web analytics platform, processing one billion weblogs daily from the hundreds of website properties that CBS Interactive oversees. After introducing Lumberjack, the Extraction, Transformation and Loading (ETL) framework we built on Python and streaming, which is under review for open-source release, Michael will talk about web metrics processing on Hadoop, focusing on weblog harvesting, parsing, dimension look-up, sessionization, and loading into a database. Since migrating processing from a proprietary platform to Hadoop, CBS Interactive has achieved robustness, fault tolerance, and scalability, along with a significant reduction in the processing time needed to meet its SLA (over six hours saved so far).
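As a rough illustration of the parsing and dimension look-up steps named in the abstract, the sketch below parses one weblog line and attaches a site key. It is not Lumberjack code: the Apache Common Log Format layout, the `SITE_DIMENSION` table, and the `parse_line` and `to_tsv` helpers are hypothetical stand-ins, since the talk does not give the real log layout or dimension tables.

```python
import re
from datetime import datetime

# Apache Common Log Format prefix; the actual CBSi log layout is an assumption.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

# Hypothetical dimension table: URL path prefix -> surrogate site key.
SITE_DIMENSION = {"/news": 1, "/sports": 2}
UNKNOWN_SITE_KEY = 0


def parse_line(line):
    """Parse one log line into a dict, or return None if it is malformed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    ts = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    path = match.group("path")
    # Dimension look-up: first prefix that matches the path, else "unknown".
    site_key = next(
        (key for prefix, key in SITE_DIMENSION.items() if path.startswith(prefix)),
        UNKNOWN_SITE_KEY,
    )
    return {
        "host": match.group("host"),
        "timestamp": ts.isoformat(),
        "path": path,
        "status": int(match.group("status")),
        "site_key": site_key,
    }


def to_tsv(record):
    """Serialize a parsed record as one tab-separated line for a downstream step."""
    fields = ("host", "timestamp", "path", "status", "site_key")
    return "\t".join(str(record[name]) for name in fields)
```

A streaming-style mapper could call `parse_line` on each line of standard input, skip the `None` results, and print `to_tsv(record)` for the rest.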
Editor's Notes
CBSi has a number of brands; this slide shows the biggest ones.
We have a lot of traffic and data. We’ve been using Hadoop quite extensively for a few years now.