Chicago HUG Presentation Oct 2011

GENTLE STROLL DOWN THE ANALYTICS MEMORY LANE
Abe Taha
VP Engineering, Karmasphere
Oct 19th, 2011

© Karmasphere 2011 All rights reserved

What is this talk about

• This talk is a story about building an analytics services team at
Ning and the experiences and lessons learned
• There is also a bit about how I’d do things differently
• And like a good story, an ending


Caveat Lector

• The story has no pictures or conversations
• “And what is the use of a book,” thought Alice, “without
pictures or conversations?”

Alice’s Adventures in Wonderland, Lewis Carroll

Your storyteller

• Mostly scalable distributed systems background
• At Yahoo: Search and Social Search
• At Google: App infrastructure
• At Ning: Hadoop for Analytics and System Management services
• At Ask: Dictionary/Reference properties
• Now at Karmasphere building analytics applications on Hadoop


Prologue

• The story begins at Ning
• Starting analytics and systems management teams
• In 2008
• When Hadoop was gaining popularity
• v0.16 was out


A bit about Ning

• Hot company at the time, co-founded by Andreessen
• Allowed users to build websites that look like Facebook
• Websites called networks
• Networks had social features
• Blogs
• Photos
• Videos
• Chat
• Social graph
• Each network had a major topic/category
• Most networks were free, few for pay
• Free networks monetized through contextual ads
• The theory was that people produce good content that you can
monetize


Raison d’etre for the analytics team

• Figure out what ads to display on the network
• Look at user generated content (UGC)
• Posts
• Comments and discussions
• Tags on photos and videos
• Come up with categories for networks and ads
• Model network trends and business metrics
• Predict serving machine growth (poor man’s EC2)
• Model machine and application data (poor man’s EC2)
• Memory, disk, CPU, network
• Application logs, counters, etc


First: building the team

• The data scientist title was not common then, so we hired the next best thing: engineers
• Distributed systems engineers (3) for the infrastructure
• Statistics and ML engineers (2) for modeling and trending
• Data visualization engineers (1) for building dashboards to interact
with the data
• Systems management engineers (2) for building the machine
monitoring systems


Second: figuring out where the data is

• Typical company scenario
• Data resides in log files
• Machine or application logs
• Stored locally
• Purged after 30 days


Third: where to keep the data

• Wanted to keep all the historical data
• In a centralized place
• Without paying too much money
• Or using specialized hardware
• Ruled out a data warehouse (DW)
• Had experience with systems that looked like Hadoop (or
Hadoop looked like them)
• Team wanted to experiment with newer technology
• -> Data in Hadoop
• V1: POC


V1: getting data in

• Minor changes to store all machine and application logs on NFS
drive
• A couple of retired NetApp filers
• Log files copied into HDFS using the Hadoop client
• Data organized by source in a directory hierarchy
• Grouped by date
• No preprocessing
• 3x replication
• Some latency in moving the data
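A minimal sketch of the "by source, grouped by date" layout described above. The root `/logs`, the year/month/day nesting, and the function names are assumptions for illustration; the talk does not give the actual paths.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical HDFS root; the real directory names are not given in the talk.
LOG_ROOT = PurePosixPath("/logs")


def hdfs_log_dir(source: str, day: date) -> PurePosixPath:
    """Directory holding one source's raw logs for one day."""
    return LOG_ROOT / source / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}"


def hdfs_log_path(source: str, day: date, filename: str) -> PurePosixPath:
    """Full target path for a log file copied in without preprocessing."""
    return hdfs_log_dir(source, day) / filename
```

With a layout like this, a job that needs one source for one date range only has to list the matching directories.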


V1: now what

• Custom Java map-reduce programs to process the data
• Support libraries to parse different log file formats
• Jobs did simple analytics
• Averages
• Network response times
• User engagement
• Trends per network
• Active users
• Pageviews
• Most common/popular
• Browsers, pages, queries
• Indexing
• Machine utilization
• Simple scheduler to run jobs
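The jobs above were custom Java map-reduce programs. As a rough illustration of the shape of one such job (average response time per network), here is a plain-Python sketch that imitates the map, shuffle, and reduce steps in memory. The tab-separated `network_id<TAB>response_ms` line format and all function names are assumptions, not the team's actual log format or code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_line(line: str) -> Tuple[str, float]:
    """Parse a hypothetical 'network_id<TAB>response_ms' log line."""
    network_id, response_ms = line.rstrip("\n").split("\t")
    return network_id, float(response_ms)


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, float]]:
    """Emit (network_id, response_ms) for every non-empty line."""
    for line in lines:
        if line.strip():
            yield parse_line(line)


def shuffle(pairs: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Group values by key, as the framework does between map and reduce."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[float]]) -> Dict[str, float]:
    """Average the response times collected for each network."""
    return {key: sum(values) / len(values) for key, values in groups.items()}


def average_response_times(lines: Iterable[str]) -> Dict[str, float]:
    return reduce_phase(shuffle(map_phase(lines)))
```

Most of the other metrics in the list (active users, pageviews, most popular pages) follow the same pattern with a different key and a different reduce function.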


V1: dashboarding

• Results stored in flat files in HDFS
• Grouped daily/weekly/monthly
• Use gnuplot to build dashboards every hour
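A small sketch of the daily/weekly/monthly grouping, assuming the input is a list of `(date, value)` pairs for one metric. The period key formats (ISO week, `YYYY-MM`) and the `rollup` name are illustrative choices, not taken from the talk.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple


def rollup(daily: Iterable[Tuple[date, float]], period: str) -> Dict[str, float]:
    """Sum daily values into 'daily', 'weekly' (ISO week), or 'monthly' buckets."""
    totals: Dict[str, float] = defaultdict(float)
    for day, value in daily:
        if period == "daily":
            key = day.isoformat()
        elif period == "weekly":
            iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
            key = f"{iso[0]}-W{iso[1]:02d}"
        elif period == "monthly":
            key = f"{day:%Y-%m}"
        else:
            raise ValueError(f"unknown period: {period}")
        totals[key] += value
    return dict(totals)
```

Each bucket can then be written out as one line of a flat file for the plotting step.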


What did we learn from V1

• POC proved viability of Hadoop
• Latency of pulling files was an issue
• Most of the metrics computations are of the same nature
• People need flexibility in defining what is measured
• Once you put data in front of people, they ask more questions
• POC shows which areas are a pain, and where to invest to fix


V2: changing data ingestion

• Use event records instead of log files
• Pushed through HTTP
• Built using Thrift
• Events have
• Names
• Timestamps
• Host
• Version
• Payloads
• Published catalog
• All available events
• Event parsers
• Load ~50 million external page views (~10 events per page)
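To make the event record concrete, here is a sketch with the five fields listed above. The real events were defined in Thrift; this version uses a Python dataclass with JSON encoding as a stand-in, and the field types (for example, timestamp as epoch milliseconds) are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class Event:
    name: str                 # event type, as listed in the published catalog
    timestamp: int            # assumed: milliseconds since the Unix epoch
    host: str                 # machine that emitted the event
    version: int              # schema version of this event type
    payload: Dict[str, Any] = field(default_factory=dict)


def encode(event: Event) -> bytes:
    """Serialize an event for an HTTP POST body (JSON stand-in for Thrift)."""
    return json.dumps(asdict(event), sort_keys=True).encode("utf-8")


def decode(body: bytes) -> Event:
    """Rebuild an event from a request body produced by encode()."""
    return Event(**json.loads(body.decode("utf-8")))
```

Carrying the name and version in every record is what lets a collector look the event up in the catalog and pick the right parser.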

V2: collectors

• Receive events
• Put in a memory queue
• Background processes store to local disk
• Check events for validity against catalog
• Separate into valid/invalid queues
• Another process pulls data into HDFS and organizes it in a
directory hierarchy
• Events
• Grouped by date
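The validity check can be sketched as follows, assuming the catalog maps each event name to the payload fields it requires. The catalog contents, the dict-shaped events, and the function names are hypothetical; the talk only says events are checked against the catalog and split into valid and invalid queues.

```python
import queue
from typing import Any, Dict, List, Set, Tuple

# Hypothetical catalog: event name -> payload fields that must be present.
CATALOG: Dict[str, Set[str]] = {
    "page_view": {"network_id", "url"},
    "photo_upload": {"network_id", "photo_id"},
}


def is_valid(event: Dict[str, Any], catalog: Dict[str, Set[str]]) -> bool:
    """An event is valid if its name is cataloged and its payload has the required fields."""
    required = catalog.get(event.get("name"))
    if required is None:
        return False
    payload = event.get("payload")
    return isinstance(payload, dict) and required.issubset(payload)


def drain(incoming: "queue.Queue", catalog: Dict[str, Set[str]]) -> Tuple[List[dict], List[dict]]:
    """Empty the in-memory queue and separate events into (valid, invalid)."""
    valid: List[dict] = []
    invalid: List[dict] = []
    while True:
        try:
            event = incoming.get_nowait()
        except queue.Empty:
            break
        (valid if is_valid(event, catalog) else invalid).append(event)
    return valid, invalid
```

Keeping the invalid events instead of dropping them makes it possible to see which producers are sending events that are not in the catalog.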


V2: computation abstraction

• Common tasks
• Projection
• What fields am I interested in
• Filtering
• What records am I interested in
• Aggregations
• What do I want to do with the metrics
• Common readers and writers for data types
• Captured in libraries that can be composed for complex
analytics
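A sketch of how the three common tasks can be captured as small composable functions. The record shape (plain dicts) and the names `project`, `keep`, and `aggregate` are illustrative; the original libraries were Java and are not shown in the talk.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, Iterator, List, Sequence

Record = Dict[str, Any]


def project(records: Iterable[Record], fields: Sequence[str]) -> Iterator[Record]:
    """Projection: keep only the fields of interest."""
    for record in records:
        yield {name: record.get(name) for name in fields}


def keep(records: Iterable[Record], predicate: Callable[[Record], bool]) -> Iterator[Record]:
    """Filtering: keep only the records of interest."""
    return (record for record in records if predicate(record))


def aggregate(
    records: Iterable[Record],
    key: str,
    value: str,
    fn: Callable[[List[Any]], Any],
) -> Dict[Any, Any]:
    """Aggregation: group values by key and reduce each group with fn."""
    groups: Dict[Any, List[Any]] = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    return {k: fn(values) for k, values in groups.items()}
```

Page views per network then becomes `aggregate(project(keep(events, is_page_view), ["network_id", "ms"]), "network_id", "ms", len)`, where `is_page_view` is a caller-supplied predicate; swapping `len` for `max` or a mean gives a different metric from the same pipeline.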


V2: better dashboards

• Metrics summarized in MySQL databases
• Interactive dashboards using Ruby/Sinatra
• Select metrics
• Time range
• Aggregation method
• Plot results using FusionCharts
• OpenCharts was a close second, but no combined charts
(Histograms, line charts)


What did we learn from V2

• HDFS I/O is better than the local disk
• No need for the process that saves locally then to HDFS
• People loved events
• Led to event abuse
• Each feature on the page had an associated event
• Events were used for performance tuning: how much time did a feature
take
• Events were used for monitoring backend features: record errors with
services
• A large number of files causes problems for the namenode
• Need to coalesce events to reduce the number of files
• With flexible event types, and interactive dashboards, people have
more questions
• We couldn’t keep up with developing custom metrics and charts
• Needed a self serve query mechanism


V3: ingestion

• Minor modifications
• Collectors now write to HDFS
• Collectors accumulate events to reduce the number of files
• Self serve UI for defining new events outside of the metrics
team
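Accumulating events before writing can be sketched as a small buffering writer. The size-based threshold and the `flush` callback (standing in for "write one file to HDFS") are assumptions; the talk does not say whether the collectors flushed by count, size, or time.

```python
from typing import Callable, List


class BatchingWriter:
    """Buffer events and hand them off in batches, one batch per output file."""

    def __init__(self, flush: Callable[[List[bytes]], None], max_events: int = 10_000):
        if max_events < 1:
            raise ValueError("max_events must be at least 1")
        self._flush = flush            # hypothetical: writes one file to HDFS
        self._max_events = max_events
        self._buffer: List[bytes] = []

    def write(self, event: bytes) -> None:
        """Add one event; flush automatically when the buffer is full."""
        self._buffer.append(event)
        if len(self._buffer) >= self._max_events:
            self.flush()

    def flush(self) -> None:
        """Write out whatever is buffered, if anything."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._flush(batch)
```

With `max_events=10_000`, a million events produce about a hundred files instead of a million, which is the kind of reduction the namenode needs.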


V3: computation

• Need a higher level language for query
• JSON API exposing a search like query syntax
• {from: 'date', to: 'date', metric: 'x', computation}
• Computations are encapsulated into libraries and exposed
through JSON
• Users can add metrics and computations and build frontends for
the query language
• Custom code for ML tasks
• Cascading for algorithms
• R for visualization
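A sketch of how a query of that shape might be dispatched to a library of computations. It assumes the query carries a `computation` key naming a registered function, ISO-formatted dates, and an in-memory store of `(date, value)` pairs per metric; none of these details are given in the talk, and the real backend ran over Hadoop.

```python
import json
from datetime import date
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical registry: computation name -> function over the selected values.
COMPUTATIONS: Dict[str, Callable[[Sequence[float]], float]] = {
    "sum": sum,
    "avg": mean,
    "max": max,
}

# Hypothetical store shape: metric name -> list of (day, value).
Store = Dict[str, List[Tuple[date, float]]]


def run_query(query_json: str, store: Store) -> float:
    """Evaluate {"from": ..., "to": ..., "metric": ..., "computation": ...}."""
    query = json.loads(query_json)
    start = date.fromisoformat(query["from"])
    end = date.fromisoformat(query["to"])
    compute = COMPUTATIONS[query["computation"]]
    values = [v for day, v in store[query["metric"]] if start <= day <= end]
    if not values:
        raise ValueError("no data for metric in the requested range")
    return compute(values)
```

Adding a new computation is then a matter of registering one more function, which is what lets users extend the query language without going through the metrics team.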


V3: dashboards

• More intermediate data precomputed
• Data stored in HBase
• Dashboards go against HBase
• Templates for users to build custom dashboards


V3: What did we learn

• Self serve is the way to go
• Give people the infrastructure and the support libraries and
they’ll go to town
• Some tasks still can’t be done in a framework and need custom
code
• Machine learning, with analysis in R
• ML is hard, even with experience
• Data is not clean
• Some content is very small
• Comments on pictures and videos (workarounds for aggregation)
• Even then you can build products around the results
• People and network recommenders
• Network categories for ads


How would we do it differently today

• Open source obviates custom code
• Scribe for data ingestion
• Hive for self serve analytics and business intelligence
• Pig scripts subsume most of the Java code
• Cascading for Java map-reduce
• Dashboards still stay the same


Epilogue

• ML analysis showed most usage is spam
• Shut down a lot of pr0n networks and video hosting networks in
East Asia
• Team moved to different companies
• Still in analytics at LinkedIn, Facebook, and Twitter
• Company changed its business model to paid-only and laid off
half the staff 6 months later
• Company acquired recently


Takeaway

• The problems and solutions are mostly the same everywhere
• Getting data into Hadoop
• How do you compute over the data
• Getting meaningful data out of Hadoop
• Lots of software components exist to help you with these
• It is about the balance of what you develop vs what you acquire


Q&A


The Leader in Big Data Intelligence on Hadoop

www.karmasphere.com
