GENTLE STROLL DOWN
            THE ANALYTICS MEMORY
            LANE
            Abe Taha
            VP Engineering, Karmasphere
            Oct 19th, 2011




1 © Karmasphere 2011 All rights reserved
What is this talk about

    • This talk is a story about building an analytics services team at
      Ning and the experiences and lessons learned
    • There is also a bit about how I’d do things differently
    • And like a good story, an ending




Caveat Lector

    • The story has no pictures or conversations
    • “And what is the use of a book,” thought Alice, “without
      pictures or conversations?”




             Alice’s Adventures in Wonderland, Lewis Carroll
Your storyteller

    • Mostly scalable distributed systems background
          •    At Yahoo: Search and Social Search
          •    At Google: App infrastructure
          •    At Ning: Hadoop for Analytics and System Management services
          •    At Ask: Dictionary/Reference properties
    • Now at Karmasphere building analytics applications on Hadoop




Prologue

    • The story begins at Ning
    • Starting analytics and systems management teams
    • In 2008
    • When Hadoop was gaining popularity
    • v0.16 was out




A bit about Ning

    • Hot company at the time, co-founded by Marc Andreessen
    • Allowed users to build websites that look like Facebook
    • Websites called networks
    • Networks had social features
          •    Blogs
          •    Photos
          •    Videos
          •    Chat
          •    Social graph
    • Each network had a major topic/category
    • Most networks were free, a few were paid
    • Free networks monetized through contextual ads
    • The theory was that people produce good content that you can
      monetize

Raison d’être for the analytics team

    • Figure out what ads to display on the network
          • Look at user generated content (UGC)
               • Posts
               • Comments and discussions
               • Tags on photos and videos
          • Come up with categories for networks and ads
    • Model network trends and business metrics
    • Predict serving machine growth (poor man’s EC2)
          • Model machine and application data
               • Memory, disk, CPU, network
               • Application logs, counters, etc




First: building the team

    • Data scientist title was not common then; the next best thing was engineers
          • Distributed systems engineers (3) for the infrastructure
          • Statistics and ML engineers (2) for modeling and trending
          • Data visualization engineer (1) for building dashboards to interact
            with the data
          • Systems management engineers (2) for building the machine
            monitoring systems




Second: figuring out where the data is

    • Typical company scenario
          • Data resides in log files
               • Machine or application logs
          • Stored locally
          • Purged after 30 days




Third: where to keep the data

    • Wanted to keep all the historical data
    • In a centralized place
    • Without paying too much money
    • Or using specialized hardware
    • Ruled out a data warehouse (DW)
    • Had experience with systems that looked like Hadoop (or
      Hadoop looked like them)
    • Team wanted to experiment with newer technology
    • -> Data in Hadoop
    • V1: proof of concept (POC)




V1: getting data in

    • Minor changes to store all machine and application logs on an NFS
      drive
          • A couple of retired NetApp filers
    • Log files copied into HDFS using the Hadoop client
    • Data organized by source in a directory hierarchy
    • Grouped by date
    • No preprocessing
    • 3x replication
    • Some latency in moving the data
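
The directory layout described above (by source, then by date) can be sketched as a small path helper. This is only an illustration: the `/logs` root, the `hdfs_log_dir` name, and the year/month/day nesting are assumptions, since the slides do not give the actual paths.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical root directory; the slides do not name the real one.
LOG_ROOT = PurePosixPath("/logs")


def hdfs_log_dir(source: str, day: date) -> str:
    """Return the directory holding one source's logs for one day."""
    return str(LOG_ROOT / source / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}")


print(hdfs_log_dir("apache-access", date(2008, 6, 1)))
```

Keeping the date in the path means a job that needs one week of data can list seven directories instead of scanning everything for a source.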




V1: now what

    • Custom Java map-reduce programs to process the data
    • Support libraries to parse different log file formats
    • Jobs did simple analytics
          • Averages
               • Network response times
               • User engagement
          • Trends per network
               • Active users
               • Pageviews
          • Most common/popular
               • Browsers, pages, queries
          • Indexing
          • Machine utilization
    • Simple scheduler to run jobs
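
As a rough illustration of the "averages" jobs, here is the map-reduce pattern in plain Python rather than the Java used at the time. The tab-separated `network_id<TAB>response_ms` log format and all function names are assumptions made for the sketch.

```python
from collections import defaultdict


def parse_line(line):
    """Parse a hypothetical 'network_id<TAB>response_ms' log line."""
    network_id, response_ms = line.rstrip("\n").split("\t")
    return network_id, float(response_ms)


def map_phase(lines):
    """Emit (network_id, (response_ms, 1)) pairs, one per log line."""
    for line in lines:
        network_id, response_ms = parse_line(line)
        yield network_id, (response_ms, 1)


def reduce_phase(pairs):
    """Sum the partial (total, count) values per key, then divide."""
    totals = defaultdict(lambda: [0.0, 0])
    for key, (total, count) in pairs:
        totals[key][0] += total
        totals[key][1] += count
    return {key: total / count for key, (total, count) in totals.items()}


logs = ["net-a\t120", "net-a\t80", "net-b\t200"]
averages = reduce_phase(map_phase(logs))
print(averages)
```

Emitting (total, count) pairs instead of raw values keeps the reduce step associative, so partial sums can be combined in any order.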

V1: dashboarding

    • Results stored in flat files in HDFS
    • Grouped daily/weekly/monthly
    • Used gnuplot to rebuild the dashboards every hour




What did we learn from V1

    • POC proved viability of Hadoop
    • Latency of pulling files was an issue
    • Most of the metrics computations are of the same nature
    • People need flexibility in defining what is measured
    • Once you put data in front of people, they ask more questions
    • POC shows which areas are a pain, and where to invest to fix




V2: changing data ingestion

    • Use event records instead of log files
    • Pushed through HTTP
    • Built using Thrift
    • Events have
          •    Names
          •    Timestamps
          •    Host
          •    Version
          •    Payloads
    • Published catalog
          • All available events
          • Event parsers
    • Load: ~50 million external page views (~10 events per page)
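
To make the event shape concrete, here is a sketch of an event record and a catalog check. The real events were Thrift structures; a Python dataclass stands in for them here, and the field types, the `CATALOG` layout, and the `page_view` example are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    name: str
    timestamp: int  # assumed: milliseconds since the epoch
    host: str
    version: int
    payload: dict = field(default_factory=dict)


# Hypothetical catalog: event name -> version -> required payload fields.
CATALOG = {"page_view": {1: {"network_id", "url"}}}


def is_valid(event, catalog=CATALOG):
    """True if the event's name and version are catalogued and the
    payload carries every required field."""
    required = catalog.get(event.name, {}).get(event.version)
    return required is not None and required.issubset(event.payload)


good = Event("page_view", 1224374400000, "web01", 1,
             {"network_id": "n42", "url": "/photos"})
bad = Event("page_view", 1224374400000, "web01", 2, {})
print(is_valid(good), is_valid(bad))
```

Carrying a version on every event lets the catalog accept old and new payload layouts side by side while producers are upgraded.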
V2: collectors


    • Receive events
    • Put in a memory queue
    • Background processes store to local disk
    • Check events for validity against catalog
    • Separate into valid/invalid queues
    • Another process pulls the data into HDFS and organizes it in a
      directory hierarchy
           • Events
           • Grouped by date
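
The queue-then-validate flow can be sketched as follows. The sketch skips the local-disk step and the HDFS copy and only shows the split into valid and invalid events; the `KNOWN_EVENTS` set and the dictionary event shape are assumptions.

```python
import queue

# Hypothetical catalog of known event names; the real catalog was richer.
KNOWN_EVENTS = {"page_view", "photo_upload"}


def drain(incoming, known=KNOWN_EVENTS):
    """Empty the in-memory queue into valid and invalid lists."""
    valid, invalid = [], []
    while True:
        try:
            event = incoming.get_nowait()
        except queue.Empty:
            break
        (valid if event.get("name") in known else invalid).append(event)
    return valid, invalid


incoming = queue.Queue()
for name in ("page_view", "bogus", "photo_upload"):
    incoming.put({"name": name, "payload": {}})

valid, invalid = drain(incoming)
print(len(valid), len(invalid))
```

Keeping the invalid events rather than dropping them makes it possible to find out later which producers are sending events that are missing from the catalog.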


V2: computation abstraction

    • Common tasks
          • Projection
               • What fields am I interested in
          • Filtering
               • What records am I interested in
          • Aggregations
               • What do I want to do with the metrics
          • Common readers and writers for data types
    • Captured in libraries that can be composed for complex
      analytics
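
A minimal sketch of how projection, filtering, and aggregation can be written as small composable pieces. The function names and the sample records are invented for illustration and are not the original libraries.

```python
from collections import defaultdict


def project(records, fields):
    """Projection: keep only the named fields of each record."""
    for record in records:
        yield {f: record[f] for f in fields}


def keep(records, predicate):
    """Filtering: keep only the records for which predicate is true."""
    return (r for r in records if predicate(r))


def aggregate(records, key, value, fn):
    """Aggregation: group values by key and reduce each group with fn."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r[value])
    return {k: fn(v) for k, v in groups.items()}


events = [
    {"network": "a", "browser": "firefox", "ms": 120},
    {"network": "a", "browser": "ie", "ms": 80},
    {"network": "b", "browser": "firefox", "ms": 200},
]
firefox = keep(events, lambda r: r["browser"] == "firefox")
slim = project(firefox, ["network", "ms"])
result = aggregate(slim, "network", "ms", max)
print(result)
```

Because each step takes and returns an iterable of records, steps can be chained in any order, which is what makes the pieces reusable across metrics.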




V2: better dashboards

    • Metrics summarized in MySQL databases
    • Interactive dashboards using Ruby/Sinatra
          • Select metrics
          • Time range
          • Aggregation method
    • Plot results using FusionCharts
          • OpenCharts was a close second, but no combined charts
            (Histograms, line charts)
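
The three dashboard controls (metric, time range, aggregation method) map naturally onto a parameterized SQL query over a summary table. The sketch below uses Python's built-in `sqlite3` in place of MySQL so that it runs standalone; the `daily_metrics` table and its columns are assumptions.

```python
import sqlite3

# Aggregation methods are whitelisted because the name is spliced into SQL.
AGGREGATIONS = {"sum": "SUM", "avg": "AVG", "max": "MAX"}


def query_metric(conn, metric, start, end, how):
    """Aggregate one metric over an inclusive ISO-date range."""
    sql = (f"SELECT {AGGREGATIONS[how]}(value) FROM daily_metrics "
           "WHERE metric = ? AND day BETWEEN ? AND ?")
    return conn.execute(sql, (metric, start, end)).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_metrics (metric TEXT, day TEXT, value REAL)")
conn.executemany("INSERT INTO daily_metrics VALUES (?, ?, ?)", [
    ("pageviews", "2009-01-01", 100.0),
    ("pageviews", "2009-01-02", 300.0),
    ("pageviews", "2009-02-01", 999.0),
])

print(query_metric(conn, "pageviews", "2009-01-01", "2009-01-31", "avg"))
```

An unknown aggregation name raises `KeyError` before any SQL is built, so user input never reaches the query text directly.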




What did we learn from V2

    • HDFS I/O is better than the local disk
          • No need for the process that saves locally and then copies to HDFS
    • People loved events
    • Led to event abuse
          • Each feature on the page had an associated event
          • Events were used for performance tuning: how much time did a feature
            take
          • Events were used for monitoring backend features: record errors with
            services
    • A large number of files causes problems for the NameNode
          • Need to coalesce events to reduce the number of files
    • With flexible event types, and interactive dashboards, people have
      more questions
    • We couldn’t keep up with developing custom metrics and charts
          • Needed a self serve query mechanism


V3: ingestion

    • Minor modifications
          • Collectors now write to HDFS
          • Collectors accumulate events to reduce the number of files
    • Self serve UI for defining new events outside of the metrics
      team
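
Accumulating events before writing can be sketched as a small buffering writer. The `BatchingWriter` class, its `flush` callback, and the batch size are hypothetical; a real collector would likely also flush on a timer.

```python
class BatchingWriter:
    """Buffer events and hand them to `flush` in groups of batch_size."""

    def __init__(self, flush, batch_size=1000):
        self._flush = flush  # callable that writes one batch as one file
        self._batch_size = batch_size
        self._buffer = []

    def add(self, event):
        self._buffer.append(event)
        if len(self._buffer) >= self._batch_size:
            self.close()

    def close(self):
        """Flush whatever is buffered, if anything."""
        if self._buffer:
            self._flush(self._buffer)
            self._buffer = []


files = []  # stands in for files written to HDFS
writer = BatchingWriter(files.append, batch_size=3)
for i in range(7):
    writer.add({"id": i})
writer.close()
print([len(f) for f in files])
```

Seven events end up in three files instead of seven, which is the effect that eases the load on the NameNode.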




V3: computation

    • Need a higher level language for query
          • JSON API exposing a search like query syntax
          • {from: 'date', to: 'date', metric: 'x', computation}
    • Computations are encapsulated into libraries and exposed
      through JSON
    • Users can add metrics and computations and build frontends for
      the query language
    • Custom code for ML tasks
          • Cascading for algorithms
          • R for visualization
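
One way to picture the JSON query API is a handler that looks up a named computation in a registry. The slide leaves the shape of `computation` unspecified, so treating it as a string field is an assumption, as are the `METRICS` data and the `run_query` name.

```python
import json
from statistics import mean

# Hypothetical precomputed daily values: metric -> {ISO date: value}.
METRICS = {
    "pageviews": {"2009-01-01": 100, "2009-01-02": 300, "2009-01-03": 200},
}

# Registry of computations exposed through the JSON API.
COMPUTATIONS = {"sum": sum, "avg": mean, "max": max}


def run_query(raw):
    """Evaluate one JSON query against the in-memory metrics."""
    q = json.loads(raw)
    series = METRICS[q["metric"]]
    values = [v for day, v in sorted(series.items())
              if q["from"] <= day <= q["to"]]
    return COMPUTATIONS[q["computation"]](values)


print(run_query('{"from": "2009-01-01", "to": "2009-01-02", '
                '"metric": "pageviews", "computation": "sum"}'))
```

Adding a new computation is then a one-line change to the registry, which is roughly what lets users extend the system without touching the query handler.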




V3: dashboards

    • More intermediate data precomputed
    • Data stored in HBase
    • Dashboards go against HBase
    • Templates for users to build custom dashboards




V3: What did we learn

    • Self serve is the way to go
    • Give people the infrastructure and the support libraries and
      they’ll go to town
    • Some tasks still can’t be done in a framework and need custom
      code
          • Machine learning, with analysis in R
    • ML is hard, even with experience
          • Data is not clean
          • Some content is very small
               • Comments on pictures and videos (workarounds for aggregation)
    • Even then you can build products around the results
          • People and network recommenders
          • Network categories for ads

How would we do it differently today

    • Open source obviates custom code
          •    Scribe for data ingestion
          •    Hive for self serve analytics and business intelligence
          •    Pig scripts subsume most of the Java code
          •    Cascading for Java map-reduce
          •    Dashboards still stay the same




Epilogue

    • ML analysis showed most usage is spam
    • Shut down a lot of pr0n networks and video hosting networks in
      the Far East
    • Team moved to different companies
          • Still in analytics at LinkedIn, Facebook, and Twitter
    • Company changed its business model to paid-only and laid off
      half the staff 6 months later
    • Company acquired recently




Takeaway

    • The problems and solutions are mostly the same everywhere
          • Getting data into Hadoop
          • How do you compute over the data
          • Getting meaningful data out of Hadoop
    • Lots of software components exist to help you with these
    • It is about the balance of what you develop vs what you acquire




Q&A




The Leader in Big Data Intelligence on Hadoop




                                            www.karmasphere.com
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 

Chicago HUG Presentation Oct 2011

  • 1. GENTLE STROLL DOWN THE ANALYTICS MEMORY LANE Abe Taha VP Engineering, Karmasphere Oct 19th, 2011 1 © Karmasphere 2011 All rights reserved
  • 2. What is this talk about • This talk is a story about building an analytics services team at Ning and the experiences and lessons learned • There is also a bit about how I’d do things differently • And like a good story, an ending
  • 3. Caveat Lector • The story has no pictures or conversations • “And what is the use of a book," thought Alice, "without pictures or conversations?” Alice’s Adventures in Wonderland, Lewis Carroll
  • 4. Your storyteller • Mostly scalable distributed systems background • At Yahoo: Search and Social Search • At Google: App infrastructure • At Ning: Hadoop for Analytics and System Management services • At Ask: Dictionary/Reference properties • Now at Karmasphere building analytics applications on Hadoop
  • 5. Prologue • The story begins at Ning • Starting analytics and systems management teams • In 2008 • When Hadoop was gaining popularity • v0.16 was out
  • 6. A bit about Ning • Hot company at the time, co-founded by Andreessen • Allowed users to build websites that look like Facebook • Websites called networks • Networks had social features • Blogs • Photos • Videos • Chat • Social graph • Each network had a major topic/category • Most networks were free, few for pay • Free networks monetized through contextual ads • The theory was that people produce good content that you can monetize
  • 7. Raison d’être for the analytics team • Figure out what ads to display on the network • Look at user generated content (UGC) • Posts • Comments and discussions • Tags on photos and videos • Come up with categories for networks and ads • Model network trends and business metrics • Predict serving machine growth (poor man’s EC2) • Model machine and application data (poor man’s EC2) • Memory, disk, CPU, network • Application logs, counters, etc
  • 8. First: building the team • The data scientist title was not common then; engineers were the next best thing • Distributed systems engineers (3) for the infrastructure • Statistics and ML engineers (2) for modeling and trending • Data visualization engineers (1) for building dashboards to interact with the data • Systems management engineers (2) for building the machine monitoring systems
  • 9. Second: figuring out where the data is • Typical company scenario • Data resides in log files • Machine or application logs • Stored locally • Purged after 30 days
  • 10. Third: where to keep the data • Wanted to keep all the historical data • In a centralized place • Without paying too much money • Or using specialized hardware • Ruled out a data warehouse (DW) • Had experience with systems that looked like Hadoop (or Hadoop looked like them) • Team wanted to experiment with newer technology • -> Data in Hadoop • V1: POC
  • 11. V1: getting data in • Minor changes to store all machine and application logs on an NFS drive • A couple of retired NetApp filers • Log files copied into HDFS using the Hadoop client • Data organized by source in a directory hierarchy • Grouped by date • No preprocessing • 3x replication • Some latency in moving the data
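The "organized by source, grouped by date" layout can be illustrated with a small path builder. This is only a sketch: the `/logs` root, the `source/YYYY/MM/DD` ordering, and the function name are assumptions, since the slide does not give the actual hierarchy.

```python
from datetime import date
from pathlib import PurePosixPath


def hdfs_log_path(source: str, day: date, filename: str, root: str = "/logs") -> str:
    """Build a destination path grouped first by log source, then by date.

    The root directory and the year/month/day split are hypothetical.
    """
    return str(
        PurePosixPath(root) / source / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}" / filename
    )


example_path = hdfs_log_path("app-server", date(2008, 6, 1), "access.log")
print(example_path)
```

A layout like this lets a job select its input by source and date range from the path alone, which matters when the files themselves are not preprocessed.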
  • 12. V1: now what • Custom Java map-reduce programs to process the data • Support libraries to parse different log file formats • Jobs did simple analytics • Averages • Network response times • User engagement • Trends per network • Active users • Pageviews • Most common/popular • Browsers, pages, queries • Indexing • Machine utilization • Simple scheduler to run jobs
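A "most common browsers" job of the kind listed above has the usual map, shuffle, reduce shape. The sketch below simulates that shape in a single Python process rather than in Java on Hadoop, and the tab-separated log format (timestamp, URL, browser) is a made-up assumption, not the format used at Ning.

```python
from collections import defaultdict
from itertools import chain


def map_browser(log_line):
    """Map step: emit (browser, 1) for each well-formed line.

    Assumes a hypothetical tab-separated format: timestamp, url, browser.
    """
    parts = log_line.rstrip("\n").split("\t")
    if len(parts) == 3:
        yield parts[2], 1


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines, mapper, reducer):
    """Group mapper output by key (the shuffle), then reduce each group."""
    shuffled = defaultdict(list)
    for key, value in chain.from_iterable(mapper(line) for line in lines):
        shuffled[key].append(value)
    return dict(reducer(key, values) for key, values in shuffled.items())


sample_lines = [
    "1212300000\t/home\tFirefox",
    "1212300001\t/photos\tIE",
    "1212300002\t/home\tFirefox",
    "malformed line",
]
browser_counts = run_job(sample_lines, map_browser, reduce_counts)
print(browser_counts)
```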
  • 13. V1: dashboarding • Results stored in flat files in HDFS • Grouped daily/weekly/monthly • Use gnuplot to build dashboards every hour
  • 14. What did we learn from V1 • POC proved viability of Hadoop • Latency of pulling files was an issue • Most of the metrics computations are of the same nature • People need flexibility in defining what is measured • Once you put data in front of people, they ask more questions • POC shows which areas are a pain, and where to invest to fix
  • 15. V2: changing data ingestion • Use event records instead of log files • Pushed through HTTP • Built using Thrift • Events have • Names • Timestamps • Host • Version • Payloads • Published catalog • All available events • Event parsers • Load ~50 million external page views (~10 events per page)
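The event fields listed above (name, timestamp, host, version, payload) map naturally onto a small record type. The sketch below uses a plain Python dataclass as a stand-in for the Thrift definition, which the slides do not show; the field types and the sample values are assumptions.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class Event:
    """Stand-in for a Thrift-defined event record (types are assumed)."""

    name: str          # event name, as listed in the published catalog
    timestamp: float   # seconds since the Unix epoch
    host: str          # machine that emitted the event
    version: int       # schema version of the payload
    payload: Dict[str, Any] = field(default_factory=dict)


# Hypothetical page-view event; the payload keys are illustrative only.
page_view = Event(
    name="page_view",
    timestamp=1222819200.0,
    host="web-01",
    version=1,
    payload={"network": "example", "path": "/"},
)
record = asdict(page_view)
print(record)
```

Carrying an explicit version alongside the payload is what lets event parsers keep reading old records after an event's fields change.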
  • 16. V2: collectors • Receive events • Put in a memory queue • Background processes store to local disk • Check events for validity against catalog • Separate into valid/invalid queues • Another process moves data into HDFS and organizes it in a directory hierarchy • Events • Grouped by date
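The validity check against the catalog and the split into valid and invalid queues could look roughly like this. The catalog contents, the rule that a valid event must carry all required payload fields, and the function names are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical catalog: event name -> payload fields the event must carry.
CATALOG = {
    "page_view": {"network", "path"},
    "signup": {"network", "user"},
}


def is_valid(event: dict) -> bool:
    """An event is valid if its name is in the catalog and no required field is missing."""
    required = CATALOG.get(event.get("name"))
    if required is None:
        return False
    return required <= set(event.get("payload", {}))


def drain(incoming: deque, valid: deque, invalid: deque) -> None:
    """Move every queued event into the valid or the invalid queue."""
    while incoming:
        event = incoming.popleft()
        (valid if is_valid(event) else invalid).append(event)


incoming = deque([
    {"name": "page_view", "payload": {"network": "n1", "path": "/"}},
    {"name": "page_view", "payload": {"network": "n1"}},   # missing "path"
    {"name": "unknown_event", "payload": {}},              # not in catalog
])
valid, invalid = deque(), deque()
drain(incoming, valid, invalid)
print(len(valid), len(invalid))
```

Keeping the invalid events in their own queue, rather than dropping them, leaves a trail for debugging producers that emit malformed records.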
  • 17. V2: computation abstraction • Common tasks • Projection • What fields am I interested in • Filtering • What records am I interested in • Aggregations • What do I want to do with the metrics • Common readers and writers for data types • Captured in libraries that can be composed for complex analytics
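Projection, filtering, and aggregation compose well as small functions over a stream of records. The following is a minimal in-memory sketch of that idea, not the library built at Ning; the function names, the record fields, and the 1000 ms cutoff are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def project(records, fields):
    """Projection: keep only the fields of interest."""
    for record in records:
        yield {name: record[name] for name in fields}


def keep(records, predicate):
    """Filtering: keep only the records of interest."""
    return (record for record in records if predicate(record))


def aggregate(records, key, value, reducer):
    """Aggregation: group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    return {k: reducer(v) for k, v in groups.items()}


# Hypothetical response-time records.
records = [
    {"network": "a", "ms": 100, "browser": "Firefox"},
    {"network": "a", "ms": 300, "browser": "IE"},
    {"network": "b", "ms": 50, "browser": "Firefox"},
    {"network": "b", "ms": 5000, "browser": "IE"},  # outlier, filtered out below
]

# Compose the three steps: average response time per network, ignoring outliers.
avg_ms = aggregate(
    keep(project(records, ["network", "ms"]), lambda r: r["ms"] < 1000),
    key="network",
    value="ms",
    reducer=mean,
)
print(avg_ms)
```

Because each step takes and returns an iterable of records, new metrics are mostly a matter of recombining existing pieces with a different predicate or reducer.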
  • 18. V2: better dashboards • Metrics summarized in MySQL databases • Interactive dashboards using Ruby/Sinatra • Select metrics • Time range • Aggregation method • Plot results using FusionCharts • OpenCharts was a close second, but no combined charts (Histograms, line charts)
  • 19. What did we learn from V2 • HDFS I/O is better than the local disk • No need for the process that saves locally then to HDFS • People loved events • Led to event abuse • Each feature on the page had an associated event • Events were used for performance tuning: how much time did a feature take • Events were used for monitoring backend features: record errors with services • A large number of files causes problems for the namenode • Need to coalesce events to reduce the number of files • With flexible event types and interactive dashboards, people have more questions • We couldn’t keep up with developing custom metrics and charts • Needed a self serve query mechanism
  • 20. V3: ingestion • Minor modifications • Collectors now write to HDFS • Collectors accumulate events to reduce file number • Self serve UI for defining new events outside of the metrics team
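Accumulating events before writing is one way to keep the number of files down. The sketch below buffers events and hands them to a writer in batches; the count-based threshold and the injected `write_file` callable are assumptions, and a real collector would write to HDFS and probably also flush on a timer.

```python
class CoalescingWriter:
    """Buffer events and write them out in batches to reduce the file count."""

    def __init__(self, write_file, max_events=1000):
        # write_file is a hypothetical callable that persists one batch as one file.
        self._write_file = write_file
        self._max_events = max_events
        self._buffer = []

    def add(self, event):
        """Queue one event, flushing when the batch is full."""
        self._buffer.append(event)
        if len(self._buffer) >= self._max_events:
            self.flush()

    def flush(self):
        """Write whatever is buffered as a single file, if anything is buffered."""
        if self._buffer:
            self._write_file(list(self._buffer))
            self._buffer.clear()


# Stand-in for the file system: each appended list represents one written file.
files = []
writer = CoalescingWriter(files.append, max_events=3)
for i in range(7):
    writer.add({"name": "page_view", "seq": i})
writer.flush()  # write the final partial batch
print([len(f) for f in files])
```

The trade-off is latency: a larger batch means fewer files for the namenode to track, but events take longer to become visible to jobs.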
  • 21. V3: computation • Need a higher level language for query • JSON API exposing a search like query syntax • {from: 'date', to: 'date', metric: 'x', computation} • Computations are encapsulated into libraries and exposed through JSON • Users can add metrics and computations and build frontends for the query language • Custom code for ML tasks • Cascading for algorithms • R for visualization
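A query of the shape `{from, to, metric, computation}` can be evaluated by looking the computation up in a registry and applying it to the matching values. This is a toy sketch over an in-memory dictionary: the ISO date strings, the computation names, and the `run_query` function are assumptions, as the slide gives only the field names.

```python
import json
from datetime import date
from statistics import mean

# Hypothetical registry of named computations exposed through the query API.
COMPUTATIONS = {"sum": sum, "avg": mean, "max": max}


def run_query(query_json, daily_metrics):
    """Evaluate a {from, to, metric, computation} query over daily values.

    daily_metrics maps (day, metric_name) -> value; this stands in for the
    real metrics store.
    """
    query = json.loads(query_json)
    start = date.fromisoformat(query["from"])
    end = date.fromisoformat(query["to"])
    values = [
        value
        for (day, metric), value in daily_metrics.items()
        if metric == query["metric"] and start <= day <= end
    ]
    return COMPUTATIONS[query["computation"]](values)


daily_metrics = {
    (date(2008, 6, 1), "pageviews"): 100,
    (date(2008, 6, 2), "pageviews"): 150,
    (date(2008, 6, 3), "pageviews"): 50,
    (date(2008, 6, 2), "active_users"): 7,
}
query = '{"from": "2008-06-01", "to": "2008-06-02", "metric": "pageviews", "computation": "sum"}'
total = run_query(query, daily_metrics)
print(total)
```

Adding a computation then means registering one more function, which fits the slide's point that users could extend metrics and computations themselves.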
  • 22. V3: dashboards • More intermediate data precomputed • Data stored in HBase • Dashboards go against HBase • Templates for users to build custom dashboards
  • 23. V3: What did we learn • Self serve is the way to go • Give people the infrastructure and the support libraries and they’ll go to town • Some tasks still can’t be done in a framework and need custom code • Machine learning, with analysis in R • ML is hard, even with experience • Data is not clean • Some content is very small • Comments on pictures and videos (workarounds for aggregation) • Even then you can build products around the results • People and network recommenders • Network categories for ads
  • 24. How would we do it differently today • Open source obviates custom code • Scribe for data ingestion • Hive for self serve analytics and business intelligence • Pig scripts subsume most of the Java code • Cascading for Java map-reduce • Dashboards still stay the same
  • 25. Epilogue • ML analysis showed most usage is spam • Shut down a lot of pr0n networks and video hosting networks in far east Asia • Team moved to different companies • Still in analytics at LinkedIn, Facebook, and Twitter • Company changed business model to for pay only and laid off half the staff 6 months later • Company acquired recently
  • 26. Takeaway • The problems and solutions are mostly the same everywhere • Getting data into Hadoop • How do you compute over the data • Getting meaningful data out of Hadoop • Lots of software components exist to help you with these • It is about the balance of what you develop vs what you acquire
  • 27. Q&A
  • 28. The Leader in Big Data Intelligence on Hadoop www.karmasphere.com