How Salesforce.com Uses Hadoop
Some Data Science Use Cases
Narayan Bharadwaj             Jed Crosby
salesforce.com                salesforce.com
    @nadubharadwaj                @JedCrosby
Safe Harbor
 Safe harbor statement under the Private Securities Litigation Reform Act of 1995:

 This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties
 materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results
 expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be
 deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other
 financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any
 statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services.

 The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new
 functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our
 operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of
 intellectual property and other litigation, risks associated with possible mergers and acquisitions, the immature market in which we
 operate, our relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new
 releases of our service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization
 and selling to larger enterprise customers. Further information on potential factors that could affect the financial results of
 salesforce.com, inc. is included in our quarterly report on Form 10-Q for the most recent fiscal quarter ended July 31, 2012. This
 document and others containing important disclosures are available on the SEC Filings section of the Investor Information section of
 our Web site.

 Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently
 available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based
 upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-
 looking statements.
Agenda

  • Technology
  • Hadoop use cases
  • Use case discussion
     • Product Metrics
     • User Behavior Analysis
     • Collaborative Filtering
  • Q&A
             Every time you see the elephant, we will attempt to explain a
             Hadoop-related concept.
Got “Cloud Data”?




            130k customers      800 million transactions/day
            Millions of users   Terabytes/day
Technology
Hadoop Overview

 - Started by Doug Cutting at Yahoo!
 - Based on two Google papers
    Google File System (GFS): http://research.google.com/archive/gfs.html
    Google MapReduce: http://research.google.com/archive/mapreduce.html
 - Hadoop is an open source Apache project
    Hadoop Distributed File System (HDFS)
    Distributed Processing Framework (MapReduce)
 - Several related projects
    HBase, Hive, Pig, Flume, ZooKeeper, Mahout, Oozie, HCatalog
Our Hadoop Ecosystem




   Apache Pig
Contributions
    @pRaShAnT1784 : Prashant Kommireddi




   Lars Hofhansl              @thefutureian : Ian Varley
Use Cases
Hadoop Use Cases
   • Product Metrics
   • User behavior analysis
   • Capacity planning
   • Monitoring intelligence
   • Collections
   • Query Runtime Prediction
   • Early Warning System
   • Collaborative Filtering
   • Search Relevancy

   Legend: Internal App / Product feature
Product Metrics
Product Metrics – Problem Statement


   Track feature usage/adoption across 130k+ customers
     E.g.: Accounts, Contacts, Visualforce, Apex,…


   Track standard metrics across all features
     E.g.: #Requests, #UniqueOrgs, #UniqueUsers, AvgResponseTime,…


   Track features and metrics across all channels
     API, UI, Mobile


   Primary audience: Executives, Product Managers
Data Pipeline

   [Diagram labels: Feature (What?), Feature Metadata (Instrumentation); Crunch it (How?), Storage & Processing; Daily Summary (Output), Fancy UI (Visualize)]
Product Metrics Pipeline

   [Diagram components: User Input (Page Layout); Workflow; Feature Metrics (Custom Object); API; Client Machine (Java Program, Pig script generator); Workflow; Hadoop; Log Pull; Log Files; API; Trend Metrics (Custom Object); Formula Fields; Reports, Dashboards]
Feature Metrics (Custom Object)

Id      Feature Name     PM      Instrumentation   Metric1     Metric2     Metric3      Metric4   Status
F0001   Accounts         John    /001              #requests   #UniqOrgs   #UniqUsers   AvgRT     Dev
F0002   Contacts         Nancy   /003              #requests   #UniqOrgs   #UniqUsers   AvgRT     Review
F0003   API              Eric    A                 #requests   #UniqOrgs   #UniqUsers   AvgRT     Deployed
F0004   Visualforce      Roger   V                 #requests   #UniqOrgs   #UniqUsers   AvgRT     Decom
F0005   Apex             Kim     axapx             #requests   #UniqOrgs   #UniqUsers   AvgRT     Deployed
F0006   Custom Objects   Chun    /aXX              #requests   #UniqOrgs   #UniqUsers   AvgRT     Deployed
F0008   Chatter          Jed     chcmd             #requests   #UniqOrgs   #UniqUsers   AvgRT     Deployed
F0009   Reports          Steve   R                 #requests   #UniqOrgs   #UniqUsers   AvgRT     Deployed
Feature Metrics (Custom Object)
User Input (Page Layout)
   [Screenshot callouts: Formula Field, Workflow Rule]
User Input (Child Custom Object)




   [Screenshot callout: Child Objects]
Apache Pig
Basic Pig Script Construct
        -- Define UDFs
        DEFINE GFV GetFieldValue('/path/to/udf/file');
        -- Load data
        A = LOAD '/path/to/cloud/data/log/files' USING PigStorage();
        -- Filter data (keep one log record type)
        B = FILTER A BY GFV(*, 'logRecordType') == 'U';
        -- Extract fields (aliases let later steps refer to C.orgId)
        C = FOREACH B GENERATE GFV(*, 'orgId') AS orgId, GFV(*, 'userId') AS userId ……..
        -- Group
        G = GROUP C BY ……
        -- Compute output metrics (a nested FOREACH must end with GENERATE)
        O = FOREACH G {
                orgs = C.orgId; uniqueOrgs = DISTINCT orgs;
                GENERATE group, COUNT(uniqueOrgs);
            }
        -- Store or Dump results
        STORE O INTO '/path/to/user/output';
Java Pig Script Generator (Client)
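The generator slide survives only as a title, so the following is a hypothetical sketch of the idea rather than the actual Java client: it reads one Feature Metrics record and fills in a Pig script template. The template, the function name `generate_pig_script`, the dictionary keys, and the assumption that the Instrumentation value is matched against `logRecordType` are all illustrative and not taken from the deck.

```python
# Hypothetical template: the statements follow the "Basic Pig Script
# Construct" slide, but the exact script the real generator emits is unknown.
PIG_TEMPLATE = """\
DEFINE GFV GetFieldValue('{udf_path}');
A = LOAD '{log_path}' USING PigStorage();
B = FILTER A BY GFV(*, 'logRecordType') == '{instrumentation}';
C = FOREACH B GENERATE GFV(*, 'orgId') AS orgId, GFV(*, 'userId') AS userId;
G = GROUP C ALL;
O = FOREACH G {{
    orgs = C.orgId; uniqueOrgs = DISTINCT orgs;
    users = C.userId; uniqueUsers = DISTINCT users;
    GENERATE '{feature_id}', COUNT(C), COUNT(uniqueOrgs), COUNT(uniqueUsers);
}}
STORE O INTO '{output_path}/{feature_id}';
"""


def generate_pig_script(feature, udf_path, log_path, output_path):
    """Fill the template from one Feature Metrics record (a plain dict here)."""
    return PIG_TEMPLATE.format(
        udf_path=udf_path,
        log_path=log_path,
        output_path=output_path,
        feature_id=feature["id"],
        instrumentation=feature["instrumentation"],
    )


# Row F0003 (API, instrumentation "A") from the Feature Metrics table;
# the paths are the placeholders used on the Pig slide.
example = generate_pig_script(
    {"id": "F0003", "instrumentation": "A"},
    udf_path="/path/to/udf/file",
    log_path="/path/to/cloud/data/log/files",
    output_path="/path/to/user/output",
)
print(example)
```

Keeping the script as a template means a product manager only edits the custom object; the client regenerates the Pig code on the next run.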
Trend Metrics (Custom Object)



 Id       Date          #Requests   #Unique Orgs   #Unique Users   Avg ResponseTime
 F0001    06/01/2012       <big>      <big>           <big>            <little>
 F0002    06/01/2012       <big>      <big>           <big>            <little>
 F0003    06/01/2012       <big>      <big>           <big>            <little>
 F0001    06/02/2012       <big>      <big>           <big>            <little>
 F0002    06/02/2012       <big>      <big>           <big>            <little>
 F0003    06/03/2012       <big>      <big>           <big>            <little>
Upload to Trend Metrics (Custom Object)
Visualization (Reports & Dashboards)
Collaborate, Iterate (Chatter)
Recap

   [Diagram components: User Input (Page Layout); Workflow; Feature Metrics (Custom Object); API; Client Machine (Java Program, Pig script generator); Workflow; Hadoop; Log Pull; Log Files; API; Trend Metrics (Custom Object); Formula Fields; Reports, Dashboards]
User Behavior Analysis
Problem Statement

 How do we reduce the number of clicks on the user interface?
 We need to understand the top user click paths. What are users typically trying to do?
 What are the user clusters/personas?


Approach:
•   Markov transition for click path, D3.js visuals
•   K-means (unsupervised) clustering for user groups
Markov Transitions for "Setup" Pages
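A minimal sketch of how first-order Markov transition probabilities could be estimated from click paths. The page names and sample paths are made up for illustration; the deck does not show the actual data or implementation.

```python
from collections import Counter, defaultdict


def transition_probabilities(click_paths):
    """Estimate P(next page | current page) from observed click paths."""
    counts = defaultdict(Counter)
    for path in click_paths:
        # Count every consecutive (current, next) pair in the path.
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    probs = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        probs[src] = {dst: n / total for dst, n in dsts.items()}
    return probs


# Hypothetical "Setup" click paths, one list per user session.
paths = [
    ["Setup Home", "Users", "Profiles"],
    ["Setup Home", "Users", "Roles"],
    ["Setup Home", "Objects"],
    ["Setup Home", "Users", "Profiles"],
]
P = transition_probabilities(paths)
print(P["Setup Home"])
```

The resulting nested dictionary is the kind of edge-weighted graph that a D3.js visualization could draw.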
K-means clustering of "Setup" Pages
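A small, dependency-free sketch of k-means on per-user feature vectors. The two features (share of clicks on two made-up Setup areas), the sample points, and the choice to seed centroids with the first k points are assumptions for illustration only.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means; centroids start at the first k points (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical users: (share of clicks on user management, share on customization).
users = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
assignment, centroids = kmeans(users, k=2)
print(assignment, centroids)
```

On this toy input the first two users end up in one cluster and the last two in the other, which is the persona split one would hope to see.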
Collaborative Filtering

        Jed Crosby
Collaborative Filtering – Problem Statement


   Show similar files within an organization
      Content-based approach
      Community-based approach
Popular File
Related File
We found this relationship using item-to-item collaborative
filtering



   Amazon published this algorithm in 2003.
      Amazon.com Recommendations: Item-to-Item Collaborative Filtering, by
       Gregory Linden, Brent Smith, and Jeremy York. IEEE Internet Computing,
       January-February 2003.


   At Salesforce, we adapted this algorithm for Hadoop, and we use
   it to recommend files to view and users to follow.
Example: CF on 5 files
   [Figure: five files: Annual Report, Vision Statement, Dilbert Comic, Darth Vader Cartoon, Disk Usage Report]
View History Table

                 Annual   Vision      Dilbert   Darth Vader   Disk Usage
                 Report   Statement   Cartoon   Cartoon       Report
 Miranda (CEO)        1        1           1         0             0
 Bob (CFO)            1        1           1         0             0
 Susan (Sales)        0        1           1         1             0
 Chun (Sales)         0        0           1         1             0
 Alice (IT)           0        0           1         1             1
Relationships Between the Files



   [Diagram: the five files as nodes of a graph: Annual Report, Vision Statement, Dilbert Cartoon, Darth Vader Cartoon, Disk Usage Report]
Relationships Between the Files


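   [Diagram: the same graph, with each edge labeled by the number of users who viewed both files]

   Annual Report / Vision Statement: 2
   Annual Report / Dilbert Cartoon: 2
   Annual Report / Darth Vader Cartoon: 0
   Annual Report / Disk Usage Report: 0
   Vision Statement / Dilbert Cartoon: 3
   Vision Statement / Darth Vader Cartoon: 1
   Vision Statement / Disk Usage Report: 0
   Dilbert Cartoon / Darth Vader Cartoon: 3
   Dilbert Cartoon / Disk Usage Report: 1
   Darth Vader Cartoon / Disk Usage Report: 1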
Sorted Relationships for Each File



Annual Report       Vision Statement    Dilbert Cartoon     Darth Vader Cartoon   Disk Usage Report
Dilbert (2)         Dilbert (3)         Vision Stmt. (3)    Dilbert (3)           Dilbert (1)
Vision Stmt. (2)    Annual Rpt. (2)     Darth Vader (3)     Vision Stmt. (1)      Darth Vader (1)
                    Darth Vader (1)     Annual Rpt. (2)     Disk Usage (1)
                                        Disk Usage (1)

               The popularity problem: notice that Dilbert appears first in every list. This is
               probably not what we want.


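               The solution: normalize each relationship tally by the popularities of the two files
               (divide by the square root of their product, as in the cosine formula in the appendix).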
Normalized Relationships Between the Files


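   [Diagram: the same graph, with each edge labeled by the normalized relationship score]

   Annual Report / Vision Statement: .82
   Annual Report / Dilbert Cartoon: .63
   Annual Report / Darth Vader Cartoon: 0
   Annual Report / Disk Usage Report: 0
   Vision Statement / Dilbert Cartoon: .77
   Vision Statement / Darth Vader Cartoon: .33
   Vision Statement / Disk Usage Report: 0
   Dilbert Cartoon / Darth Vader Cartoon: .77
   Dilbert Cartoon / Disk Usage Report: .45
   Darth Vader Cartoon / Disk Usage Report: .58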
Sorted relationships for each file, normalized by file popularities




Annual Report         Vision Statement       Dilbert Cartoon        Darth Vader Cartoon    Disk Usage Report
Vision Stmt. (.82)    Annual Report (.82)    Darth Vader (.77)      Dilbert (.77)          Darth Vader (.58)
Dilbert (.63)         Dilbert (.77)          Vision Stmt. (.77)     Disk Usage (.58)       Dilbert (.45)
                      Darth Vader (.33)      Annual Report (.63)    Vision Stmt. (.33)
                                             Disk Usage (.45)




                High relationship tallies AND similar popularity values now drive closeness.
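The normalized lists can be reproduced from the view history table with a few lines of in-memory Python. This is a sketch for checking the numbers, not the Hadoop implementation; the variable names and the alphabetical tie-break are assumptions.

```python
from itertools import combinations
from math import sqrt

# View history table from the slides: user -> files viewed.
views = {
    "Miranda": ["Annual Report", "Vision Statement", "Dilbert"],
    "Bob": ["Annual Report", "Vision Statement", "Dilbert"],
    "Susan": ["Vision Statement", "Dilbert", "Darth Vader"],
    "Chun": ["Dilbert", "Darth Vader"],
    "Alice": ["Dilbert", "Darth Vader", "Disk Usage"],
}

popularity = {}  # file -> number of viewers
tally = {}       # (file_a, file_b) -> number of users who viewed both
for files in views.values():
    for f in files:
        popularity[f] = popularity.get(f, 0) + 1
    # Sorting the pair keeps (a, b) and (b, a) from being counted separately.
    for a, b in combinations(sorted(files), 2):
        tally[(a, b)] = tally.get((a, b), 0) + 1

similar = {f: [] for f in popularity}
for (a, b), n in tally.items():
    score = round(n / sqrt(popularity[a] * popularity[b]), 2)
    similar[a].append((b, score))
    similar[b].append((a, score))
for f in similar:
    # Highest score first; ties broken alphabetically (an assumption).
    similar[f].sort(key=lambda item: (-item[1], item[0]))

for f, ranked in similar.items():
    print(f, ranked)
```

Dilbert no longer tops every list: for the Annual Report, the Vision Statement (.82) now outranks Dilbert (.63).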
The Item-to-Item CF Algorithm




  1) Compute file popularities
  2) Compute relationship tallies and divide by file popularities
  3) Sort and store the results
MapReduce Overview
    Map                           Shuffle                       Reduce




          (adapted from http://code.google.com/p/mapreduce-framework/wiki/MapReduce)
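To make the three phases concrete, here is a toy single-process imitation of map, shuffle, and reduce, demonstrated with a word count. The helper `map_reduce` is hypothetical and only mimics the data flow; it is not the Hadoop API.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over records, group values by key, then run reducer per key."""
    groups = defaultdict(list)
    for record in records:                 # Map: emit (key, value) pairs
        for key, value in mapper(record):
            groups[key].append(value)      # Shuffle: collect values by key
    output = []
    for key in sorted(groups):             # Reduce: one call per key
        output.extend(reducer(key, groups[key]))
    return output


def word_mapper(line):
    for word in line.split():
        yield word, 1


def count_reducer(word, ones):
    yield word, sum(ones)


counts = map_reduce(["hadoop pig hadoop", "pig hive"], word_mapper, count_reducer)
print(counts)
```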
1. Compute File Popularities

                                           <user, file>


                                                       Inverse identity map


                                        <file, List<user>>

                                                        Reduce


                                        <file, (user count)>

       Result is a table of (file, popularity) pairs that you store in the Hadoop distributed cache.
Example: File popularity for Dilbert



  (Miranda, Dilbert), (Bob, Dilbert), (Susan, Dilbert), (Chun, Dilbert), (Alice, Dilbert)


                                                      Inverse identity map



                          <Dilbert, {Miranda, Bob, Susan, Chun, Alice}>


                                                      Reduce


                                             (Dilbert, 5)
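The popularity job can be imitated in memory as follows. This is a sketch of the data flow only; the function name is an assumption, and counting distinct users (so repeat views do not inflate popularity) is also an assumption, since the slides just say "user count".

```python
from collections import defaultdict

# <user, file> pairs from the view history table.
view_history = [
    ("Miranda", "Annual Report"), ("Miranda", "Vision Statement"), ("Miranda", "Dilbert"),
    ("Bob", "Annual Report"), ("Bob", "Vision Statement"), ("Bob", "Dilbert"),
    ("Susan", "Vision Statement"), ("Susan", "Dilbert"), ("Susan", "Darth Vader"),
    ("Chun", "Dilbert"), ("Chun", "Darth Vader"),
    ("Alice", "Dilbert"), ("Alice", "Darth Vader"), ("Alice", "Disk Usage"),
]


def file_popularities(pairs):
    users_by_file = defaultdict(list)
    for user, file in pairs:
        # Inverse identity map: swap the pair so the file becomes the key.
        users_by_file[file].append(user)
    # Reduce: count the distinct users collected for each file.
    return {file: len(set(users)) for file, users in users_by_file.items()}


popularity = file_popularities(view_history)
print(popularity["Dilbert"])
```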
2a. Compute Relationship Tallies − Find All Relationships in View History Table


                                       <user, file>

                                                   Identity map

                                     <user, List<file>>

                                                   Reduce


                               <(file1, file2), Integer(1)>,
                               <(file1, file3), Integer(1)>,
                               …
                               <(file(n-1), file(n)), Integer(1)>


           Relationships have their file IDs in alphabetical order to avoid double counting.
Example 2a: Miranda’s (CEO) File Relationship Votes




        (Miranda, Annual Report), (Miranda, Vision Statement), (Miranda, Dilbert)

                                                Identity map

                 <Miranda, {Annual Report, Vision Statement, Dilbert}>

                                                 Reduce


                       <(Annual Report, Dilbert), Integer(1)>,
                       <(Annual Report, Vision Statement), Integer(1)>,
                       <(Dilbert, Vision Statement), Integer(1)>
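The reduce step of 2a can be sketched as a function that takes one user's file list and emits a vote for every unordered pair. The function name is hypothetical; sorting the files first implements the alphabetical ordering mentioned above.

```python
from itertools import combinations


def relationship_votes(user_files):
    """Emit <(file_a, file_b), 1> for every pair of files one user viewed."""
    ordered = sorted(set(user_files))  # alphabetical order avoids double counting
    return [((a, b), 1) for a, b in combinations(ordered, 2)]


miranda_votes = relationship_votes(["Annual Report", "Vision Statement", "Dilbert"])
for vote in miranda_votes:
    print(vote)
```

A user who viewed n files produces n(n-1)/2 votes, which is where the quadratic cost of the algorithm comes from.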
2b. Tally the Relationship Votes − Just a Word Count, Where Each
Relationship Occurrence is a Word

                                <(file1, file2), Integer(1)>

                                                    Identity map


                              <(file1, file2), List<Integer(1)>>


                                                    Reduce: count and divide
                                                    by popularities

            <file1, (file2, similarity score)>, <file2, (file1, similarity score)>


                   Note that we emit each result twice,
                   once for each file that belongs to the relationship.
Example 2b: the Dilbert/Darth Vader Relationship



                            <(Dilbert, Vader), Integer(1)>,
                            <(Dilbert, Vader), Integer(1)>,
                            <(Dilbert, Vader), Integer(1)>

                                                 Identity map


                              <(Dilbert, Vader), {1, 1, 1}>


                                                 Reduce: count and divide
                                                 by popularities


               <Dilbert, (Vader, sqrt(3/5))>, <Vader, (Dilbert, sqrt(3/5))>
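A sketch of the 2b reducer for a single relationship: sum the votes, divide by the square root of the product of the two popularities, and emit the score once per file. The function name is hypothetical, and the popularity dictionary stands in for the lookup table kept in the distributed cache.

```python
from math import isclose, sqrt

# Popularities from step 1 (Dilbert: 5 viewers, Vader: 3 viewers).
popularity = {"Dilbert": 5, "Vader": 3}


def score_relationship(pair, votes, popularity):
    """Reduce one relationship's votes into two <file, (other, score)> records."""
    a, b = pair
    score = sum(votes) / sqrt(popularity[a] * popularity[b])
    return [(a, (b, score)), (b, (a, score))]


records = score_relationship(("Dilbert", "Vader"), [1, 1, 1], popularity)
print(records)
print(isclose(records[0][1][1], sqrt(3 / 5)))
```

Here 3 / sqrt(5 * 3) equals sqrt(3/5), roughly 0.77, matching the normalized graph.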
3. Sort and Store Results


                            <file1, (file2, similarity score)>

                                                  Identity map


                       <file1, List<(file2, similarity score)>>

                                                   Reduce


                             <file1, {top n similar files}>




                    Store the results in your location of choice
Example 3: Sorting the Results for Dilbert

                                 <Dilbert, (Annual Report, .63)>,
                                 <Dilbert, (Vision Statement, .77)>,
                                 <Dilbert, (Disk Usage, .45)>,
                                 <Dilbert, (Darth Vader, .77)>

                                                      Identity map


   <Dilbert, {(Annual Report, .63), (Vision Statement, .77), (Disk Usage, .45), (Darth Vader, .77)}>

                                                      Reduce


                     <Dilbert, {Darth Vader, Vision Statement}> (Top 2 files)



                                           Store results
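The final reducer is a per-file top-n selection. A minimal sketch, assuming ties are broken alphabetically (the slides do not say how ties are resolved) and using a hypothetical function name:

```python
def top_n_similar(scored_files, n):
    """Return the names of the n highest-scoring files."""
    ranked = sorted(scored_files, key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:n]]


dilbert_scores = [
    ("Annual Report", 0.63),
    ("Vision Statement", 0.77),
    ("Disk Usage", 0.45),
    ("Darth Vader", 0.77),
]
top2 = top_n_similar(dilbert_scores, 2)
print(top2)
```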
Appendix



  Cosine formula and normalization trick to avoid the distributed
  cache
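                \cos\theta_{AB} = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{A}{\lVert A \rVert} \cdot \frac{B}{\lVert B \rVert}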

  Mahout has CF
  Asymptotic order of the algorithm is O(M*N^2) in worst case, but is
  helped by sparsity.
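One plausible reading of the normalization trick (an interpretation, since the slide gives only the formula): if each view is weighted by 1/sqrt(popularity) before pairs are formed, then summing the products of the weights gives the cosine directly, and the reducer that tallies votes no longer needs a popularity lookup. The sketch below checks that both routes agree on the example data.

```python
from itertools import combinations
from math import isclose, sqrt

views = {
    "Miranda": ["Annual Report", "Vision Statement", "Dilbert"],
    "Bob": ["Annual Report", "Vision Statement", "Dilbert"],
    "Susan": ["Vision Statement", "Dilbert", "Darth Vader"],
    "Chun": ["Dilbert", "Darth Vader"],
    "Alice": ["Dilbert", "Darth Vader", "Disk Usage"],
}

popularity = {}
for files in views.values():
    for f in files:
        popularity[f] = popularity.get(f, 0) + 1

# Route 1: tally co-views, then divide by sqrt(pop_a * pop_b).
tally = {}
# Route 2: each vote already carries the normalized weights of both files.
weight = {f: 1 / sqrt(p) for f, p in popularity.items()}
cosine = {}
for files in views.values():
    for a, b in combinations(sorted(files), 2):
        tally[(a, b)] = tally.get((a, b), 0) + 1
        cosine[(a, b)] = cosine.get((a, b), 0.0) + weight[a] * weight[b]

for pair, n in tally.items():
    a, b = pair
    assert isclose(cosine[pair], n / sqrt(popularity[a] * popularity[b]))
print(cosine[("Darth Vader", "Dilbert")])
```

In a real pipeline the weights would have to be attached to the `<user, file>` records by an earlier job; how the original implementation did this is not shown in the deck.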
Narayan Bharadwaj           Jed Crosby
Director, Product Management   Data Scientist
      @nadubharadwaj           @JedCrosby
Dreamforce_2012_Hadoop_Use_Cases

Mais conteúdo relacionado

Semelhante a Dreamforce_2012_Hadoop_Use_Cases

How Salesforce.com uses Hadoop
How Salesforce.com uses HadoopHow Salesforce.com uses Hadoop
How Salesforce.com uses HadoopNarayan Bharadwaj
 
Emakina Academy 6 - Boost your intranet - Web Content Management for SAP
Emakina Academy 6 - Boost your intranet - Web Content Management for SAPEmakina Academy 6 - Boost your intranet - Web Content Management for SAP
Emakina Academy 6 - Boost your intranet - Web Content Management for SAPEmakina
 
The SENSORIA Development Environment
The SENSORIA Development EnvironmentThe SENSORIA Development Environment
The SENSORIA Development EnvironmentIstvan Rath
 
The Web Development Eco-system with VSTS, ASP.NET 2.0 & Microsoft Ajax
The Web Development Eco-system with VSTS, ASP.NET 2.0 & Microsoft AjaxThe Web Development Eco-system with VSTS, ASP.NET 2.0 & Microsoft Ajax
The Web Development Eco-system with VSTS, ASP.NET 2.0 & Microsoft AjaxDarren Sim
 
SIMPDA 2011 - An Open Source Platform for Business Process Mining
SIMPDA 2011 - An Open Source Platform for Business Process Mining SIMPDA 2011 - An Open Source Platform for Business Process Mining
SIMPDA 2011 - An Open Source Platform for Business Process Mining SpagoWorld
 
Webinar - An Open Source Platform for Business Process Mining
Webinar - An Open Source Platform for Business Process MiningWebinar - An Open Source Platform for Business Process Mining
Webinar - An Open Source Platform for Business Process MiningSpagoWorld
 
Java micro-services
Java micro-servicesJava micro-services
Java micro-servicesJames Lewis
 
[.Net Juniors Academy] Introdução ao Cloud Computing e Windows Azure Platform
[.Net Juniors Academy] Introdução ao Cloud Computing e Windows Azure Platform[.Net Juniors Academy] Introdução ao Cloud Computing e Windows Azure Platform
[.Net Juniors Academy] Introdução ao Cloud Computing e Windows Azure PlatformVitor Tomaz
 
Abap web dynpro
Abap   web dynproAbap   web dynpro
Abap web dynpromanojdhir
 
Abap web dynpro
Abap   web dynproAbap   web dynpro
Abap web dynpromanojdhir
 
Webinar: SpagoBI Suite
Webinar: SpagoBI SuiteWebinar: SpagoBI Suite
Webinar: SpagoBI SuiteSpagoWorld
 
Study of solution development methodology for small size projects.
Study of solution development methodology for small size projects.Study of solution development methodology for small size projects.
Study of solution development methodology for small size projects.Joon ho Park
 
6.Live Framework 和Mesh Services
6.Live Framework 和Mesh Services6.Live Framework 和Mesh Services
6.Live Framework 和Mesh ServicesGaryYoung
 
Why EPM Live? EPM Live Overview and Demo
Why EPM Live? EPM Live Overview and DemoWhy EPM Live? EPM Live Overview and Demo
Why EPM Live? EPM Live Overview and DemoEPM Live
 
Enterprise search in SharePoint 2013 - Sydney 15th of January 2013
Enterprise search in SharePoint 2013 - Sydney 15th of January 2013Enterprise search in SharePoint 2013 - Sydney 15th of January 2013
Enterprise search in SharePoint 2013 - Sydney 15th of January 2013Findwise
 
Track and Trace Solution Details
Track and Trace Solution DetailsTrack and Trace Solution Details
Track and Trace Solution DetailsPropix Technologies
 
Science Modernisation Strategy v1 0
Science  Modernisation  Strategy v1 0Science  Modernisation  Strategy v1 0
Science Modernisation Strategy v1 0Salim Sheikh
 

Semelhante a Dreamforce_2012_Hadoop_Use_Cases (20)

How Salesforce.com uses Hadoop
How Salesforce.com uses HadoopHow Salesforce.com uses Hadoop
How Salesforce.com uses Hadoop
 
Emakina Academy 6 - Boost your intranet - Web Content Management for SAP
Emakina Academy 6 - Boost your intranet - Web Content Management for SAPEmakina Academy 6 - Boost your intranet - Web Content Management for SAP
Emakina Academy 6 - Boost your intranet - Web Content Management for SAP
 
The SENSORIA Development Environment
The SENSORIA Development EnvironmentThe SENSORIA Development Environment
The SENSORIA Development Environment
 
The Web Development Eco-system with VSTS, ASP.NET 2.0 & Microsoft Ajax
The Web Development Eco-system with VSTS, ASP.NET 2.0 & Microsoft AjaxThe Web Development Eco-system with VSTS, ASP.NET 2.0 & Microsoft Ajax
The Web Development Eco-system with VSTS, ASP.NET 2.0 & Microsoft Ajax
 
SIMPDA 2011 - An Open Source Platform for Business Process Mining
SIMPDA 2011 - An Open Source Platform for Business Process Mining SIMPDA 2011 - An Open Source Platform for Business Process Mining
SIMPDA 2011 - An Open Source Platform for Business Process Mining
 
Webinar - An Open Source Platform for Business Process Mining
Webinar - An Open Source Platform for Business Process MiningWebinar - An Open Source Platform for Business Process Mining
Webinar - An Open Source Platform for Business Process Mining
 
Java micro-services
Java micro-servicesJava micro-services
Java micro-services
 
[.Net Juniors Academy] Introdução ao Cloud Computing e Windows Azure Platform
[.Net Juniors Academy] Introdução ao Cloud Computing e Windows Azure Platform[.Net Juniors Academy] Introdução ao Cloud Computing e Windows Azure Platform
[.Net Juniors Academy] Introdução ao Cloud Computing e Windows Azure Platform
 
Fosdem2010 Faban
Fosdem2010 FabanFosdem2010 Faban
Fosdem2010 Faban
 
Sap overview
Sap overviewSap overview
Sap overview
 
Sap overview
Sap overviewSap overview
Sap overview
 
Abap web dynpro
Abap   web dynproAbap   web dynpro
Abap web dynpro
 
Abap web dynpro
Abap   web dynproAbap   web dynpro
Abap web dynpro
 
Webinar: SpagoBI Suite
Webinar: SpagoBI SuiteWebinar: SpagoBI Suite
Webinar: SpagoBI Suite
 
Study of solution development methodology for small size projects.
Study of solution development methodology for small size projects.Study of solution development methodology for small size projects.
Study of solution development methodology for small size projects.
 
6.Live Framework 和Mesh Services
6.Live Framework 和Mesh Services6.Live Framework 和Mesh Services
6.Live Framework 和Mesh Services
 
Why EPM Live? EPM Live Overview and Demo
Why EPM Live? EPM Live Overview and DemoWhy EPM Live? EPM Live Overview and Demo
Why EPM Live? EPM Live Overview and Demo
 
Enterprise search in SharePoint 2013 - Sydney 15th of January 2013
Enterprise search in SharePoint 2013 - Sydney 15th of January 2013Enterprise search in SharePoint 2013 - Sydney 15th of January 2013
Enterprise search in SharePoint 2013 - Sydney 15th of January 2013
 
Track and Trace Solution Details
Track and Trace Solution DetailsTrack and Trace Solution Details
Track and Trace Solution Details
 
Science Modernisation Strategy v1 0
Science  Modernisation  Strategy v1 0Science  Modernisation  Strategy v1 0
Science Modernisation Strategy v1 0
 

Dreamforce_2012_Hadoop_Use_Cases

  • 1. How Salesforce.com Uses Hadoop Some Data Science Use Cases Narayan Bharadwaj Jed Crosby salesforce.com salesforce.com @nadubharadwaj @JedCrosby
  • 2. Safe Harbor Safe harbor statement under the Private Securities Litigation Reform Act of 1995: This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services. The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of intellectual property and other litigation, risks associated with possible mergers and acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization and selling to larger enterprise customers. Further information on potential factors that could affect the financial results of salesforce.com, inc. is included in our annual report on Form 10-Q for the most recent fiscal quarter ended July 31, 2012. 
These documents and others containing important disclosures are available on the SEC Filings section of the Investor Information section of our Web site. Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-looking statements.
  • 3. Agenda • Technology • Hadoop use cases • Use case discussion: Product Metrics, User Behavior Analysis, Collaborative Filtering • Q&A. Every time you see the elephant, we will attempt to explain a Hadoop-related concept.
  • 4. Got “Cloud Data”? 130k customers; 800 million transactions/day; millions of users; terabytes/day.
  • 6. Hadoop Overview
      - Started by Doug Cutting at Yahoo!
      - Based on two Google papers
          - Google File System (GFS): http://research.google.com/archive/gfs.html
          - Google MapReduce: http://research.google.com/archive/mapreduce.html
      - Hadoop is an open source Apache project
          - Hadoop Distributed File System (HDFS)
          - Distributed Processing Framework (MapReduce)
      - Several related projects
          - HBase, Hive, Pig, Flume, ZooKeeper, Mahout, Oozie, HCatalog
  • 7. Our Hadoop Ecosystem Apache Pig
  • 8. Contributions: Prashant Kommireddi (@pRaShAnT1784), Lars Hofhansl, Ian Varley (@thefutureian)
  • 10. Hadoop Use Cases: User behavior analysis; Product Metrics; Capacity planning; Monitoring intelligence; Query Runtime Prediction; Collections; Early Warning System; Collaborative Filtering; Search Relevancy. The slide's legend marks each use case as either an Internal App or a Product feature.
  • 12. Product Metrics – Problem Statement
      - Track feature usage/adoption across 130k+ customers (e.g., Accounts, Contacts, Visualforce, Apex, …)
      - Track standard metrics across all features (e.g., #Requests, #UniqueOrgs, #UniqueUsers, AvgResponseTime, …)
      - Track features and metrics across all channels (API, UI, Mobile)
      - Primary audience: Executives, Product Managers
  • 13. Data Pipeline (diagram labels): Feature (What?); Feature Metadata (Instrumentation); Crunch it (How?); Storage & Processing; Daily Summary (Output); Fancy UI (Visualize)
  • 14. Product Metrics Pipeline (diagram labels): User Input (Page Layout); Formula Fields; Workflow; Feature Metrics (Custom Object); Trend Metrics (Custom Object); Reports, Dashboards; API; Client Machine; Java Program; Pig script generator; Workflow; Log Pull; Hadoop; Log Files
  • 15. Feature Metrics (Custom Object)
      Id    | Feature Name   | PM    | Instrumentation | Metric1   | Metric2   | Metric3    | Metric4 | Status
      F0001 | Accounts       | John  | /001            | #requests | #UniqOrgs | #UniqUsers | AvgRT   | Dev
      F0002 | Contacts       | Nancy | /003            | #requests | #UniqOrgs | #UniqUsers | AvgRT   | Review
      F0003 | API            | Eric  | A               | #requests | #UniqOrgs | #UniqUsers | AvgRT   | Deployed
      F0004 | Visualforce    | Roger | V               | #requests | #UniqOrgs | #UniqUsers | AvgRT   | Decom
      F0005 | Apex           | Kim   | axapx           | #requests | #UniqOrgs | #UniqUsers | AvgRT   | Deployed
      F0006 | Custom Objects | Chun  | /aXX            | #requests | #UniqOrgs | #UniqUsers | AvgRT   | Deployed
      F0008 | Chatter        | Jed   | chcmd           | #requests | #UniqOrgs | #UniqUsers | AvgRT   | Deployed
      F0009 | Reports        | Steve | R               | #requests | #UniqOrgs | #UniqUsers | AvgRT   | Deployed
  • 17. User Input (Page Layout) Formula Field Workflow Rule
  • 18. User Input (Child Custom Object) Child Objects
  • 20. Basic Pig Script Construct
      -- Define UDFs
      DEFINE GFV GetFieldValue('/path/to/udf/file');
      -- Load data
      A = LOAD '/path/to/cloud/data/log/files' USING PigStorage();
      -- Filter data
      B = FILTER A BY GFV(row, 'logRecordType') == 'U';
      -- Extract fields
      C = FOREACH B GENERATE GFV(*, 'orgId'), GFV(*, 'userId') ……..
      -- Group
      G = GROUP C BY ……
      -- Compute output metrics (a nested FOREACH must end with GENERATE)
      O = FOREACH G { orgs = C.orgId; uniqueOrgs = DISTINCT orgs; GENERATE group, COUNT(uniqueOrgs); }
      -- Store or Dump results
      STORE O INTO '/path/to/user/output';
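As a rough Python analogue of the Pig flow above (filter, extract, group, count distinct), here is a minimal sketch over a few made-up log records. The record layout, the `feature` grouping key, the `rt_ms` field, and all sample values are assumptions for illustration; the slides elide the real field list and grouping key.

```python
from collections import defaultdict

# Hypothetical log records; the real log schema is not shown in the slides.
logs = [
    {"logRecordType": "U", "feature": "Accounts", "orgId": "org1", "userId": "u1", "rt_ms": 100},
    {"logRecordType": "U", "feature": "Accounts", "orgId": "org1", "userId": "u2", "rt_ms": 200},
    {"logRecordType": "U", "feature": "Accounts", "orgId": "org2", "userId": "u3", "rt_ms": 300},
    {"logRecordType": "U", "feature": "Contacts", "orgId": "org1", "userId": "u1", "rt_ms": 50},
    {"logRecordType": "A", "feature": "Accounts", "orgId": "org3", "userId": "u9", "rt_ms": 999},
]

def feature_metrics(records):
    """Keep 'U' records, group by feature, and compute the four standard metrics."""
    groups = defaultdict(list)
    for r in records:
        if r["logRecordType"] == "U":          # FILTER
            groups[r["feature"]].append(r)     # GROUP
    out = {}
    for feature, rows in groups.items():
        out[feature] = {
            "requests": len(rows),
            "unique_orgs": len({r["orgId"] for r in rows}),    # DISTINCT + COUNT
            "unique_users": len({r["userId"] for r in rows}),
            "avg_rt_ms": sum(r["rt_ms"] for r in rows) / len(rows),
        }
    return out

metrics = feature_metrics(logs)
```

Each entry of `metrics` corresponds to one row of a daily summary for one feature, which is the kind of record the Trend Metrics object stores.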
  • 21. Java Pig Script Generator (Client)
  • 22. Trend Metrics (Custom Object)
      Id    | Date       | #Requests | #Unique Orgs | #Unique Users | Avg ResponseTime
      F0001 | 06/01/2012 | <big>     | <big>        | <big>         | <little>
      F0002 | 06/01/2012 | <big>     | <big>        | <big>         | <little>
      F0003 | 06/01/2012 | <big>     | <big>        | <big>         | <little>
      F0001 | 06/02/2012 | <big>     | <big>        | <big>         | <little>
      F0002 | 06/02/2012 | <big>     | <big>        | <big>         | <little>
      F0003 | 06/03/2012 | <big>     | <big>        | <big>         | <little>
  • 23. Upload to Trend Metrics (Custom Object)
  • 27. Recap (diagram labels): User Input (Page Layout); Formula Fields; Workflow; Feature Metrics (Custom Object); Trend Metrics (Custom Object); Reports, Dashboards; API; Client Machine; Java Program; Pig script generator; Workflow; Log Pull; Hadoop; Log Files
  • 29. Problem Statement
      - How do we reduce the number of clicks on the user interface?
      - We need to understand the top user click paths. What are users typically trying to do?
      - What are the user clusters/personas?
      Approach:
      - Markov transitions for click paths, visualized with D3.js
      - K-means (unsupervised) clustering for user groups
  • 30. Markov Transitions for "Setup" Pages
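The slides show only the resulting visualization, so the following is a minimal sketch of how first-order Markov transition probabilities could be estimated from click paths. The session data and page names are invented for illustration and are not taken from the talk.

```python
from collections import Counter, defaultdict

# Hypothetical click paths: one list of page names per user session.
sessions = [
    ["Home", "Setup", "Users", "Profiles"],
    ["Home", "Setup", "Users"],
    ["Home", "Setup", "Fields"],
]

def transition_probabilities(paths):
    """Estimate P(next page | current page) from consecutive page pairs."""
    counts = defaultdict(Counter)
    for path in paths:
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }

P = transition_probabilities(sessions)
```

Here `P["Setup"]` assigns probability 2/3 to "Users" and 1/3 to "Fields"; high-probability chains are candidates for shortcuts that reduce clicks.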
  • 31. K-means clustering of "Setup" Pages
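The clustering step can be sketched with plain Lloyd's algorithm. This is an illustrative toy, not the implementation used in the talk: the two features (visits to user-management pages and to customization pages), the sample points, and the starting centroids are all assumptions.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm with caller-supplied starting centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(axis) / len(cl) for axis in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    labels = [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]
    return centroids, labels

# Hypothetical per-user features: (user-management page visits, customization page visits).
users = [(9, 1), (8, 2), (10, 0), (1, 9), (2, 8), (0, 10)]
centroids, labels = kmeans(users, centroids=[users[0], users[3]])
```

With these inputs the first three users fall into one cluster and the last three into another, which is the kind of grouping that could be read as two personas.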
  • 33. Collaborative Filtering – Problem Statement: Show similar files within an organization
      - Content-based approach
      - Community-based approach
  • 36. We found this relationship using item-to-item collaborative filtering. Amazon published this algorithm in 2003: Amazon.com Recommendations: Item-to-Item Collaborative Filtering, by Gregory Linden, Brent Smith, and Jeremy York. IEEE Internet Computing, January-February 2003. At Salesforce, we adapted this algorithm for Hadoop, and we use it to recommend files to view and users to follow.
  • 37. Example: CF on 5 files: Vision Statement, Annual Report, Dilbert Comic, Darth Vader Cartoon, Disk Usage Report
  • 38. View History Table
      User           | Annual Report | Vision Statement | Dilbert Cartoon | Darth Vader Cartoon | Disk Usage Report
      Miranda (CEO)  | 1             | 1                | 1               | 0                   | 0
      Bob (CFO)      | 1             | 1                | 1               | 0                   | 0
      Susan (Sales)  | 0             | 1                | 1               | 1                   | 0
      Chun (Sales)   | 0             | 0                | 1               | 1                   | 0
      Alice (IT)     | 0             | 0                | 1               | 1                   | 1
  • 39. Relationships Between the Files (graph nodes): Annual Report, Vision Statement, Darth Vader Cartoon, Dilbert Cartoon, Disk Usage Report
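  • 40. Relationships Between the Files (graph edges labeled with the number of users who viewed both files, per the view history table)
      Annual Report / Vision Statement: 2
      Annual Report / Dilbert Cartoon: 2
      Annual Report / Darth Vader Cartoon: 0
      Annual Report / Disk Usage Report: 0
      Vision Statement / Dilbert Cartoon: 3
      Vision Statement / Darth Vader Cartoon: 1
      Vision Statement / Disk Usage Report: 0
      Dilbert Cartoon / Darth Vader Cartoon: 3
      Dilbert Cartoon / Disk Usage Report: 1
      Darth Vader Cartoon / Disk Usage Report: 1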
  • 41. Sorted Relationships for Each File
      Annual Report: Dilbert (2), Vision Stmt. (2)
      Vision Statement: Dilbert (3), Annual Rpt. (2), Darth Vader (1)
      Dilbert Cartoon: Vision Stmt. (3), Darth Vader (3), Annual Rpt. (2), Disk Usage (1)
      Darth Vader Cartoon: Dilbert (3), Vision Stmt. (1), Disk Usage (1)
      Disk Usage Report: Dilbert (1), Darth Vader (1)
      The popularity problem: notice that Dilbert appears first in every list. This is probably not what we want. The solution: divide the relationship tallies by file popularities.
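  • 42. Normalized Relationships Between the Files (graph edges labeled with the tally divided by the square root of the product of the two file popularities)
      Annual Report / Vision Statement: .82
      Annual Report / Dilbert Cartoon: .63
      Annual Report / Darth Vader Cartoon: 0
      Annual Report / Disk Usage Report: 0
      Vision Statement / Dilbert Cartoon: .77
      Vision Statement / Darth Vader Cartoon: .33
      Vision Statement / Disk Usage Report: 0
      Dilbert Cartoon / Darth Vader Cartoon: .77
      Dilbert Cartoon / Disk Usage Report: .45
      Darth Vader Cartoon / Disk Usage Report: .58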
  • 43. Sorted relationships for each file, normalized by file popularities
      Annual Report: Vision Stmt. (.82), Dilbert (.63)
      Vision Statement: Annual Report (.82), Dilbert (.77), Darth Vader (.33)
      Dilbert Cartoon: Darth Vader (.77), Vision Stmt. (.77), Annual Report (.63), Disk Usage (.45)
      Darth Vader Cartoon: Dilbert (.77), Disk Usage (.58), Vision Stmt. (.33)
      Disk Usage Report: Darth Vader (.58), Dilbert (.45)
      High relationship tallies AND similar popularity values now drive closeness.
  • 44. The Item-to-Item CF Algorithm
      1) Compute file popularities
      2) Compute relationship tallies and divide each by the square root of the product of the two file popularities (the cosine normalization in the appendix)
      3) Sort and store the results
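The three steps can be checked end to end on the five-file example with a small in-memory sketch. It is not the Hadoop implementation described in the talk; the alphabetical tie-break in the final sort is an assumption, since the slides do not say how ties are ordered.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# View history from the example: user -> set of files viewed.
views = {
    "Miranda": {"Annual Report", "Vision Statement", "Dilbert"},
    "Bob": {"Annual Report", "Vision Statement", "Dilbert"},
    "Susan": {"Vision Statement", "Dilbert", "Darth Vader"},
    "Chun": {"Dilbert", "Darth Vader"},
    "Alice": {"Dilbert", "Darth Vader", "Disk Usage"},
}

# 1) File popularities: how many users viewed each file.
popularity = Counter(f for files in views.values() for f in files)

# 2) Relationship tallies (co-views), keyed by alphabetically ordered pairs.
tallies = Counter()
for files in views.values():
    for a, b in combinations(sorted(files), 2):
        tallies[(a, b)] += 1

# Normalize each tally and record it under both files of the pair.
similar = {f: [] for f in popularity}
for (a, b), n in tallies.items():
    score = n / sqrt(popularity[a] * popularity[b])
    similar[a].append((b, score))
    similar[b].append((a, score))

# 3) Sort each list by descending score (ties alphabetical: an assumption).
for f in similar:
    similar[f].sort(key=lambda pair: (-pair[1], pair[0]))
```

Rounding the scores to two decimals gives the values in the normalized table, for example .82 for Annual Report and Vision Statement.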
  • 45. MapReduce Overview Map Shuffle Reduce (adapted from http://code.google.com/p/mapreduce-framework/wiki/MapReduce)
  • 46. 1. Compute File Popularities
      <user, file>
      -> Inverse identity map ->
      <file, List<user>>
      -> Reduce ->
      <file, (user count)>
      Result is a table of (file, popularity) pairs that you store in the Hadoop distributed cache.
  • 47. Example: File popularity for Dilbert
      (Miranda, Dilbert), (Bob, Dilbert), (Susan, Dilbert), (Chun, Dilbert), (Alice, Dilbert)
      -> Inverse identity map ->
      <Dilbert, {Miranda, Bob, Susan, Chun, Alice}>
      -> Reduce ->
      (Dilbert, 5)
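To make the map, shuffle, and reduce stages of step 1 concrete, here is a single-process sketch. `run_mapreduce` is a hypothetical stand-in written for this example, not a Hadoop API, and the extra Disk Usage record is added only so the output has more than one key.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Tiny single-process stand-in for map -> shuffle -> reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)        # shuffle: group values by key
    return [out for key in sorted(shuffled) for out in reducer(key, shuffled[key])]

def inverse_identity_map(record):
    user, file_id = record
    yield file_id, user                        # swap key and value

def count_users(file_id, users):
    yield file_id, len(users)                  # popularity = number of viewers

view_log = [
    ("Miranda", "Dilbert"), ("Bob", "Dilbert"), ("Susan", "Dilbert"),
    ("Chun", "Dilbert"), ("Alice", "Dilbert"), ("Alice", "Disk Usage"),
]

popularities = dict(run_mapreduce(view_log, inverse_identity_map, count_users))
```

The resulting dictionary plays the role of the (file, popularity) table that the slides place in the distributed cache.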
  • 48. 2a. Compute Relationship Tallies − Find All Relationships in View History Table
      <user, file>
      -> Identity map ->
      <user, List<file>>
      -> Reduce ->
      <(file1, file2), Integer(1)>, <(file1, file3), Integer(1)>, … <(file(n-1), file(n)), Integer(1)>
      Relationships have their file IDs in alphabetical order to avoid double counting.
  • 49. Example 2a: Miranda’s (CEO) File Relationship Votes
      (Miranda, Annual Report), (Miranda, Vision Statement), (Miranda, Dilbert)
      -> Identity map ->
      <Miranda, {Annual Report, Vision Statement, Dilbert}>
      -> Reduce ->
      <(Annual Report, Dilbert), Integer(1)>, <(Annual Report, Vision Statement), Integer(1)>, <(Dilbert, Vision Statement), Integer(1)>
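The step 2a reducer can be sketched as a generator that emits one vote per unordered pair. The function name is hypothetical; sorting before pairing is what keeps file IDs in alphabetical order.

```python
from itertools import combinations

def emit_relationship_votes(user, files):
    """Reducer sketch for step 2a: one vote per unordered pair of files a user viewed."""
    # Sorting first means each pair is always emitted as (smaller, larger),
    # so the same relationship never appears under two different keys.
    for pair in combinations(sorted(set(files)), 2):
        yield pair, 1

votes = list(emit_relationship_votes(
    "Miranda", ["Vision Statement", "Annual Report", "Dilbert"]))
```

For Miranda this yields the same three pairs as the example slide, regardless of the order in which her views arrive.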
  • 50. 2b. Tally the Relationship Votes − Just a Word Count, Where Each Relationship Occurrence is a Word
      <(file1, file2), Integer(1)>
      -> Identity map ->
      <(file1, file2), List<Integer(1)>>
      -> Reduce: count and divide by popularities ->
      <file1, (file2, similarity score)>, <file2, (file1, similarity score)>
      Note that we emit each result twice, once for each file that belongs to a relationship.
  • 51. Example 2b: the Dilbert/Darth Vader Relationship
      <(Dilbert, Vader), Integer(1)>, <(Dilbert, Vader), Integer(1)>, <(Dilbert, Vader), Integer(1)>
      -> Identity map ->
      <(Dilbert, Vader), {1, 1, 1}>
      -> Reduce: count and divide by popularities ->
      <Dilbert, (Vader, sqrt(3/5))>, <Vader, (Dilbert, sqrt(3/5))>
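A sketch of the step 2b reducer follows. The function name and the way the popularity table is passed in as a dictionary are assumptions; in the talk the table comes from the distributed cache.

```python
from math import sqrt

def tally_and_normalize(pair, votes, popularity):
    """Reducer sketch for step 2b: count votes, normalize, and emit once per file."""
    file1, file2 = pair
    score = sum(votes) / sqrt(popularity[file1] * popularity[file2])
    yield file1, (file2, score)
    yield file2, (file1, score)

# Popularities from step 1, using the counts in the example view history.
popularity = {"Dilbert": 5, "Vader": 3}
results = list(tally_and_normalize(("Dilbert", "Vader"), [1, 1, 1], popularity))
```

The score is 3 / sqrt(5 * 3), which equals the sqrt(3/5) shown in the example.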
  • 52. 3. Sort and Store Results
      <file1, (file2, similarity score)>
      -> Identity map ->
      <file1, List<(file2, similarity score)>>
      -> Reduce ->
      <file1, {top n similar files}>
      Store the results in your location of choice.
  • 53. Example 3: Sorting the Results for Dilbert
      <Dilbert, (Annual Report, .63)>, <Dilbert, (Vision Statement, .77)>, <Dilbert, (Disk Usage, .45)>, <Dilbert, (Darth Vader, .77)>
      -> Identity map ->
      <Dilbert, {(Annual Report, .63), (Vision Statement, .77), (Disk Usage, .45), (Darth Vader, .77)}>
      -> Reduce ->
      <Dilbert, {Darth Vader, Vision Statement}> (Top 2 files)
      Store results
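The final reducer only has to rank one file's neighbours and keep the best n. In this sketch the function name and the alphabetical tie-break are assumptions; the slides do not specify how equal scores are ordered.

```python
def top_n_similar(file_id, scored_files, n=2):
    """Reducer sketch for step 3: keep the n highest-scoring neighbours of one file."""
    # Sort by descending score; break ties alphabetically (an assumption).
    ranked = sorted(scored_files, key=lambda fs: (-fs[1], fs[0]))
    return file_id, [name for name, _ in ranked[:n]]

result = top_n_similar("Dilbert", [
    ("Annual Report", 0.63),
    ("Vision Statement", 0.77),
    ("Disk Usage", 0.45),
    ("Darth Vader", 0.77),
])
```

With the example scores this returns Darth Vader and Vision Statement as the top two files for Dilbert.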
  • 54. Appendix: Cosine formula and normalization trick to avoid the distributed cache: $\cos\theta_{AB} = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{A}{\|A\|} \cdot \frac{B}{\|B\|}$. Mahout has CF. Asymptotic order of the algorithm is O(M*N^2) in the worst case, but is helped by sparsity.
  • 55. Narayan Bharadwaj, Director, Product Management (@nadubharadwaj); Jed Crosby, Data Scientist (@JedCrosby)
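As a worked instance of the cosine formula, take the Dilbert and Darth Vader columns of the view history table as binary vectors over the users (Miranda, Bob, Susan, Chun, Alice). The second line is one possible reading of the normalization trick, stated here as an assumption: if each view is scaled by the inverse norm of its file beforehand, the pair tallies sum directly to the cosine, so the reducer no longer needs the popularity table.

```latex
\[
A = (1,1,1,1,1), \qquad B = (0,0,1,1,1), \qquad
\cos\theta_{AB} = \frac{A \cdot B}{\|A\|\,\|B\|}
                = \frac{3}{\sqrt{5}\,\sqrt{3}}
                = \sqrt{\tfrac{3}{5}} \approx 0.77
\]
\[
\frac{A}{\|A\|} \cdot \frac{B}{\|B\|}
  = \sum_{u} \frac{A_u}{\sqrt{5}} \cdot \frac{B_u}{\sqrt{3}}
  = 3 \cdot \frac{1}{\sqrt{15}}
  = \sqrt{\tfrac{3}{5}}
\]
```

For binary view vectors the norm of a file's vector is the square root of its popularity, which is why dividing the tally by the square root of the popularity product gives the same number.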

Editor's Notes

  1. Google File System is a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
  2. Workflow rectangle – move to the other side
  3. Custom objects are custom database tables that allow you to store information unique to your organization.
  4. WSC tool to consume the enterprise WSDL to put summary data back into Trend Metrics
  5. Workflow rectangle – move to the other side