Data Platform
Vipul Sharma – vipul@eventbrite.com
A social event ticketing and discovery platform
$1B total sales
68M tickets sold
1.4M events hosted
0.5M organizers served
23M attendees served
12 countries
Event Lifecycle (cycle diagram): Conception, Creation, Discovery, Sale, Organization, Post Event
Frictionless is the mantra!
Data Platform and Discovery
• Discovery
   • Search
   • Recommendation
   • Social
• Analytics
   • Data warehouse and Metrics
   • Internal and External reporting
   • Real Time and Batch Analytics
• Abuse Prevention
   • Spam
   • Fraud
   • TOS
Analytics




• Ad hoc queries by analysts
Fraud and Spam
Data Platform
Hadoop Cluster


•   30 persistent EC2 High-Memory Instances
•   30TB disk with replication factor of 2, ext3 formatted
•   CDH3
•   Fair Scheduler
•   HBase
Infrastructure

• Search
   • Solr
   • Incremental updates, moving toward event-driven
• Recommendation/Graph
   • Hadoop
   • Native Java MapReduce
   • Bash for workflow
• Social
   • Cassandra
   • Denormalized view
• Persistence
   •   MySQL
   •   HDFS
   •   HBase
   •   MongoDB (Moving to Cassandra)
Infrastructure


• Stream
   • RabbitMQ
   • Internal firehose
   • Storm
• Offline
   •   MapReduce
   •   Streaming
   •   Hive
   •   Hue
Discovery
Social, Interest, Local
Attendees


 Events




Organizers
Categorization - Prism




Example categories: Tech, Conference, Music, Sports
Prism - Features



•   Supervised Learning
•   Logistic Regression using MLE
•   Pairwise classification into 20 categories
•   High precision, lower recall
•   Use MapReduce for feature extraction
•   Use for clustering as well
Prism – Training Data


• Binary classification for each category
• Training data needed for positive and negative
   • Conference and not Conference
   • Sports and not Sports
• Samasource and Crowdflower
• Stem words to create initial set
• Positive, negative, negative with stem words
Prism - Features


• Convert Event and Organizer data into a feature vector
• Event details, Organizer details, Ticket details
• Boolean representation of predefined attributes
   •   Words – tf-idf, dictionaries
   •   Phrases
   •   Domains
   •   Rules – regular expression
   •   Functions – business logic e.g. ticket price between $10-$20
   •   Compounds – boolean combination of features using && and || rules
            – <COMPOUND1>:techcrunch && disrupt && techcrunch.com
            – <COMPOUND2>:COMPOUND2 && after && party
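The feature types above can be sketched roughly as follows. This is only an illustration, not Prism's code: the function name, the feature keys, the year rule and the sample event are all made up here; only the ticket price range and the two compounds echo the slide.

```python
import re

def extract_features(title, description, ticket_price):
    """Return boolean features for one event (feature names are illustrative)."""
    text = f"{title} {description}".lower()
    words = set(re.findall(r"[a-z0-9]+", text))
    feats = {
        # Words: presence of a dictionary term
        "word:techcrunch": "techcrunch" in words,
        "word:disrupt": "disrupt" in words,
        "word:after": "after" in words,
        "word:party": "party" in words,
        # Domains: a known domain appears anywhere in the text
        "domain:techcrunch.com": "techcrunch.com" in text,
        # Rules: a regular expression (hypothetical rule: mentions a year)
        "rule:has_year": re.search(r"\b(19|20)\d{2}\b", text) is not None,
        # Functions: business logic, ticket price between $10 and $20
        "func:price_10_20": 10 <= ticket_price <= 20,
    }
    # Compounds: boolean combinations of the features above
    feats["compound:tc_disrupt"] = (
        feats["word:techcrunch"]
        and feats["word:disrupt"]
        and feats["domain:techcrunch.com"]
    )
    feats["compound:after_party"] = feats["word:after"] and feats["word:party"]
    return feats

features = extract_features(
    "TechCrunch Disrupt After Party", "Details at techcrunch.com", 15
)
```

Keeping every feature boolean means compounds are plain `and`/`or` expressions over features that are already computed.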
Prism - Features

• Each feature is represented in various contexts
   • Event Title, Event Description, Organizer Title, Organizer Description
• Each feature has meta info – Termclass
   • <LANG_EN>, <CONF_LANG_EN>,<ADULT_LANG_EN>
   • <SPORTS_LANG_EN>:<EVENT_TITLE>ball
• Feature vector is represented as a sparse vector

+1 391158:1 401814:1 410526:1 411489:1 411606:2 413910:1 427659:1 438369:1 449735:1 449736:2 455478:1 456741:1 463188:1
693|||||warrior spirit's 3rd annual fundraising auction|||||1:<DESC>again,1:<NAME>annual,1:<DESC>annual,2:<DESC>approaching,2:<NAME>auction,4:<DESC>auction,2:<DESC>auctions,2:<DESC>bring
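The first sample line above looks like the common `label index:value` sparse format. A minimal parser, assuming that reading (the function name is hypothetical):

```python
def parse_sparse_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {index: value})."""
    tokens = line.split()
    label = int(tokens[0])          # "+1" parses as 1, "-1" as -1
    vector = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        vector[int(index)] = float(value)
    return label, vector

label, vector = parse_sparse_line("+1 391158:1 401814:1 411606:2")
```

Only non-zero entries are stored, so a vector over hundreds of thousands of feature ids stays small.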
Prism - Training


•   Binary classifier
•   Multiclass was less accurate
•   Each event gets classified into 20 categories
•   MapReduce for creating sparse matrix
•   MapReduce for batch classification
    • Distributed cache for feature set and models
• We can use the same sparse matrix for clustering
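One binary classifier per category can be sketched in plain Python as below. This is a toy stand-in, not the production trainer: it fits logistic regression by stochastic gradient ascent on the log-likelihood, and the feature ids, training examples, learning rate and 0.8 threshold are all invented for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_binary(examples, epochs=500, lr=0.5):
    """Fit one logistic regression model.
    examples: list of (label in {0, 1}, {feature_index: value})."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for label, vec in examples:
            z = bias + sum(weights.get(i, 0.0) * v for i, v in vec.items())
            err = label - sigmoid(z)       # gradient of the log-likelihood
            bias += lr * err
            for i, v in vec.items():
                weights[i] = weights.get(i, 0.0) + lr * err * v
    return weights, bias

def score(model, vec):
    weights, bias = model
    return sigmoid(bias + sum(weights.get(i, 0.0) * v for i, v in vec.items()))

def classify(models, vec, threshold=0.8):
    """Return every category whose own binary model clears the threshold."""
    return sorted(c for c, m in models.items() if score(m, vec) >= threshold)

# Toy sparse features: 1 = "conference", 2 = "ball", 3 = "music"
models = {
    "Conference": train_binary(
        [(1, {1: 1}), (1, {1: 1, 3: 1}), (0, {2: 1}), (0, {3: 1})]
    ),
    "Sports": train_binary([(1, {2: 1}), (0, {1: 1}), (0, {3: 1})]),
}
labels = classify(models, {1: 1})
```

A high threshold trades recall for precision, and an event may end up with zero, one or several categories.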
Attendee


•   What are your interests? - Prism
•   Who are your friends? – Explicit and Implicit
•   What are the interests of your friends? - Prism
•   Which of your friends share your interests? – IBG
•   Location of users and events
    •   Location of purchased events
    •   Facebook location
    •   Our database
    •   Other signals – IP, mobile app, etc.
You will like to attend this event
Recommendation Engines



• Item Hierarchy (You bought camera so you need batteries - Amazon)
• Collaborative Filtering – User-User Similarity (People who bought camera also bought batteries - Amazon)
• Collaborative Filtering – Item-Item Similarity (You like Godfather so you will like Scarface - Netflix)
• Social Graph Based (Your friends like Lady Gaga so you will like Lady Gaga, PYMK – Facebook, LinkedIn)
• Interest Graph Based (Your friends who like rock music like you are attending Eric Clapton Event – Eventbrite)
Why Interest?




• Events are Social
• Events are Interest
• Dense Graph is Irrelevant
• Interests are Changing
How do we know your Interest?


• We ask you
• Based on your activity
   • Events Attended
   • Events Browsed (In Future)
• Facebook Interests
   • User Interest has to match Event category
   • Static
• Prism
Model Based vs Clustering

            Item-Item vs User-User

     Building Social Graph is Clustering Step

Social Graph Recommendation is a Ranking Problem
Implicit Social Graph


                                 U1


                            E1        E4

                  U2                       U3


             E2        E3

        U4                       U5
Mixed Social Graph


                                U1


                           E1

                 U2                  U3


            E2        E3
                                          FB
       U4                       U5
                                          LI
23M * 260 * 260 = 1.5 Trillion Edges
               6 Billion edges ranked
   Each node is a feature vector representing a User

Each edge is a feature vector representing a Relationship
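The edge count above can be checked with a few lines. The variable names are made up, and reading 260 as an average per-user fan-out applied twice is an assumption; the slide does not say what the factor stands for.

```python
users = 23_000_000        # attendees served
fanout = 260              # assumed meaning: average fan-out per user
candidate_edges = users * fanout * fanout   # 23M * 260 * 260
ranked_edges = 6_000_000_000

print(f"{candidate_edges:,} candidate edges")
print(f"{ranked_edges / candidate_edges:.2%} of them ranked")
```

The product is about 1.55 trillion, so the 6 billion ranked edges are well under 1% of the candidates.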
Feature Generation

•   Mixed Features
•   A series of map-reduce jobs
•   Output on HDFS in flat files; Input to subsequent jobs
•   Orders = Event → Attendees
    • MAP: eid: uid
    • REDUCE: eid:[uid]
• Attendees → Social Graph
    • Input: eid:[uid]
    • MAP: uid_i:[uid]
    • REDUCE: uid:[neighbors]
• Interest based features, user specific, graph mining, etc.
• Upload feature values to HBase
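The two jobs above can be mimicked in memory to show the data flow. This is a sketch, not the Hadoop code: the function names are hypothetical and the order list is invented, loosely following the implicit social graph slide.

```python
from collections import defaultdict
from itertools import permutations

def orders_to_attendees(orders):
    """Job 1. MAP emits (eid, uid); REDUCE groups into eid -> {uid}."""
    attendees = defaultdict(set)
    for uid, eid in orders:
        attendees[eid].add(uid)
    return attendees

def attendees_to_graph(attendees):
    """Job 2. MAP emits (uid_i, uid_j) for co-attendees; REDUCE groups into uid -> {neighbors}."""
    neighbors = defaultdict(set)
    for uids in attendees.values():
        for a, b in permutations(uids, 2):
            neighbors[a].add(b)
    return neighbors

# Made-up orders as (uid, eid) pairs
orders = [
    ("U1", "E1"), ("U2", "E1"),
    ("U1", "E4"), ("U3", "E4"),
    ("U2", "E2"), ("U4", "E2"),
    ("U2", "E3"), ("U5", "E3"),
]
graph = attendees_to_graph(orders_to_attendees(orders))
```

Here `graph["U1"]` is `{"U2", "U3"}`: two users become neighbors when they bought tickets to the same event.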
HBase



• Why HBase?
   • Processing 6B edges means looking up features for each node and each
     edge
   • At 1,000 lookups/sec: 6B / 1000 / 86400 ≈ 70 days!!
   • At 1M lookups/sec: 6B / 1M = 6,000 sec ≈ 1.7 hrs
   • Processing 1.3 TB of data with mapreduce
• Collect data from multiple Map Reduce jobs
   • Stores entire social graph
   • Features for each node and edge
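A back-of-the-envelope version of the lookup arithmetic above, assuming one feature lookup per edge at a steady rate (the function name is made up):

```python
def lookup_time_seconds(edges, lookups_per_second):
    """Time to look up features for every edge at a constant rate."""
    return edges / lookups_per_second

EDGES = 6_000_000_000

slow_days = lookup_time_seconds(EDGES, 1_000) / 86_400       # about 69.4 days
fast_hours = lookup_time_seconds(EDGES, 1_000_000) / 3_600   # about 1.67 hours
```

A thousand-fold gain in lookup throughput turns a job of roughly ten weeks into one that finishes in well under two hours.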
Data Model


Rowkey    U                  UU
uid1      f1   f2   f3       uid2:f4   uid2:f5   uid3:f4

rowid     neighbors   events   featureX
2718282   101         3        0.3678795

rowid     314159:n   314159:e   314159:fx   161803:n   161803:e   161803:fx
2718282   31         1          0.3183      83         2          0.618
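The layout above can be modelled with nested dictionaries to show how edge features fit into one wide row. This assumes `U` and `UU` are column families and that edge columns are named `<neighbor uid>:<feature>`; the helper name is hypothetical and no HBase client is involved.

```python
# One row per user: family "U" holds node features,
# family "UU" holds per-neighbor (edge) features.
table = {
    "2718282": {
        "U": {"neighbors": 101, "events": 3, "featureX": 0.3678795},
        "UU": {
            "314159:n": 31, "314159:e": 1, "314159:fx": 0.3183,
            "161803:n": 83, "161803:e": 2, "161803:fx": 0.618,
        },
    }
}

def edge_features(row, neighbor_uid):
    """Collect the UU columns that belong to one neighbor."""
    prefix = f"{neighbor_uid}:"
    return {
        qualifier[len(prefix):]: value
        for qualifier, value in row["UU"].items()
        if qualifier.startswith(prefix)
    }

edge = edge_features(table["2718282"], "161803")
```

With this shape, a single row read returns a user's own features together with the features of every edge leaving that user.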
U1




U2        U3
HBase
Hadoop Tips & Tricks

• Joins
   • Distributed cache
   • Hive map side joins
• Hive
   • Nice set of statistical functions
   • Lots of hive queries
• HBase
   •   Lots of memory
   •   WAL
   •   LZO
   •   Proper configs
   •   Avoid hot region servers
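The join tips above share one idea: keep the small table in memory on every mapper and stream the large one past it. A sketch of that pattern in plain Python, with made-up names and data; on a real cluster the small table would be shipped to each task, for example through the distributed cache.

```python
def map_side_join(big_records, small_table):
    """Join streamed (uid, eid) records against an in-memory eid -> category table."""
    for uid, eid in big_records:
        category = small_table.get(eid)
        if category is not None:        # inner join: drop unmatched keys
            yield uid, eid, category

event_category = {"E1": "Music", "E2": "Tech"}          # small side, fits in memory
orders = [("U1", "E1"), ("U2", "E2"), ("U3", "E9")]     # large side, streamed

joined = list(map_side_join(orders, event_category))
```

No shuffle is needed because each record is joined where it is read, which is why this only works when one side fits in memory.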
Hadoop Tips & Tricks


•   Combiners did not work
•   Shuffle and Merge
More Innovation


•   Rethink everything
•   Add social to search
•   Add time series features
•   Real time updates using firehose and Storm
•   Various sorts of data
Developers! Developers! Developers!


• Interested in scaling, messaging, data, machine learning,
  mobile, services

• We will continue to push the boundaries of hard
  problems

• jobs@eventbrite.com
• vipul@eventbrite.com
Storm at Eventbrite

Tuesday August 21, 2012 at Eventbrite HQ

How we are using Storm for real time processing of our data

http://www.eventbrite.com/event/4010290888


Andrew Whang, whang@eventbrite.com
Questions?

Eventbrite Data Platform Talk for SFDM
