Eventbrite Data Platform Talk for SFDM

Data Platform
Vipul Sharma – vipul@eventbrite.com

A social event ticketing and discovery platform

$1B total sales
68M tickets sold
1.4M events hosted
0.5M organizers served
23M attendees served
12 countries

Event Lifecycle

Conception → Creation → Discovery → Sale → Organization → Post Event

Discovery
• Search
• Recommendation
• Social

Analytics
• Data warehouse and Metrics
• Internal and External reporting
• Real Time and Batch Analytics

Abuse Prevention
• Spam
• Fraud
• TOS

Analytics

• Ad hoc queries by Analysts

Hadoop Cluster

• 30 persistent EC2 High-Memory Instances
• 30TB disk with replication factor of 2, ext3 formatted
• CDH3
• Fair Scheduler
• HBase

Infrastructure

• Search
  – Solr
  – Incremental updates towards event driven
• Recommendation/Graph
  – Hadoop
  – Native Java MapReduce
  – Bash for workflow
• Social
  – Cassandra
  – Denormalized view
• Persistence
  – MySQL
  – HDFS
  – HBase
  – MongoDB (Moving to Cassandra)

Infrastructure

• Stream
  – RabbitMQ
  – Internal Fire hose
  – Storm
• Offline
  – MapReduce
  – Streaming
  – Hive
  – Hue

Discovery
Social, Interest, Local

Attendees

Events

Organizers

Categorization - Prism

Tech
Conference
Music

Sports

Prism - Features

• Supervised Learning
• Logistic Regression using MLE
• Pairwise classification into 20 categories
• High precision lower recall
• Use mapreduce for feature extraction
• Use for clustering as well

Prism – Training Data

• Binary classification for each category
• Training data needed for positive and negative
• Conference and not Conference
• Sports and not Sports
• Samasource and Crowdflower
• Stem words to create initial set
• Positive, negative, negative with stem words

Prism - Features

• Convert Event and Organizer data in feature vector
• Event details, Organizer details, Ticket details
• Boolean representation of predefined attributes
• Words – tf-idf, dictionaries
• Phrases
• Domains
• Rules – regular expression
• Functions – business logic e.g. ticket price between $10-$20
• Compounds – boolean combinations of features using && and || rules
– <COMPOUND1>:techcrunch && disrupt && techcrunch.com
– <COMPOUND2>:COMPOUND1 && after && party
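The compound rules above can be sketched in a few lines. This is a minimal illustration, not Prism's implementation: the tokenizer, the `COMPOUNDS` table and `evaluate_compounds` are hypothetical names, only && rules are handled, and the second compound is assumed to build on the first.

```python
import re

def base_features(text):
    """Lowercase tokens in the text; dotted tokens such as domains stay whole."""
    return set(re.findall(r"[a-z0-9]+(?:\.[a-z0-9]+)*", text.lower()))

# Hypothetical rule table: every term in a list must be present (an && rule).
# A term may be a word, a domain, or the name of another compound.
COMPOUNDS = {
    "COMPOUND1": ["techcrunch", "disrupt", "techcrunch.com"],
    "COMPOUND2": ["COMPOUND1", "after", "party"],
}

def evaluate_compounds(text, compounds=COMPOUNDS):
    """Return the names of all compounds that fire for the given text."""
    present = base_features(text)
    fired = set()
    changed = True
    # Repeat until no new compound fires, so compounds can reference each other.
    while changed:
        changed = False
        for name, terms in compounds.items():
            if name in fired:
                continue
            if all(term in present or term in fired for term in terms):
                fired.add(name)
                changed = True
    return fired

fired = evaluate_compounds("TechCrunch Disrupt after party, details at techcrunch.com")
print(sorted(fired))
```

Each fired compound would then become one more boolean entry in the event's feature vector.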

Prism - Features

• Each feature is represented in various contexts
• Event Title, Event Description, Organizer Title, Organizer Description
• Each feature has meta info – Termclass
• <LANG_EN>, <CONF_LANG_EN>,<ADULT_LANG_EN>
• <SPORTS_LANG_EN>:<EVENT_TITLE>ball
• Feature vector is represented as sparse vector

+1 391158:1 401814:1 410526:1 411489:1 411606:2 413910:1 427659:1 438369:1 449735:1 449736:2 455478:1 456741:1 463188:1

693|||||warrior spirit's 3rd annual fundraising auction|||||1:<DESC>again,1:<NAME>annual,1:<DESC>annual,2:<DESC>approaching,2:<NAME>auction,4:<DESC>auction,2:<DESC>auctions,2:<DESC>bring
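The first sample line looks like the common "label index:value" sparse format. A minimal parser might look like this, assuming that layout; `parse_sparse_line` is a hypothetical helper, not part of Prism.

```python
def parse_sparse_line(line):
    """Parse 'label index:value index:value ...' into (label, {index: value})."""
    parts = line.split()
    label = int(parts[0])  # "+1" or "-1"
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features

label, features = parse_sparse_line("+1 391158:1 401814:1 411606:2")
print(label, features)
```

Only non-zero entries are stored, which keeps the vectors small when the feature space has hundreds of thousands of dimensions.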

Prism - Training

• Binary classifier
• Multiclass less accurate
• Each event gets classified against the 20 categories
• MapReduce for creating sparse matrix
• MapReduce for batch classification
• Distributed cache for feature set and models
• We can use same sparse matrix for clustering
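As a rough sketch of the one-binary-classifier-per-category idea, the snippet below scores a sparse event vector against a logistic model for each category. The weights, feature indices and the 0.8 threshold are invented for illustration; a high threshold is one way to favor precision over recall.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(weights, bias, features):
    """Probability that a sparse feature vector belongs to one category."""
    z = bias + sum(weights.get(i, 0.0) * v for i, v in features.items())
    return sigmoid(z)

def classify(models, features, threshold=0.8):
    """Run every binary model and keep the categories that clear the threshold."""
    return sorted(
        category
        for category, (weights, bias) in models.items()
        if score(weights, bias, features) >= threshold
    )

# Hypothetical trained models: category -> (sparse weights, bias).
MODELS = {
    "Conference": ({1: 2.0, 2: 1.5}, -1.0),
    "Sports": ({3: 2.5}, -1.0),
}

event = {1: 1.0, 2: 1.0}  # sparse feature vector for one event
print(classify(MODELS, event))
```

In a batch job the models would be loaded once per mapper (for example from the distributed cache) and `classify` applied to each event record.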

Attendee

• What are your interests? - Prism
• Who are your friends? – Explicit and Implicit
• What are the interests of your friends? - Prism
• Which of your friends share your interests? – IBG
• Location of users and events
  – Location of purchased events
  – Facebook location
  – Our database
  – Other signals – IP, mobile app, etc.

You will like to attend this event

Recommendation Engines

• Item Hierarchy (You bought camera so you need batteries - Amazon)
• Collaborative Filtering – User-User Similarity (People who bought camera also bought batteries - Amazon)
• Collaborative Filtering – Item-Item similarity (You like Godfather so you will like Scarface - Netflix)
• Social Graph Based (Your friends like Lady Gaga so you will like Lady Gaga, PYMK – Facebook, LinkedIn)
• Interest Graph Based (Your friends who like rock music like you are attending Eric Clapton Event – Eventbrite)

Why Interest?

Events are Social
Events are Interest

Dense Graph is Irrelevant
Interests are Changing

How do we know your Interest?

• We ask you
• Based on your activity
• Events Attended
• Events Browsed (In Future)
• Facebook Interests
• User Interest has to match Event category
• Static
• Prism

Model Based vs Clustering

Item-Item vs User-User

Building Social Graph is Clustering Step

Social Graph Recommendation is a Ranking Problem

Implicit Social Graph

[Diagram: users U1–U5 linked to one another through shared events E1–E4]

Mixed Social Graph

[Diagram: users U1–U5 linked through shared events E1–E3 and through FB and LI connections]

23M * 260 * 260 = 1.5 Trillion Edges
6 Billion edges ranked
Each node is a feature vector representing a User

Each edge is a feature vector representing a Relationship

Feature Generation

• Mixed Features
• A series of map-reduce jobs
• Output on HDFS in flat files; Input to subsequent jobs
• Orders = Event → Attendees
  – MAP: eid: uid
  – REDUCE: eid:[uid]
• Attendees → Social Graph
  – Input: eid:[uid]
  – MAP: uid_i:[uid]
  – REDUCE: uid:[neighbors]
• Interest based features, user specific, graph mining etc
• Upload feature values to HBase
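The two jobs outlined above can be simulated in memory to show the data flow. This is a sketch under assumptions: `run_mapreduce` is a toy stand-in for Hadoop, the sample orders are invented, and the real jobs were native Java MapReduce reading and writing flat files on HDFS.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy single-process MapReduce: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]

# Job 1: orders -> attendees per event (MAP eid:uid, REDUCE eid:[uid]).
def orders_map(order):
    eid, uid = order
    yield eid, uid

def orders_reduce(eid, uids):
    return eid, sorted(set(uids))

# Job 2: attendees per event -> neighbors per user
# (MAP uid_i:[uid], REDUCE uid:[neighbors]).
def attendees_map(record):
    eid, uids = record
    for uid in uids:
        yield uid, [other for other in uids if other != uid]

def attendees_reduce(uid, neighbor_lists):
    neighbors = set()
    for neighbor_list in neighbor_lists:
        neighbors.update(neighbor_list)
    return uid, sorted(neighbors)

orders = [("E1", "U1"), ("E1", "U2"), ("E1", "U3"), ("E2", "U2"), ("E2", "U4")]
event_attendees = run_mapreduce(orders, orders_map, orders_reduce)
social_graph = dict(run_mapreduce(event_attendees, attendees_map, attendees_reduce))
print(social_graph)
```

Two users become neighbors whenever they attended the same event, which is what makes this an implicit social graph.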

HBase

• Why HBase?
• To process 6B edges, look up features for each node and each edge
• At 1,000 lookups/sec: 6B / 1000 / 86400 ≈ 70 days!!
• At 1M lookups/sec: 6B / 1M = 6,000 sec ≈ 1.7 hrs
• Processing 1.3 TB of data with mapreduce
• Collect data from multiple Map Reduce jobs
• Stores entire social graph
• Features for each node and edge
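The back-of-the-envelope numbers can be checked directly. The sketch assumes one feature lookup per edge and a constant lookup rate; both are simplifications.

```python
EDGES = 6_000_000_000
SECONDS_PER_DAY = 86_400
SECONDS_PER_HOUR = 3_600

def processing_seconds(edges, lookups_per_second):
    """Time to touch every edge once at a fixed lookup rate."""
    return edges / lookups_per_second

slow_days = processing_seconds(EDGES, 1_000) / SECONDS_PER_DAY
fast_hours = processing_seconds(EDGES, 1_000_000) / SECONDS_PER_HOUR

print(f"1K lookups/sec: {slow_days:.1f} days")
print(f"1M lookups/sec: {fast_hours:.2f} hours")
```

This gives about 69.4 days at 1,000 lookups per second and about 1.67 hours at one million per second, in line with the slide's rounded figures.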

Data Model

Rowkey   U          UU
uid1     f1 f2 f3   uid2:f4 uid2:f5 uid3:f4

rowid     neighbors   events   featureX
2718282   101         3        0.3678795

rowid     314159:n   314159:e   314159:fx   161803:n   161803:e   161803:fx
2718282   31         1          0.3183      83         2          0.618
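One way to read this layout is as one row per user, with node features in column family U and per-neighbor edge features in column family UU under qualifiers of the form neighbor_uid:feature. The nested dictionary below models that reading; it is an interpretation of the slide rather than the actual schema, and the helper names are hypothetical.

```python
# row key -> column family -> qualifier -> value
graph_table = {
    "2718282": {
        "U": {"neighbors": 101, "events": 3, "featureX": 0.3678795},
        "UU": {
            "314159:n": 31, "314159:e": 1, "314159:fx": 0.3183,
            "161803:n": 83, "161803:e": 2, "161803:fx": 0.618,
        },
    }
}

def node_features(table, uid):
    """Features describing the user itself (column family U)."""
    return table[uid]["U"]

def edge_features(table, uid, neighbor):
    """Features describing the edge uid -> neighbor (column family UU)."""
    prefix = neighbor + ":"
    return {
        qualifier[len(prefix):]: value
        for qualifier, value in table[uid]["UU"].items()
        if qualifier.startswith(prefix)
    }

print(node_features(graph_table, "2718282"))
print(edge_features(graph_table, "2718282", "161803"))
```

With this shape, a single row read returns everything needed to rank all of a user's edges.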

Hadoop Tips & Tricks

• Joins
  – Distributed cache
  – Hive map side joins
• Hive
  – Nice set of statistical functions
  – Lots of hive queries
• HBase
  – Lots of memory
  – WAL
  – LZO
  – Proper configs
  – Avoid hot region servers
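The distributed-cache join tip amounts to a map-side join: the small table is shipped to every mapper and held in memory, so the large table can be joined while it streams through, with no shuffle. A minimal in-memory sketch follows; the table contents and `map_side_join` are illustrative only.

```python
def map_side_join(large_records, small_table):
    """Join streamed (key, value) records against an in-memory lookup table."""
    for key, value in large_records:
        if key in small_table:  # inner join: drop keys with no match
            yield key, value, small_table[key]

# Small side: fits in memory (in Hadoop it would come from the distributed cache).
event_category = {"E1": "Music", "E2": "Tech"}

# Large side: streamed one record at a time.
orders = [("E1", "U1"), ("E2", "U2"), ("E3", "U3"), ("E1", "U4")]

joined = list(map_side_join(orders, event_category))
print(joined)
```

This only works while the small side fits in each mapper's memory; otherwise a reduce-side join is needed.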

Hadoop Tips & Tricks

• Combiners did not work
• Shuffle and Merge

More Innovation

• Rethink everything
• Add social to search
• Add time series features
• Real time updates using firehose and storm
• Various sorts of data

Developers! Developers! Developers!

• Interested in scaling, messaging, data, machine learning, mobile, services

• We will continue to push the boundaries of hard problems

• jobs@eventbrite.com
• vipul@eventbrite.com

Storm at Eventbrite

Tuesday August 21, 2012 at Eventbrite HQ

How we are using Storm for real time processing of our data

http://www.eventbrite.com/event/4010290888

Andrew Whang – whang@eventbrite.com
