SlideShare a Scribd company logo
1 of 60
Rainbird:
Real-time Analytics @Twitter
Kevin Weil -- @kevinweil
Product Lead for Revenue, Twitter




                                    TM
Agenda
‣   Why Real-time Analytics?
‣   Rainbird and Cassandra
‣   Production Uses at Twitter
‣   Open Source
My Background
‣   Mathematics and Physics at Harvard, Physics at
    Stanford
‣   Tropos Networks (city-wide wireless): mesh
    routing algorithms, GBs of data
‣   Cooliris (web media): Hadoop and Pig for
    analytics, TBs of data
‣   Twitter: Hadoop, Pig, HBase, Cassandra, data
    viz, social graph analysis, soon to be PBs of data
My Background
‣   Mathematics and Physics at Harvard, Physics at
    Stanford
‣   Tropos Networks (city-wide wireless): mesh
    routing algorithms, GBs of data
‣   Cooliris (web media): Hadoop and Pig for
    analytics, TBs of data
‣   Twitter: Hadoop, Pig, HBase, Cassandra, data
    viz, social graph analysis, soon to be PBs of data
    Now revenue products!
Agenda
‣   Why Real-time Analytics?
‣   Rainbird and Cassandra
‣   Production Uses at Twitter
‣   Open Source
Why Real-time Analytics
‣   Twitter is real-time
Why Real-time Analytics
‣   Twitter is real-time
‣   ... even in space
And My Personal Favorite
And My Personal Favorite
Real-time Reporting
‣   Discussion around ad-based revenue model
‣   Help shape the conversation in real-time with
    Promoted Tweets
Real-time Reporting
‣   Discussion around ad-based revenue model
‣   Help shape the conversation in real-time with
    Promoted Tweets
‣   Realtime reporting
    ties it all together
Agenda
‣   Why Real-time Analytics?
‣   Rainbird and Cassandra
‣   Production Uses at Twitter
‣   Open Source
Requirements
‣   Extremely high write volume
‣      Needs to scale to 100,000s of WPS
Requirements
‣   Extremely high write volume
‣      Needs to scale to 100,000s of WPS

‣   High read volume
‣      Needs to scale to 10,000s of RPS
Requirements
‣   Extremely high write volume
‣      Needs to scale to 100,000s of WPS

‣   High read volume
‣      Needs to scale to 10,000s of RPS

‣   Horizontally scalable (reads, storage, etc)
‣      Needs to scale to 100+ TB
Requirements
‣   Extremely high write volume
‣      Needs to scale to 100,000s of WPS

‣   High read volume
‣      Needs to scale to 10,000s of RPS

‣   Horizontally scalable (reads, storage, etc)
‣      Needs to scale to 100+ TB

‣   Low latency
‣      Most reads <100 ms (esp. recent data)
Cassandra
‣   Pro: In-house expertise
‣   Pro: Open source Apache project
‣   Pro: Writes are extremely fast
‣   Pro: Horizontally scalable, low latency
‣   Pro: Other startup adoption (Digg, SimpleGeo)
Cassandra
‣   Pro: In-house expertise
‣   Pro: Open source Apache project
‣   Pro: Writes are extremely fast
‣   Pro: Horizontally scalable, low latency
‣   Pro: Other startup adoption (Digg, SimpleGeo)




‣   Con: It was really young (0.3a)
Cassandra
‣   Pro: Some dudes at Digg had already started
    working on distributed atomic counters in
    Cassandra
Cassandra
‣   Pro: Some dudes at Digg had already started
    working on distributed atomic counters in
    Cassandra
‣   Say hi to @kelvin
Cassandra
‣   Pro: Some dudes at Digg had already started
    working on distributed atomic counters in
    Cassandra
‣   Say hi to @kelvin
‣   And @lenn0x
Cassandra
‣   Pro: Some dudes at Digg had already started
    working on distributed atomic counters in
    Cassandra
‣   Say hi to @kelvin
‣   And @lenn0x
‣   A dude from
    Sweden began helping: @skr
Cassandra
‣   Pro: Some dudes at Digg had already started
    working on distributed atomic counters in
    Cassandra
‣   Say hi to @kelvin
‣   And @lenn0x
‣   A dude from
    Sweden began helping: @skr


‣   Now all at Twitter :)
Rainbird
‣   It counts things. Really quickly.
‣   Layers on top of the distributed
    counters patch, CASSANDRA-1072
Rainbird
‣   It counts things. Really quickly.
‣   Layers on top of the distributed
    counters patch, CASSANDRA-1072


‣   Relies on Zookeeper, Cassandra, Scribe, Thrift
‣   Written in Scala
Rainbird Design
‣   Aggregators
    buffer for 1m
‣   Intelligent
    flush to
    Cassandra
‣   Query
    servers read
    once written
‣   1m is
    configurable
Rainbird Data Structures
struct Event
{
    1: i32 timestamp,
    2: string category,
    3: list<string> key,
    4: i64 value,
    5: optional set<Property> properties,
    6: optional map<Property, i64> propertiesWithCounts
}
Rainbird Data Structures
struct Event
{                               Unix timestamp of event
    1: i32 timestamp,
    2: string category,
    3: list<string> key,
    4: i64 value,
    5: optional set<Property> properties,
    6: optional map<Property, i64> propertiesWithCounts
}
Rainbird Data Structures
struct Event
{                               Stat category name
    1: i32 timestamp,
    2: string category,
    3: list<string> key,
    4: i64 value,
    5: optional set<Property> properties,
    6: optional map<Property, i64> propertiesWithCounts
}
Rainbird Data Structures
struct Event
{                               Stat keys (hierarchical)
    1: i32 timestamp,
    2: string category,
    3: list<string> key,
    4: i64 value,
    5: optional set<Property> properties,
    6: optional map<Property, i64> propertiesWithCounts
}
Rainbird Data Structures
struct Event
{                               Actual count (diff)
    1: i32 timestamp,
    2: string category,
    3: list<string> key,
    4: i64 value,
    5: optional set<Property> properties,
    6: optional map<Property, i64> propertiesWithCounts
}
Rainbird Data Structures
struct Event
{                               More later
    1: i32 timestamp,
    2: string category,
    3: list<string> key,
    4: i64 value,
    5: optional set<Property> properties,
    6: optional map<Property, i64> propertiesWithCounts
}
Hierarchical Aggregation
‣   Say we’re counting Promoted Tweet impressions
‣   category = pti
‣   keys = [advertiser_id, campaign_id, tweet_id]
‣   count = 1
‣   Rainbird automatically increments the count for
‣      [advertiser_id, campaign_id, tweet_id]
‣      [advertiser_id, campaign_id]
‣      [advertiser_id]
‣   Means fast queries over each level of hierarchy
‣   Configurable in rainbird.conf, or dynamically via ZK
Hierarchical Aggregation
‣   Another example: tracking URL shortener tweets/clicks
‣   full URL = http://music.amazon.com/some_really_long_path
‣   keys = [com, amazon, music, full URL]
‣   count = 1
‣   Rainbird automatically increments the count for
‣      [com, amazon, music, full URL]
‣      [com, amazon, music]
‣      [com, amazon]
‣      [com]
‣   Means we can count clicks on full URLs
‣   And automatically aggregate over domains and subdomains!
Hierarchical Aggregation
‣   Another example: tracking URL shortener tweets/clicks
‣   full URL = http://music.amazon.com/some_really_long_path
‣   keys = [com, amazon, music, full URL]
‣   count = 1
‣   Rainbird automatically increments the count for
‣      [com, amazon, music, full URL]
‣      [com, amazon, music]          How many people tweeted
‣      [com, amazon]                 full URL?
‣      [com]
‣   Means we can count clicks on full URLs
‣   And automatically aggregate over domains and subdomains!
Hierarchical Aggregation
‣   Another example: tracking URL shortener tweets/clicks
‣   full URL = http://music.amazon.com/some_really_long_path
‣   keys = [com, amazon, music, full URL]
‣   count = 1
‣   Rainbird automatically increments the count for
‣      [com, amazon, music, full URL]
‣      [com, amazon, music]          How many people tweeted
‣      [com, amazon]                 any music.amazon.com URL?
‣      [com]
‣   Means we can count clicks on full URLs
‣   And automatically aggregate over domains and subdomains!
Hierarchical Aggregation
‣   Another example: tracking URL shortener tweets/clicks
‣   full URL = http://music.amazon.com/some_really_long_path
‣   keys = [com, amazon, music, full URL]
‣   count = 1
‣   Rainbird automatically increments the count for
‣      [com, amazon, music, full URL]
‣      [com, amazon, music]          How many people tweeted
‣      [com, amazon]                 any amazon.com URL?
‣      [com]
‣   Means we can count clicks on full URLs
‣   And automatically aggregate over domains and subdomains!
Hierarchical Aggregation
‣   Another example: tracking URL shortener tweets/clicks
‣   full URL = http://music.amazon.com/some_really_long_path
‣   keys = [com, amazon, music, full URL]
‣   count = 1
‣   Rainbird automatically increments the count for
‣      [com, amazon, music, full URL]
‣      [com, amazon, music]          How many people tweeted
‣      [com, amazon]                 any .com URL?
‣      [com]
‣   Means we can count clicks on full URLs
‣   And automatically aggregate over domains and subdomains!
Temporal Aggregation
‣   Rainbird also does (configurable) temporal
    aggregation
‣   Each count is kept minutely, but also
    denormalized hourly, daily, and all time
‣   Gives us quick counts at varying granularities
    with no large scans at read time
‣      Trading storage for latency
Multiple Formulas
‣   So far we have talked about sums
‣   Could also store counts (1 for each event)
‣   ... which gives us a mean
‣   And sums of squares (count * count for each event)
‣   ... which gives us a standard deviation
‣   And min/max as well


‣   Configure this per-category in rainbird.conf
Rainbird
‣   Write 100,000s of events per second, each with
    hierarchical structure
‣   Query with minutely granularity over any level of
    the hierarchy, get back a time series
‣   Or query all time values
‣   Or query all time means, standard deviations
‣   Latency < 100ms
Agenda
‣   Why Real-time Analytics?
‣   Rainbird and Cassandra
‣   Production Uses at Twitter
‣   Open Source
Production Uses
‣   It turns out we need to count things all the time
‣   As soon as we had this service, we started
    finding all sorts of use cases for it
‣      Promoted Products
‣      Tweeted URLs, by domain/subdomain
‣      Per-user Tweet interactions (fav, RT, follow)
‣      Arbitrary terms in Tweets
‣      Clicks on t.co URLs
Use Cases
‣   Promoted Tweet Analytics
Each different metric is part
Production Uses                of the key hierarchy

‣   Promoted Tweet Analytics
Uses the temporal
                               aggregation to quickly show
Production Uses                different levels of granularity

‣   Promoted Tweet Analytics
Data can be historical, or
Production Uses                from 60 seconds ago

‣   Promoted Tweet Analytics
Production Uses
‣   Internal Monitoring and Alerting




‣   We require operational reporting on all internal services
‣   Needs to be real-time, but also want longer-term
    aggregates
‣   Hierarchical, too: [stat,   datacenter, service, machine]
Production Uses
‣   Tweet Button Counts




‣   Tweet Button counts are requested many many
    times each day from across the web
‣   Uses the all time field
Agenda
‣   Why Real-time Analytics?
‣   Rainbird and Cassandra
‣   Production Uses at Twitter
‣   Open Source
Open Source?
‣   Yes!
Open Source?
‣   Yes!   ... but not yet
Open Source?
‣   Yes!   ... but not yet
‣   Relies on unreleased version of Cassandra
Open Source?
‣   Yes!   ... but not yet
‣   Relies on unreleased version of Cassandra
‣      ... but the counters patch is committed in trunk (0.8)
Open Source?
‣   Yes!   ... but not yet
‣   Relies on unreleased version of Cassandra
‣      ... but the counters patch is committed in trunk (0.8)
‣      ... also relies on some internal frameworks we need to
    open source
Open Source?
‣   Yes!   ... but not yet
‣   Relies on unreleased version of Cassandra
‣      ... but the counters patch is committed in trunk (0.8)
‣      ... also relies on some internal frameworks we need to
    open source
‣   It will happen
Open Source?
‣   Yes!   ... but not yet
‣   Relies on unreleased version of Cassandra
‣      ... but the counters patch is committed in trunk (0.8)
‣      ... also relies on some internal frameworks we need to
    open source
‣   It will happen
‣   See http://github.com/twitter for proof of how much
    Twitter    open source
Team
‣   John Corwin (@johnxorz)
‣   Adam Samet (@damnitsamet)
‣   Johan Oskarsson (@skr)
‣   Kelvin Kakugawa (@kelvin)
‣   Chris Goffinet (@lenn0x)
‣   Steve Jiang (@sjiang)
‣   Kevin Weil (@kevinweil)
If You Only Remember One Slide...
‣   Rainbird is a distributed, high-volume counting
    service built on top of Cassandra
‣   Write 100,000s events per second, query it with
    hierarchy and multiple time granularities, returns
    results in <100 ms
‣   Used by Twitter for multiple products internally,
    including our Promoted Products, operational
    monitoring and Tweet Button
‣   Will be open sourced so the community can use and
    improve it!
Questions?
        Follow me: @kevinweil




                       TM

More Related Content

What's hot

Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDBBuilding a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDBCody Ray
 
Hadoop Workshop on EC2 : March 2015
Hadoop Workshop on EC2 : March 2015Hadoop Workshop on EC2 : March 2015
Hadoop Workshop on EC2 : March 2015IMC Institute
 
Scaling to 1,000,000 concurrent users on the JVM
Scaling to 1,000,000 concurrent users on the JVMScaling to 1,000,000 concurrent users on the JVM
Scaling to 1,000,000 concurrent users on the JVMPursuit Consulting
 
Big data processing using Hadoop with Cloudera Quickstart
Big data processing using Hadoop with Cloudera QuickstartBig data processing using Hadoop with Cloudera Quickstart
Big data processing using Hadoop with Cloudera QuickstartIMC Institute
 
Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014Samir Bessalah
 
Apache Spark & Hadoop : Train-the-trainer
Apache Spark & Hadoop : Train-the-trainerApache Spark & Hadoop : Train-the-trainer
Apache Spark & Hadoop : Train-the-trainerIMC Institute
 
AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
AWS Summit Tel Aviv - Startup Track - Data Analytics & Big DataAWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
AWS Summit Tel Aviv - Startup Track - Data Analytics & Big DataAmazon Web Services
 
Scaling Redis Workloads with Amazon ElastiCache - AWS Online Tech Talks
Scaling Redis Workloads with Amazon ElastiCache - AWS Online Tech TalksScaling Redis Workloads with Amazon ElastiCache - AWS Online Tech Talks
Scaling Redis Workloads with Amazon ElastiCache - AWS Online Tech TalksAmazon Web Services
 
Install Apache Hadoop for Development/Production
Install Apache Hadoop for  Development/ProductionInstall Apache Hadoop for  Development/Production
Install Apache Hadoop for Development/ProductionIMC Institute
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RRadek Maciaszek
 
Big data processing using Cloudera Quickstart
Big data processing using Cloudera QuickstartBig data processing using Cloudera Quickstart
Big data processing using Cloudera QuickstartIMC Institute
 
How LinkedIn Uses Scalding for Data Driven Product Development
How LinkedIn Uses Scalding for Data Driven Product DevelopmentHow LinkedIn Uses Scalding for Data Driven Product Development
How LinkedIn Uses Scalding for Data Driven Product DevelopmentSasha Ovsankin
 
개발자가 알아두면 좋을 5가지 AWS 인공 지능 깨알 지식 - 윤석찬 (AWS 테크 에반젤리스트)
개발자가 알아두면 좋을 5가지 AWS 인공 지능 깨알 지식 - 윤석찬 (AWS 테크 에반젤리스트)개발자가 알아두면 좋을 5가지 AWS 인공 지능 깨알 지식 - 윤석찬 (AWS 테크 에반젤리스트)
개발자가 알아두면 좋을 5가지 AWS 인공 지능 깨알 지식 - 윤석찬 (AWS 테크 에반젤리스트)Amazon Web Services Korea
 
Advanced Container Management and Scheduling - DevDay Los Angeles 2017
Advanced Container Management and Scheduling - DevDay Los Angeles 2017Advanced Container Management and Scheduling - DevDay Los Angeles 2017
Advanced Container Management and Scheduling - DevDay Los Angeles 2017Amazon Web Services
 
Big Data Ecosystem - 1000 Simulated Drones
Big Data Ecosystem - 1000 Simulated DronesBig Data Ecosystem - 1000 Simulated Drones
Big Data Ecosystem - 1000 Simulated DronesEspeo Software
 
Big data ecosystem
Big data ecosystemBig data ecosystem
Big data ecosystemSlideCentral
 
RedisConf18 - Introducing RediSearch Aggregations
RedisConf18 - Introducing RediSearch AggregationsRedisConf18 - Introducing RediSearch Aggregations
RedisConf18 - Introducing RediSearch AggregationsRedis Labs
 

What's hot (20)

Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDBBuilding a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
 
Hadoop Workshop on EC2 : March 2015
Hadoop Workshop on EC2 : March 2015Hadoop Workshop on EC2 : March 2015
Hadoop Workshop on EC2 : March 2015
 
Scaling to 1,000,000 concurrent users on the JVM
Scaling to 1,000,000 concurrent users on the JVMScaling to 1,000,000 concurrent users on the JVM
Scaling to 1,000,000 concurrent users on the JVM
 
Linux intermediate level
Linux intermediate levelLinux intermediate level
Linux intermediate level
 
Big data processing using Hadoop with Cloudera Quickstart
Big data processing using Hadoop with Cloudera QuickstartBig data processing using Hadoop with Cloudera Quickstart
Big data processing using Hadoop with Cloudera Quickstart
 
Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014Algebird : Abstract Algebra for big data analytics. Devoxx 2014
Algebird : Abstract Algebra for big data analytics. Devoxx 2014
 
Apache Spark & Hadoop : Train-the-trainer
Apache Spark & Hadoop : Train-the-trainerApache Spark & Hadoop : Train-the-trainer
Apache Spark & Hadoop : Train-the-trainer
 
AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
AWS Summit Tel Aviv - Startup Track - Data Analytics & Big DataAWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
 
Scaling Redis Workloads with Amazon ElastiCache - AWS Online Tech Talks
Scaling Redis Workloads with Amazon ElastiCache - AWS Online Tech TalksScaling Redis Workloads with Amazon ElastiCache - AWS Online Tech Talks
Scaling Redis Workloads with Amazon ElastiCache - AWS Online Tech Talks
 
Install Apache Hadoop for Development/Production
Install Apache Hadoop for  Development/ProductionInstall Apache Hadoop for  Development/Production
Install Apache Hadoop for Development/Production
 
Amazon ElastiCache and Redis
Amazon ElastiCache and RedisAmazon ElastiCache and Redis
Amazon ElastiCache and Redis
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and R
 
Big data processing using Cloudera Quickstart
Big data processing using Cloudera QuickstartBig data processing using Cloudera Quickstart
Big data processing using Cloudera Quickstart
 
Aws r
Aws rAws r
Aws r
 
How LinkedIn Uses Scalding for Data Driven Product Development
How LinkedIn Uses Scalding for Data Driven Product DevelopmentHow LinkedIn Uses Scalding for Data Driven Product Development
How LinkedIn Uses Scalding for Data Driven Product Development
 
개발자가 알아두면 좋을 5가지 AWS 인공 지능 깨알 지식 - 윤석찬 (AWS 테크 에반젤리스트)
개발자가 알아두면 좋을 5가지 AWS 인공 지능 깨알 지식 - 윤석찬 (AWS 테크 에반젤리스트)개발자가 알아두면 좋을 5가지 AWS 인공 지능 깨알 지식 - 윤석찬 (AWS 테크 에반젤리스트)
개발자가 알아두면 좋을 5가지 AWS 인공 지능 깨알 지식 - 윤석찬 (AWS 테크 에반젤리스트)
 
Advanced Container Management and Scheduling - DevDay Los Angeles 2017
Advanced Container Management and Scheduling - DevDay Los Angeles 2017Advanced Container Management and Scheduling - DevDay Los Angeles 2017
Advanced Container Management and Scheduling - DevDay Los Angeles 2017
 
Big Data Ecosystem - 1000 Simulated Drones
Big Data Ecosystem - 1000 Simulated DronesBig Data Ecosystem - 1000 Simulated Drones
Big Data Ecosystem - 1000 Simulated Drones
 
Big data ecosystem
Big data ecosystemBig data ecosystem
Big data ecosystem
 
RedisConf18 - Introducing RediSearch Aggregations
RedisConf18 - Introducing RediSearch AggregationsRedisConf18 - Introducing RediSearch Aggregations
RedisConf18 - Introducing RediSearch Aggregations
 

Viewers also liked

Previews Presentation 2010
Previews Presentation 2010Previews Presentation 2010
Previews Presentation 2010Alex Caraco
 
Prestige Sales Showcase Final 2008
Prestige Sales Showcase Final 2008Prestige Sales Showcase Final 2008
Prestige Sales Showcase Final 2008Alex Caraco
 
Coldwell Banker Sharks News Articles Print &amp; Web
Coldwell Banker  Sharks News Articles  Print &amp; WebColdwell Banker  Sharks News Articles  Print &amp; Web
Coldwell Banker Sharks News Articles Print &amp; WebAlex Caraco
 
Franchise Times 2009 Top 200
Franchise Times 2009 Top 200Franchise Times 2009 Top 200
Franchise Times 2009 Top 200Alex Caraco
 
Cb Intro Presentation Final April 2010
Cb Intro Presentation Final    April 2010Cb Intro Presentation Final    April 2010
Cb Intro Presentation Final April 2010Alex Caraco
 
Global Ad Program0209 Rates And Info
Global Ad Program0209 Rates And InfoGlobal Ad Program0209 Rates And Info
Global Ad Program0209 Rates And InfoAlex Caraco
 

Viewers also liked (7)

Previews Presentation 2010
Previews Presentation 2010Previews Presentation 2010
Previews Presentation 2010
 
Prestige Sales Showcase Final 2008
Prestige Sales Showcase Final 2008Prestige Sales Showcase Final 2008
Prestige Sales Showcase Final 2008
 
BedrijvenAPK
BedrijvenAPKBedrijvenAPK
BedrijvenAPK
 
Coldwell Banker Sharks News Articles Print &amp; Web
Coldwell Banker  Sharks News Articles  Print &amp; WebColdwell Banker  Sharks News Articles  Print &amp; Web
Coldwell Banker Sharks News Articles Print &amp; Web
 
Franchise Times 2009 Top 200
Franchise Times 2009 Top 200Franchise Times 2009 Top 200
Franchise Times 2009 Top 200
 
Cb Intro Presentation Final April 2010
Cb Intro Presentation Final    April 2010Cb Intro Presentation Final    April 2010
Cb Intro Presentation Final April 2010
 
Global Ad Program0209 Rates And Info
Global Ad Program0209 Rates And InfoGlobal Ad Program0209 Rates And Info
Global Ad Program0209 Rates And Info
 

Similar to Realtimeanalyticsattwitter strata2011-110204123031-phpapp02

Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)Kevin Weil
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogC4Media
 
[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson
[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson
[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac DawsonCODE BLUE
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Kevin Weil
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleSean Chittenden
 
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleData Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleSriram Krishnan
 
Realtime Analytics on AWS
Realtime Analytics on AWSRealtime Analytics on AWS
Realtime Analytics on AWSSungmin Kim
 
Real-time Fraud Detection for Southeast Asia’s Leading Mobile Platform
Real-time Fraud Detection for Southeast Asia’s Leading Mobile PlatformReal-time Fraud Detection for Southeast Asia’s Leading Mobile Platform
Real-time Fraud Detection for Southeast Asia’s Leading Mobile PlatformScyllaDB
 
Using Event Streams in Serverless Applications
Using Event Streams in Serverless ApplicationsUsing Event Streams in Serverless Applications
Using Event Streams in Serverless ApplicationsJonathan Dee
 
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Landon Robinson
 
2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo
2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo
2014 Import.io Data Summit - Including Hadoop/Impala Getting Started DemoIan Massingham
 
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...Amazon Web Services
 
AWS Cloud Kata 2014 | Jakarta - 2-3 Big Data
 AWS Cloud Kata 2014 | Jakarta - 2-3 Big Data AWS Cloud Kata 2014 | Jakarta - 2-3 Big Data
AWS Cloud Kata 2014 | Jakarta - 2-3 Big DataAmazon Web Services
 
TSAR (TimeSeries AggregatoR) Tech Talk
TSAR (TimeSeries AggregatoR) Tech TalkTSAR (TimeSeries AggregatoR) Tech Talk
TSAR (TimeSeries AggregatoR) Tech TalkAnirudh Todi
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringDatabricks
 
Functional architectural patterns
Functional architectural patternsFunctional architectural patterns
Functional architectural patternsLars Albertsson
 
Introduction to Artificial Intelligence and Machine Learning services at AWS ...
Introduction to Artificial Intelligence and Machine Learning services at AWS ...Introduction to Artificial Intelligence and Machine Learning services at AWS ...
Introduction to Artificial Intelligence and Machine Learning services at AWS ...Amazon Web Services
 
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPSimpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPDaniel Zivkovic
 

Similar to Realtimeanalyticsattwitter strata2011-110204123031-phpapp02 (20)

Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)Hadoop at Twitter (Hadoop Summit 2010)
Hadoop at Twitter (Hadoop Summit 2010)
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
 
[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson
[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson
[CB16] 80時間でWebを一周:クロムミウムオートメーションによるスケーラブルなフィンガープリント by Isaac Dawson
 
Streams
StreamsStreams
Streams
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at Scale
 
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleData Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
 
Realtime Analytics on AWS
Realtime Analytics on AWSRealtime Analytics on AWS
Realtime Analytics on AWS
 
Real-time Fraud Detection for Southeast Asia’s Leading Mobile Platform
Real-time Fraud Detection for Southeast Asia’s Leading Mobile PlatformReal-time Fraud Detection for Southeast Asia’s Leading Mobile Platform
Real-time Fraud Detection for Southeast Asia’s Leading Mobile Platform
 
Using Event Streams in Serverless Applications
Using Event Streams in Serverless ApplicationsUsing Event Streams in Serverless Applications
Using Event Streams in Serverless Applications
 
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
 
2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo
2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo
2014 Import.io Data Summit - Including Hadoop/Impala Getting Started Demo
 
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
 
AWS Cloud Kata 2014 | Jakarta - 2-3 Big Data
 AWS Cloud Kata 2014 | Jakarta - 2-3 Big Data AWS Cloud Kata 2014 | Jakarta - 2-3 Big Data
AWS Cloud Kata 2014 | Jakarta - 2-3 Big Data
 
TSAR (TimeSeries AggregatoR) Tech Talk
TSAR (TimeSeries AggregatoR) Tech TalkTSAR (TimeSeries AggregatoR) Tech Talk
TSAR (TimeSeries AggregatoR) Tech Talk
 
Tsar tech talk
Tsar tech talkTsar tech talk
Tsar tech talk
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
 
Functional architectural patterns
Functional architectural patternsFunctional architectural patterns
Functional architectural patterns
 
Introduction to Artificial Intelligence and Machine Learning services at AWS ...
Introduction to Artificial Intelligence and Machine Learning services at AWS ...Introduction to Artificial Intelligence and Machine Learning services at AWS ...
Introduction to Artificial Intelligence and Machine Learning services at AWS ...
 
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPSimpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
 

Recently uploaded

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 

Recently uploaded (20)

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

Realtimeanalyticsattwitter strata2011-110204123031-phpapp02

  • 1. Rainbird: Real-time Analytics @Twitter Kevin Weil -- @kevinweil Product Lead for Revenue, Twitter TM
  • 2. Agenda ‣ Why Real-time Analytics? ‣ Rainbird and Cassandra ‣ Production Uses at Twitter ‣ Open Source
  • 3. My Background ‣ Mathematics and Physics at Harvard, Physics at Stanford ‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data ‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data ‣ Twitter: Hadoop, Pig, HBase, Cassandra, data viz, social graph analysis, soon to be PBs of data
  • 4. My Background ‣ Mathematics and Physics at Harvard, Physics at Stanford ‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data ‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data ‣ Twitter: Hadoop, Pig, HBase, Cassandra, data viz, social graph analysis, soon to be PBs of data Now revenue products!
  • 5. Agenda ‣ Why Real-time Analytics? ‣ Rainbird and Cassandra ‣ Production Uses at Twitter ‣ Open Source
  • 6. Why Real-time Analytics ‣ Twitter is real-time
  • 7. Why Real-time Analytics ‣ Twitter is real-time ‣ ... even in space
  • 8. And My Personal Favorite
  • 9. And My Personal Favorite
  • 10. Real-time Reporting ‣ Discussion around ad-based revenue model ‣ Help shape the conversation in real-time with Promoted Tweets
  • 11. Real-time Reporting ‣ Discussion around ad-based revenue model ‣ Help shape the conversation in real-time with Promoted Tweets ‣ Realtime reporting ties it all together
  • 12. Agenda ‣ Why Real-time Analytics? ‣ Rainbird and Cassandra ‣ Production Uses at Twitter ‣ Open Source
  • 13. Requirements ‣ Extremely high write volume ‣ Needs to scale to 100,000s of WPS
  • 14. Requirements ‣ Extremely high write volume ‣ Needs to scale to 100,000s of WPS ‣ High read volume ‣ Needs to scale to 10,000s of RPS
  • 15. Requirements ‣ Extremely high write volume ‣ Needs to scale to 100,000s of WPS ‣ High read volume ‣ Needs to scale to 10,000s of RPS ‣ Horizontally scalable (reads, storage, etc) ‣ Needs to scale to 100+ TB
  • 16. Requirements ‣ Extremely high write volume ‣ Needs to scale to 100,000s of WPS ‣ High read volume ‣ Needs to scale to 10,000s of RPS ‣ Horizontally scalable (reads, storage, etc) ‣ Needs to scale to 100+ TB ‣ Low latency ‣ Most reads <100 ms (esp. recent data)
  • 17. Cassandra ‣ Pro: In-house expertise ‣ Pro: Open source Apache project ‣ Pro: Writes are extremely fast ‣ Pro: Horizontally scalable, low latency ‣ Pro: Other startup adoption (Digg, SimpleGeo)
  • 18. Cassandra ‣ Pro: In-house expertise ‣ Pro: Open source Apache project ‣ Pro: Writes are extremely fast ‣ Pro: Horizontally scalable, low latency ‣ Pro: Other startup adoption (Digg, SimpleGeo) ‣ Con: It was really young (0.3a)
  • 19. Cassandra ‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra
  • 20. Cassandra ‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra ‣ Say hi to @kelvin
  • 21. Cassandra ‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra ‣ Say hi to @kelvin ‣ And @lenn0x
  • 22. Cassandra ‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra ‣ Say hi to @kelvin ‣ And @lenn0x ‣ A dude from Sweden began helping: @skr
  • 23. Cassandra ‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra ‣ Say hi to @kelvin ‣ And @lenn0x ‣ A dude from Sweden began helping: @skr ‣ Now all at Twitter :)
  • 24. Rainbird ‣ It counts things. Really quickly. ‣ Layers on top of the distributed counters patch, CASSANDRA-1072
  • 25. Rainbird ‣ It counts things. Really quickly. ‣ Layers on top of the distributed counters patch, CASSANDRA-1072 ‣ Relies on Zookeeper, Cassandra, Scribe, Thrift ‣ Written in Scala
  • 26. Rainbird Design ‣ Aggregators buffer for 1m ‣ Intelligent flush to Cassandra ‣ Query servers read once written ‣ 1m is configurable
  • 27. Rainbird Data Structures struct Event { 1: i32 timestamp, 2: string category, 3: list<string> key, 4: i64 value, 5: optional set<Property> properties, 6: optional map<Property, i64> propertiesWithCounts }
  • 28. Rainbird Data Structures struct Event { Unix timestamp of event 1: i32 timestamp, 2: string category, 3: list<string> key, 4: i64 value, 5: optional set<Property> properties, 6: optional map<Property, i64> propertiesWithCounts }
  • 29. Rainbird Data Structures struct Event { Stat category name 1: i32 timestamp, 2: string category, 3: list<string> key, 4: i64 value, 5: optional set<Property> properties, 6: optional map<Property, i64> propertiesWithCounts }
  • 30. Rainbird Data Structures struct Event { Stat keys (hierarchical) 1: i32 timestamp, 2: string category, 3: list<string> key, 4: i64 value, 5: optional set<Property> properties, 6: optional map<Property, i64> propertiesWithCounts }
  • 31. Rainbird Data Structures struct Event { Actual count (diff) 1: i32 timestamp, 2: string category, 3: list<string> key, 4: i64 value, 5: optional set<Property> properties, 6: optional map<Property, i64> propertiesWithCounts }
  • 32. Rainbird Data Structures struct Event { More later 1: i32 timestamp, 2: string category, 3: list<string> key, 4: i64 value, 5: optional set<Property> properties, 6: optional map<Property, i64> propertiesWithCounts }
  • 33. Hierarchical Aggregation ‣ Say we’re counting Promoted Tweet impressions ‣ category = pti ‣ keys = [advertiser_id, campaign_id, tweet_id] ‣ count = 1 ‣ Rainbird automatically increments the count for ‣ [advertiser_id, campaign_id, tweet_id] ‣ [advertiser_id, campaign_id] ‣ [advertiser_id] ‣ Means fast queries over each level of hierarchy ‣ Configurable in rainbird.conf, or dynamically via ZK
  • 34. Hierarchical Aggregation ‣ Another example: tracking URL shortener tweets/clicks ‣ full URL = http://music.amazon.com/some_really_long_path ‣ keys = [com, amazon, music, full URL] ‣ count = 1 ‣ Rainbird automatically increments the count for ‣ [com, amazon, music, full URL] ‣ [com, amazon, music] ‣ [com, amazon] ‣ [com] ‣ Means we can count clicks on full URLs ‣ And automatically aggregate over domains and subdomains!
  • 35. Hierarchical Aggregation ‣ Another example: tracking URL shortener tweets/clicks ‣ full URL = http://music.amazon.com/some_really_long_path ‣ keys = [com, amazon, music, full URL] ‣ count = 1 ‣ Rainbird automatically increments the count for ‣ [com, amazon, music, full URL] ‣ [com, amazon, music] How many people tweeted ‣ [com, amazon] full URL? ‣ [com] ‣ Means we can count clicks on full URLs ‣ And automatically aggregate over domains and subdomains!
  • 36. Hierarchical Aggregation ‣ Another example: tracking URL shortener tweets/clicks ‣ full URL = http://music.amazon.com/some_really_long_path ‣ keys = [com, amazon, music, full URL] ‣ count = 1 ‣ Rainbird automatically increments the count for ‣ [com, amazon, music, full URL] ‣ [com, amazon, music] How many people tweeted ‣ [com, amazon] any music.amazon.com URL? ‣ [com] ‣ Means we can count clicks on full URLs ‣ And automatically aggregate over domains and subdomains!
  • 37. Hierarchical Aggregation ‣ Another example: tracking URL shortener tweets/clicks ‣ full URL = http://music.amazon.com/some_really_long_path ‣ keys = [com, amazon, music, full URL] ‣ count = 1 ‣ Rainbird automatically increments the count for ‣ [com, amazon, music, full URL] ‣ [com, amazon, music] How many people tweeted ‣ [com, amazon] any amazon.com URL? ‣ [com] ‣ Means we can count clicks on full URLs ‣ And automatically aggregate over domains and subdomains!
  • 38. Hierarchical Aggregation ‣ Another example: tracking URL shortener tweets/clicks ‣ full URL = http://music.amazon.com/some_really_long_path ‣ keys = [com, amazon, music, full URL] ‣ count = 1 ‣ Rainbird automatically increments the count for ‣ [com, amazon, music, full URL] ‣ [com, amazon, music] How many people tweeted ‣ [com, amazon] any .com URL? ‣ [com] ‣ Means we can count clicks on full URLs ‣ And automatically aggregate over domains and subdomains!
  • 39. Temporal Aggregation ‣ Rainbird also does (configurable) temporal aggregation ‣ Each count is kept minutely, but also denormalized hourly, daily, and all time ‣ Gives us quick counts at varying granularities with no large scans at read time ‣ Trading storage for latency
  • 40. Multiple Formulas ‣ So far we have talked about sums ‣ Could also store counts (1 for each event) ‣ ... which gives us a mean ‣ And sums of squares (count * count for each event) ‣ ... which gives us a standard deviation ‣ And min/max as well ‣ Configure this per-category in rainbird.conf
  • 41. Rainbird ‣ Write 100,000s of events per second, each with hierarchical structure ‣ Query with minutely granularity over any level of the hierarchy, get back a time series ‣ Or query all time values ‣ Or query all time means, standard deviations ‣ Latency < 100ms
  • 42. Agenda ‣ Why Real-time Analytics? ‣ Rainbird and Cassandra ‣ Production Uses at Twitter ‣ Open Source
  • 43. Production Uses ‣ It turns out we need to count things all the time ‣ As soon as we had this service, we started finding all sorts of use cases for it ‣ Promoted Products ‣ Tweeted URLs, by domain/subdomain ‣ Per-user Tweet interactions (fav, RT, follow) ‣ Arbitrary terms in Tweets ‣ Clicks on t.co URLs
  • 44. Use Cases ‣ Promoted Tweet Analytics
  • 45. Each different metric is part Production Uses of the key hierarchy ‣ Promoted Tweet Analytics
  • 46. Uses the temporal aggregation to quickly show Production Uses different levels of granularity ‣ Promoted Tweet Analytics
  • 47. Data can be historical, or Production Uses from 60 seconds ago ‣ Promoted Tweet Analytics
  • 48. Production Uses ‣ Internal Monitoring and Alerting ‣ We require operational reporting on all internal services ‣ Needs to be real-time, but also want longer-term aggregates ‣ Hierarchical, too: [stat, datacenter, service, machine]
  • 49. Production Uses ‣ Tweet Button Counts ‣ Tweet Button counts are requested many many times each day from across the web ‣ Uses the all time field
  • 50. Agenda ‣ Why Real-time Analytics? ‣ Rainbird and Cassandra ‣ Production Uses at Twitter ‣ Open Source
  • 52. Open Source? ‣ Yes! ... but not yet
  • 53. Open Source? ‣ Yes! ... but not yet ‣ Relies on unreleased version of Cassandra
  • 54. Open Source? ‣ Yes! ... but not yet ‣ Relies on unreleased version of Cassandra ‣ ... but the counters patch is committed in trunk (0.8)
  • 55. Open Source? ‣ Yes! ... but not yet ‣ Relies on unreleased version of Cassandra ‣ ... but the counters patch is committed in trunk (0.8) ‣ ... also relies on some internal frameworks we need to open source
  • 56. Open Source? ‣ Yes! ... but not yet ‣ Relies on unreleased version of Cassandra ‣ ... but the counters patch is committed in trunk (0.8) ‣ ... also relies on some internal frameworks we need to open source ‣ It will happen
  • 57. Open Source? ‣ Yes! ... but not yet ‣ Relies on unreleased version of Cassandra ‣ ... but the counters patch is committed in trunk (0.8) ‣ ... also relies on some internal frameworks we need to open source ‣ It will happen ‣ See http://github.com/twitter for proof of how much Twitter open source
  • 58. Team ‣ John Corwin (@johnxorz) ‣ Adam Samet (@damnitsamet) ‣ Johan Oskarsson (@skr) ‣ Kelvin Kakugawa (@kelvin) ‣ Chris Goffinet (@lenn0x) ‣ Steve Jiang (@sjiang) ‣ Kevin Weil (@kevinweil)
  • 59. If You Only Remember One Slide... ‣ Rainbird is a distributed, high-volume counting service built on top of Cassandra ‣ Write 100,000s events per second, query it with hierarchy and multiple time granularities, returns results in <100 ms ‣ Used by Twitter for multiple products internally, including our Promoted Products, operational monitoring and Tweet Button ‣ Will be open sourced so the community can use and improve it!
  • 60. Questions? Follow me: @kevinweil TM

Editor's Notes

  1. \n
  2. \n
  3. \n
  4. \n
  5. \n
  6. \n
  7. \n
  8. \n
  9. \n
  10. \n
  11. \n
  12. \n
  13. \n
  14. \n
  15. \n
  16. \n
  17. \n
  18. \n
  19. \n
  20. \n
  21. \n
  22. \n
  23. \n
  24. \n
  25. \n
  26. \n
  27. \n
  28. \n
  29. \n
  30. \n
  31. \n
  32. \n
  33. \n
  34. \n
  35. \n
  36. \n
  37. \n
  38. \n
  39. \n
  40. \n
  41. \n
  42. \n
  43. \n
  44. \n
  45. \n
  46. \n
  47. \n
  48. \n
  49. \n
  50. \n
  51. \n
  52. \n
  53. \n
  54. \n
  55. \n
  56. \n
  57. \n
  58. \n
  59. \n
  60. \n