Extracting Insights from Data at Twitter

Extracting Insights from Data at Twitter
Prasad Wagle
Technical Lead, Core Data and Metrics, Data Platform
twitter.com/prasadwagle
Jan 26, 2016

● What are the properties of Big Data at Twitter?
● Where do we store it and how do we process it?
● What do we learn from the data?
Overview of the talk

● Velocity: Rate at which data is created
○ 313 million monthly active users. (June 2016)
○ Hundreds of millions of Tweets are sent per day. TPS record:
one-second peak of 143,199 Tweets per second
○ 100 Billion interaction events per day
● Volume: 100s of petabytes of data
● Variety: Tweets, Users, Client events and many more
○ Client events logs have a unified Thrift format for wide variety of
application events
3Vs of Big Data @Twitter

Data Processing Big Picture
Production
systems
Batch
Scalding
Spark
Real-time
Heron
Lambda (Batch + Real-time)
Summingbird
TSAR
Interactive
Presto
Vertica
R
Custom
Dashboards
Tableau
Apache Zeppelin
Command line
tools
Batch
Hadoop
(HDFS
MapReduce)
Analytics Tools
Analytics Front-ends
Real-time
Eventbus,
Kafka
Streams
Data Abstraction Layer (DAL), Pipeline Orchestration

● Batch Processing Engine - Hadoop
● Real-time Processing Engine - Heron
● Core Data Libraries - Scalding, Summingbird, Tsar, Parquet
● Data Pipeline - Data Access Layer (DAL), Orchestration
● Interactive SQL - Presto, Vertica
● Data Visualization - Tableau, Apache Zeppelin
● Core Data and Metrics
Data Platform Projects

● Largest Hadoop clusters in the world, some > 10K nodes
● Store 100s of petabytes of data
● More than 100K daily jobs
● Improvements to open source hadoop software
● hRaven - tool that collects run time data of hadoop jobs and lets users
visualize job metrics
○ YARN Timelineserver is next-gen hRaven
● Log pipeline software (scribe -> HDFS)
○ Scribe is being replace by Flume
Hadoop

● Heron - a real-time, distributed, fault tolerant stream processing engine
● Successor of Storm, API compatible with Storm
● Analyze data as it is being produced
● > 400 real-time jobs, 500 B events / day processed, 25 - 200 ms latency
● Use cases
○ Real-time impression and engagement counts
○ Real-time trends, recommendations, spam detection
Real-time Processing

● Tools that make it easy to create MapReduce and Heron jobs
● Scalding
○ Scala DSL on top of Cascading
● Summingbird
○ Lambda architecture: real-time and batch
● Tsar: TimeSeries AggregatoR
○ DSL implemented on top of Summingbird
Core Data Libraries

● DAL is a service that simplifies the discovery, usage, and maintainability
of data
● Users work with logical datasets
● Physical dataset describes the serialization of a logical dataset to a
specific location (hadoop, vertica) and format
● Logical dataset can simultaneously exist in multiple places
● Users can use logical dataset name to consume data with different
tools like Scalding, Presto
Data Access Layer (DAL)

● Eagleeye web application is front-end for end users
● Users discover datasets with Eagleeye
● Eagleeye displays metadata like owners and schema
● Applications access to datasets is recorded
● Enables Eagleye to show dependency graphs for a dataset - jobs that
produce a dataset and jobs that consume it
Data Access Layer (DAL)

● Statebird service
○ Tracks state of batch jobs
○ Used to manage dependencies
Pipeline Orchestration

● Interactive means that results of a query are available in the range of
seconds to a few minutes
● SQL is still the lingua franca for ad hoc data analysis
● Vertica
○ Columnar architecture, high performance analytics queries
● Presto
○ Data in HDFS in Parquet format
Interactive SQL

● Custom Dashboards
● Apache Zeppelin Strengths
○ Notebook metaphor - notebook is a collection of notes, each note
is a collection of paragraphs (queries)
○ Web based report authoring, collaborative like Google docs
○ Very easy to create a note and then share it
○ > 2K notes, 18K queries
○ Supports JDBC (Presto, Vertica, MySQL)
○ Open source, Easy to add new interpreters like Scalding
Data Visualization

● Tableau Strengths
○ Easy to create reports, does not require SQL expertise
○ Built in analytics functions e.g. Rank, Percentile
○ Polished visualizations
○ Row level security
Data Visualization

● Big part of data analysis is data cleansing
● Makes sense to do this once
● Core Data
○ Create pipelines to create “verified” datasets like Users, Tweets,
Interactions
○ Reliable and easy to use
● Core Metrics
○ Create pipelines to compute Twitter’s important metrics
○ DAU, MAU, Tweet Impressions
Core Data and Metrics

● Analytics - Basic Counting
● A/B Testing
● Data Science - Custom analysis
● Data Science - Machine Learning
Data Processing

● Daily/Monthly Active Users
● Number of Tweets, Retweets, Likes
● Tweet Impressions
● Logic is relatively simple
● Challenges: scale and timeliness
○ Results for previous day should be available by 10 am
○ Some metrics are real-time
Basic Counting

● Goal: find the number of impressions and engagements for a tweet
● Real-time
● Used in analytics.twitter.com
Example - Counting Tweet Impressions

aggregate {
onKeys(
(TweetId)
) produce (
Count
) sinkTo (Manhattan)
} fromProducer {
ClientEventSource(“client_events”)
.filter { event => isImpressionEvent(event) }
.map { event =>
(event.timestamp, ImpressionAttributes(event.tweetId))
}
}
TSAR job
Dimension
Metric
Data Sink
Data Source

● TSAR job is converted to a Summingbird job
● Summingbird job creates
○ Real-time pipeline with Heron
○ Batch pipeline with Scalding
● Users access results using TSAR query service
● Write once, run batch and real-time
Example - Counting Tweet Impressions

● Experimentation is at the heart of Twitter’s product development cycle
● Expertise needed in Statistics and Technology
A/B Testing Framework

● Goal: informative experiment,
● Minimize false positive and false negative errors
● How many users do we need to sample?
● How long should we run the experiment?
A/B Testing Statistics

● Process 100 B events daily, compute intensive.
● Metrics computed using Scalding pipeline that combines client event
logs, internal user models, and other datasets.
● Lightweight statistics are computed in a streaming job using TSAR
running on Heron.
A/B Testing Technology

● Cause of spikes and dips in key metrics
● Growth Trends
○ By country, client
● Analysis to understand user behavior
○ Creators vs Consumers
○ Distribution of followers
○ User clusters
● Analysis to inform product feature decisions
Data Science - Custom Analysis

● Recommendations
○ Users: WTF - who to follow
○ Tweets: Algorithmic timeline
● Cortex, Deep learning based on Torch framework
○ Identify NSFW images
○ Recognize what is happening in live feeds
Data Science - Machine Learning

● Product Safety
○ Detect fake accounts
○ Detect tweet spam and abuse
● Ad Targeting
○ Promoted Trends, Accounts and Tweets
○ Show only if it is likely to be interesting and relevant to that user
○ Predict click probability using signals including what a user
chooses to follow, how they interact with a Tweet and what they
retweet
Machine Learning

● Systems (Hadoop, Vertica)
○ Necessary because higher level abstraction are leaky
● Programming (Scala, Scalding, SQL)
● Math (Statistics, Linear Algebra)
Ideal Talent Stack
Systems Programming Statistics
Data Engineers Data Scientists

Data Platform and Data Science
work hand-in-hand
to extract insights from Big Data at Twitter
Summary

● TSAR https://blog.twitter.com/2014/tsar-a-timeseries-aggregator
● DAL https://blog.twitter.com/2016/discovery-and-consumption-of-analytics-data-at-twitter
● Heron https://blog.twitter.com/2015/flying-faster-with-twitter-heron
● Heron http://www.slideshare.net/KarthikRamasamy3
● A/B testing https://blog.twitter.com/2015/twitter-experimentation-technical-overview
● A/B testing https://blog.twitter.com/2016/power-minimal-detectable-effect-and-bucket-size-estimation-in-ab-tests
● Algorithmic timeline: https://support.twitter.com/articles/164083
● Cortex https://www.technologyreview.com/s/601284/twitters-artificial-intelligence-knows-whats-happening-in-live-video-clips/
● Cortex https://www.wired.com/2015/07/twitters-new-ai-recognizes-porn-dont/
References

Extracting Insights from Data at Twitter

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Destaque

Destaque (20)

Semelhante a Extracting Insights from Data at Twitter

Semelhante a Extracting Insights from Data at Twitter (20)

Último

Último (20)

Extracting Insights from Data at Twitter