The "Big Data" Ecosystem at LinkedIn
SIGMOD 2013
Roshan Sumbaly, Jay Kreps, & Sam Shah
June 2013
LinkedIn: the professional profile of record
©2012 LinkedIn Corporation. All Rights Reserved. 2
225M members
225M member profiles
Applications
Application examples
• People You May Know (2 people)
• Year In Review Email (1 person, 1 month)
• Skills and Endorsements (2 people)
• Network Updates Digest (1 person, 3 months)
• Who's Viewed My Profile (2 people)
• Collaborative Filtering (1 person)
• Related Searches (1 person, 3 months)
• and more…
Skill sets
Rich Hadoop-based ecosystem
©2013 LinkedIn Corporation. All Rights Reserved. 6
“Last mile” problems
• Ingress
– Moving data from online to offline systems
• Workflow management
– Managing offline processes
• Egress
– Moving results from offline to online systems
  • Key/Value
  • Streams
  • OLAP
People You May Know
People You May Know – Workflow
• Inputs: connection stream, impression stream
• Perform triangle closing for all members
– Illustration: Ethan, Jacob and William, with two existing "connected" edges; triangle closing proposes the third
• Rank by discounting previously shown recommendations
• Push recommendations to online service
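The triangle-closing and ranking steps can be sketched on a toy graph. This is a minimal illustration, not the production job: the function names, the sample connections beyond the three names on the slide, and the multiplicative discount for previously shown recommendations are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def triangle_closing(connections):
    """Count common connections for every pair of members not yet connected."""
    neighbors = defaultdict(set)
    for a, b in connections:
        neighbors[a].add(b)
        neighbors[b].add(a)

    common = defaultdict(int)
    # Two neighbors of the same member form a triangle that may still be open.
    for member, friends in list(neighbors.items()):
        for u, v in combinations(sorted(friends), 2):
            if v not in neighbors[u]:        # only propose missing edges
                common[(u, v)] += 1
                common[(v, u)] += 1
    return common


def rank(common, impressions, discount=0.5):
    """Order candidates per member, discounting ones already shown.

    The multiplicative discount is an illustrative assumption.
    """
    scored = defaultdict(list)
    for (member, candidate), count in common.items():
        shown = impressions.get((member, candidate), 0)
        scored[member].append((count * discount ** shown, candidate))
    return {
        member: [c for _, c in sorted(items, key=lambda s: (-s[0], s[1]))]
        for member, items in scored.items()
    }


# Toy data: connection stream and impression stream.
connections = [("Ethan", "Jacob"), ("Jacob", "William"),
               ("Ethan", "Ava"), ("Ava", "William"), ("Jacob", "Noah")]
impressions = {("Ethan", "William"): 3}      # William was already shown to Ethan 3 times

common = triangle_closing(connections)
recommendations = rank(common, impressions)
print(recommendations["Ethan"])
```

Ethan and William share two connections, but because William has already been shown to Ethan three times, the discount moves Noah ahead of him in Ethan's list.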
Ingress - O(n²) data integration complexity
• Point to point
• Fragile, delayed and potentially lossy
• Non-standardized
Ingress - O(n) data integration
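The difference between the two integration styles can be checked with a quick count. This is a sketch under the assumption that each of the n systems both produces and consumes data; the function names are illustrative.

```python
def point_to_point_pipelines(num_systems: int) -> int:
    # Every system may need its own feed to every other system.
    return num_systems * (num_systems - 1)


def central_log_pipelines(num_systems: int) -> int:
    # Each system needs one connection to publish and one to subscribe.
    return 2 * num_systems


for n in (5, 10, 50):
    print(n, point_to_point_pipelines(n), central_log_pipelines(n))
```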
Ingress – Kafka
• Distributed and elastic
– Multi-broker system
• Categorized topics
– “PeopleYouMayKnowTopic”
– “ConnectionUpdateTopic”
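The idea of categorized topics can be illustrated with a toy in-memory model. This is not the Kafka client API: `ToyBroker`, its methods, and the message fields are hypothetical, and a single process stands in for the multi-broker system. It only shows that each topic is an append-only log and that a consumer keeps track of its own read offset.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a topic-based log; not the Kafka client API."""

    def __init__(self):
        self._logs = defaultdict(list)       # topic -> append-only list of messages

    def publish(self, topic, message):
        self._logs[topic].append(message)
        return len(self._logs[topic]) - 1    # offset of the new message

    def fetch(self, topic, offset, max_messages=100):
        """Return messages starting at `offset`; the consumer tracks its own offset."""
        return self._logs[topic][offset:offset + max_messages]


broker = ToyBroker()
broker.publish("ConnectionUpdateTopic", {"source": 1213, "dest": 1734})
broker.publish("ConnectionUpdateTopic", {"source": 1213, "dest": 1523})
broker.publish("PeopleYouMayKnowTopic", {"viewer": 1213, "shown": 6332})

offset = 0
batch = broker.fetch("ConnectionUpdateTopic", offset)
offset += len(batch)                         # the next fetch resumes after this batch
```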
Ingress
• Standardized schemas
– Avro
– Central repository
– Programmatic compatibility
• Audited
• ETL to Hadoop
[Diagram: People You May Know service, Kafka brokers (dev), Kafka brokers, Hadoop; topic: PeopleYouMayKnowTopic]
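A programmatic compatibility check might look roughly like the following. It is a simplified sketch that applies only one Avro schema-resolution rule (a field present in the reader schema but absent from the writer schema must have a default) and ignores type promotion, aliases and nested records. The `ConnectionUpdate` record and its fields are hypothetical.

```python
def can_read(reader_schema, writer_schema):
    """True if every reader field that the writer lacks declares a default."""
    writer_fields = {field["name"] for field in writer_schema["fields"]}
    return all(
        field["name"] in writer_fields or "default" in field
        for field in reader_schema["fields"]
    )


# Hypothetical record schemas in Avro's JSON form.
v1 = {"type": "record", "name": "ConnectionUpdate",
      "fields": [{"name": "source", "type": "long"},
                 {"name": "dest", "type": "long"}]}

# Adding a field with a default keeps old data readable.
v2 = dict(v1, fields=v1["fields"] + [
    {"name": "origin", "type": "string", "default": "unknown"}])

# Adding a field without a default does not.
v2_bad = dict(v1, fields=v1["fields"] + [
    {"name": "origin", "type": "string"}])

print(can_read(v2, v1), can_read(v2_bad, v1))
```

A central schema repository could run a check of this kind before accepting a new schema version, so that downstream Hadoop jobs keep reading older data.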
People You May Know – Workflow
• Inputs: connection stream, impression stream
• Perform triangle closing for all members
• Rank by discounting previously shown recommendations
• Push recommendations to online service
People You May Know – Workflow (in reality)
Workflow Management - Azkaban
• Dependency management
– Historical logs
• Diverse job types
– Pig, Hive, Java
• Scheduling
• Monitoring
• Visualization
• Configuration
• Retry/restart on failure
• Resource locking
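Dependency management and retry on failure can be sketched in a few lines. This is not Azkaban's implementation or job format: `run_workflow`, the job names and the retry policy are assumptions. It uses Python's standard `graphlib` module (3.9+) to order jobs so that each runs after the jobs it depends on.

```python
from graphlib import TopologicalSorter   # Python 3.9+


def run_workflow(jobs, dependencies, max_attempts=2):
    """Run jobs in dependency order, retrying each up to max_attempts times."""
    order = list(TopologicalSorter(dependencies).static_order())
    log = []
    for name in order:
        for attempt in range(1, max_attempts + 1):
            try:
                jobs[name]()
                log.append((name, attempt, "ok"))
                break
            except Exception as exc:
                log.append((name, attempt, f"failed: {exc}"))
        else:
            raise RuntimeError(f"{name} failed after {max_attempts} attempts")
    return log


attempts = {"rank": 0}


def flaky_rank():
    # Fails once to exercise the retry path.
    attempts["rank"] += 1
    if attempts["rank"] == 1:
        raise IOError("transient failure")


# Hypothetical job names; each key maps to the jobs it depends on.
dependencies = {"rank": {"triangle_closing"}, "push": {"rank"}}
jobs = {"triangle_closing": lambda: None, "rank": flaky_rank, "push": lambda: None}

log = run_workflow(jobs, dependencies)
for entry in log:
    print(entry)
```

The returned log plays the role of the historical record: it shows which job ran, on which attempt, and with what outcome.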
People You May Know – Workflow
• Inputs: connection stream, impression stream
• Perform triangle closing for all members
• Rank by discounting previously shown recommendations
• Push recommendations to online service
Member Id 1213 =>
[ Recommended member id 1734,
Recommended member id 1523
…
Recommended member id 6332 ]
Egress – Key/Value
• Voldemort
– Based on Amazon's Dynamo
• Distributed and elastic
• Horizontally scalable
• Bulk load pipeline from Hadoop
• Simple to use
store results into 'url' using KeyValue('member_id')
[Diagram: Hadoop batch-loads Voldemort; the People You May Know service calls getRecommendations(member id)]
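The batch-load path can be pictured with a toy store. This is not Voldemort's API: `ReadOnlyStore` is hypothetical and assumes a build-then-swap model, in which a full data set is built offline and then swapped in as a whole. The sample record uses only the member ids shown on the workflow slide.

```python
class ReadOnlyStore:
    """Toy key/value store that serves one immutable version at a time."""

    def __init__(self):
        self._previous = []                  # older versions kept for rollback
        self._current = {}

    def bulk_load(self, records):
        new_version = dict(records)          # built fully before it serves reads
        self._previous.append(self._current)
        self._current = new_version          # single reference swap

    def rollback(self):
        self._current = self._previous.pop()

    def get(self, key, default=None):
        return self._current.get(key, default)


store = ReadOnlyStore()
store.bulk_load({1213: [1734, 1523, 6332]})  # output of the offline job


def get_recommendations(member_id):
    """What the online service would call for each page view."""
    return store.get(member_id, [])


print(get_recommendations(1213))
```

Keeping the previous version around makes it cheap to roll back if a freshly loaded data set turns out to be bad.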
People You May Know - Summary
[Diagram: front end, People You May Know service, Kafka brokers, Kafka brokers (mirror), Hadoop, Voldemort; topic: PeopleYouMayKnowTopic]
Year In Review Email
-- Members whose position start date falls inside the review window
memberPosition = LOAD '$latest_positions' USING BinaryJSON;
memberWithPositionsChangedLastYear = FOREACH (
    FILTER memberPosition BY ((start_date >= $start_date_low) AND
                              (start_date <= $start_date_high))
) GENERATE member_id, start_date, end_date;

-- Connection edges whose dest is one of those members, de-duplicated
allConnections = LOAD '$latest_bidirectional_connections' USING BinaryJSON;
allConnectionsWithChange_nondistinct = FOREACH (
    JOIN memberWithPositionsChangedLastYear BY member_id,
         allConnections BY dest
) GENERATE allConnections::source AS source,
           allConnections::dest AS dest;
allConnectionsWithChange = DISTINCT allConnectionsWithChange_nondistinct;

-- Members with a cropped picture and picture privacy 'N' or 'E'
memberinfowpics = LOAD '$latest_memberinfowpics' USING BinaryJSON;
pictures = FOREACH (
    FILTER memberinfowpics BY
        ((cropped_picture_id is not null) AND
         ((member_picture_privacy == 'N') OR
          (member_picture_privacy == 'E')))
) GENERATE member_id, cropped_picture_id,
           first_name as dest_first_name, last_name as dest_last_name;

-- Attach picture and name of the job changer (dest) to each edge
resultPic = JOIN allConnectionsWithChange BY dest, pictures BY member_id;
connectionsWithChangeWithPic = FOREACH resultPic GENERATE
    allConnectionsWithChange::source AS source_id,
    allConnectionsWithChange::dest AS member_id,
    pictures::cropped_picture_id AS pic_id,
    pictures::dest_first_name AS dest_first_name,
    pictures::dest_last_name AS dest_last_name;

-- Attach name and email details of the recipient (source)
joinResult = JOIN connectionsWithChangeWithPic BY source_id,
                  memberinfowpics BY member_id;
withName = FOREACH joinResult GENERATE
    connectionsWithChangeWithPic::source_id AS source_id,
    connectionsWithChangeWithPic::member_id AS member_id,
    connectionsWithChangeWithPic::dest_first_name as first_name,
    connectionsWithChangeWithPic::dest_last_name as last_name,
    connectionsWithChangeWithPic::pic_id AS pic_id,
    memberinfowpics::first_name AS firstName,
    memberinfowpics::last_name AS lastName,
    memberinfowpics::gmt_offset as gmt_offset,
    memberinfowpics::email_locale as email_locale,
    memberinfowpics::email_address as email_address;

-- One group per recipient
resultGroup = GROUP withName BY (source_id, firstName,
    lastName, email_address, email_locale, gmt_offset);
-- Get the count of results per recipient
resultGroupCount = FOREACH resultGroup GENERATE group,
    withName as toomany, COUNT_STAR(withName) as num_results;
-- Keep recipients with more than 2 job changers, at most 64 listed each
resultGroupPre = filter resultGroupCount by num_results > 2;
-- Note: this re-binds the resultGroup alias defined above
resultGroup = FOREACH resultGroupPre {
    withName = LIMIT toomany 64;
    GENERATE group, withName, num_results;
}

-- Shape one record per recipient for the email service
x_in_review_pre_out = FOREACH resultGroup GENERATE
    FLATTEN(group) as (source_id, firstName, lastName,
                       email_address, email_locale, gmt_offset),
    withName.(member_id, pic_id, first_name, last_name) as jobChanger,
    '2013' as changeYear:chararray,
    num_results as num_results;
x_in_review = FOREACH x_in_review_pre_out GENERATE
    source_id as recipientID, gmt_offset as gmtOffset,
    firstName as first_name, lastName as last_name, email_address,
    email_locale,
    TOTUPLE(changeYear, source_id, firstName, lastName,
            num_results, jobChanger) as body;

-- Remove the path given by $xir, then publish the records through Kafka
rmf $xir;
STORE x_in_review INTO '$url' USING Kafka();
Year In Review Email – Workflow
• Find users that have changed jobs
• Join with connections and metadata (pictures)
• Group by connections of these users
• Push content to email service
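The same four steps can be traced on toy data in plain Python. This is a sketch mirroring the Pig script's logic, not a replacement for it: the function name, the record layout and the sample values are assumptions. Where the Pig script keeps an arbitrary 64 job changers per recipient, this version keeps the first 64 in sorted order.

```python
from collections import defaultdict


def year_in_review(positions, connections, members, start_low, start_high,
                   min_results=3, max_shown=64):
    # Step 1: members whose position started inside the review window.
    changed = {p["member_id"] for p in positions
               if start_low <= p["start_date"] <= start_high}
    # Step 2: metadata filter, picture present and privacy 'N' or 'E'.
    with_pic = {member_id for member_id, info in members.items()
                if info.get("cropped_picture_id") is not None
                and info.get("member_picture_privacy") in ("N", "E")}
    # Step 3: group job changers by the connection who will receive the email.
    job_changers = defaultdict(set)          # a set de-duplicates edges
    for source, dest in connections:
        if dest in changed and dest in with_pic and source in members:
            job_changers[source].add(dest)
    # Step 4: keep recipients with enough results (the script uses num_results > 2).
    emails = {}
    for recipient, changers in job_changers.items():
        if len(changers) >= min_results:
            emails[recipient] = {"num_results": len(changers),
                                 "jobChanger": sorted(changers)[:max_shown]}
    return emails


def member(pic_id, privacy="N"):
    return {"cropped_picture_id": pic_id, "member_picture_privacy": privacy}


positions = [{"member_id": 2, "start_date": "2012-03-01"},
             {"member_id": 3, "start_date": "2012-07-15"},
             {"member_id": 4, "start_date": "2012-11-30"},
             {"member_id": 5, "start_date": "2010-01-01"}]
members = {1: member(11), 2: member(12), 3: member(13, "E"),
           4: member(14), 5: member(15), 6: member(None)}
connections = [(1, 2), (1, 3), (1, 4), (1, 5), (6, 2), (6, 3), (1, 2)]

emails = year_in_review(positions, connections, members,
                        "2012-01-01", "2012-12-31")
print(emails)
```

Member 6 has only two connections who changed jobs, so no email is produced for that recipient.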
Egress - Streams
• Service acts as consumer
• “EmailContentTopic”
store emails into 'url' using Stream("topic=x")
[Diagram: Email service, Kafka brokers (mirror), Kafka brokers, Hadoop; topics: EmailSentTopic, EmailContentTopic]
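A service acting as a consumer might look roughly like this. The sketch uses Python's standard `queue.Queue` objects as stand-ins for the two topics rather than a Kafka client. It assumes that the email service reads from EmailContentTopic and reports back on EmailSentTopic, and the shape of the "sent" event is hypothetical. The message fields `recipientID`, `email_address` and `body` follow the Pig script's output.

```python
import queue


def email_service(content_topic, sent_topic, send):
    """Drain the content topic, send each email, then publish a 'sent' event."""
    while True:
        try:
            message = content_topic.get_nowait()
        except queue.Empty:
            break
        send(message["email_address"], message["body"])
        sent_topic.put({"recipientID": message["recipientID"]})


content = queue.Queue()      # stands in for EmailContentTopic
sent = queue.Queue()         # stands in for EmailSentTopic
outbox = []                  # records what the toy mailer "sent"

content.put({"recipientID": 1213,
             "email_address": "member@example.com",
             "body": ("2013", 1213)})

email_service(content, sent, lambda address, body: outbox.append((address, body)))
```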
Conclusion
• Hadoop: simple programming model, rich developer ecosystem
• Primitives for
– Ingress:
  • Structured, complete data available
  • Automatically handles data evolution
– Workflow management
  • Run and operate production processes
– Egress
  • 1-line command for exporting data
  • Horizontally scalable, little need for capacity planning
• Empowers data scientists to focus on new product ideas, not infrastructure
Future work: models of computation
• Graphs
• Distributed learning
– Alternating Direction Method of Multipliers (ADMM)
– Distributed Conjugate Gradient Descent (DCGD)
– Distributed L-BFGS
– Bayesian Distributed Learning (BDL)
• Near-line processing
data.linkedin.com

More Related Content

What's hot

Blazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & SparkBlazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & SparkMongoDB
 
LinkedIn Infrastructure (analytics@webscale, at fb 2013)
LinkedIn Infrastructure (analytics@webscale, at fb 2013)LinkedIn Infrastructure (analytics@webscale, at fb 2013)
LinkedIn Infrastructure (analytics@webscale, at fb 2013)Jun Rao
 
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]Shirshanka Das
 
Family tree of data – provenance and neo4j
Family tree of data – provenance and neo4jFamily tree of data – provenance and neo4j
Family tree of data – provenance and neo4jM. David Allen
 
LinkedIn Segmentation & Targeting Platform
LinkedIn Segmentation & Targeting PlatformLinkedIn Segmentation & Targeting Platform
LinkedIn Segmentation & Targeting PlatformHien Luu
 
Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn Amy W. Tang
 
Sharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data LessonsSharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data LessonsGeorge Stathis
 
Clickstream data with spark
Clickstream data with sparkClickstream data with spark
Clickstream data with sparkMarissa Saunders
 
LinkedIn Graph Presentation
LinkedIn Graph PresentationLinkedIn Graph Presentation
LinkedIn Graph PresentationAmy W. Tang
 
Webinar: “ditch Oracle NOW”: Best Practices for Migrating to MongoDB
 Webinar: “ditch Oracle NOW”: Best Practices for Migrating to MongoDB Webinar: “ditch Oracle NOW”: Best Practices for Migrating to MongoDB
Webinar: “ditch Oracle NOW”: Best Practices for Migrating to MongoDBMongoDB
 
Webinar: An Enterprise Architect’s View of MongoDB
Webinar: An Enterprise Architect’s View of MongoDBWebinar: An Enterprise Architect’s View of MongoDB
Webinar: An Enterprise Architect’s View of MongoDBMongoDB
 
The year of the graph: do you really need a graph database? How do you choose...
The year of the graph: do you really need a graph database? How do you choose...The year of the graph: do you really need a graph database? How do you choose...
The year of the graph: do you really need a graph database? How do you choose...George Anadiotis
 
Introduction: Relational to Graphs
Introduction: Relational to GraphsIntroduction: Relational to Graphs
Introduction: Relational to GraphsNeo4j
 
Neo4J : Introduction to Graph Database
Neo4J : Introduction to Graph DatabaseNeo4J : Introduction to Graph Database
Neo4J : Introduction to Graph DatabaseMindfire Solutions
 
An Introduction to NOSQL, Graph Databases and Neo4j
An Introduction to NOSQL, Graph Databases and Neo4jAn Introduction to NOSQL, Graph Databases and Neo4j
An Introduction to NOSQL, Graph Databases and Neo4jDebanjan Mahata
 
Big data & hadoop framework
Big data & hadoop frameworkBig data & hadoop framework
Big data & hadoop frameworkTu Pham
 
Data Infrastructure at LinkedIn
Data Infrastructure at LinkedInData Infrastructure at LinkedIn
Data Infrastructure at LinkedInAmy W. Tang
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...MongoDB
 
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedInA Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedInAmy W. Tang
 
My other computer is a datacentre - 2012 edition
My other computer is a datacentre - 2012 editionMy other computer is a datacentre - 2012 edition
My other computer is a datacentre - 2012 editionSteve Loughran
 

What's hot (20)

Blazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & SparkBlazing Fast Analytics with MongoDB & Spark
Blazing Fast Analytics with MongoDB & Spark
 
LinkedIn Infrastructure (analytics@webscale, at fb 2013)
LinkedIn Infrastructure (analytics@webscale, at fb 2013)LinkedIn Infrastructure (analytics@webscale, at fb 2013)
LinkedIn Infrastructure (analytics@webscale, at fb 2013)
 
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]
 
Family tree of data – provenance and neo4j
Family tree of data – provenance and neo4jFamily tree of data – provenance and neo4j
Family tree of data – provenance and neo4j
 
LinkedIn Segmentation & Targeting Platform
LinkedIn Segmentation & Targeting PlatformLinkedIn Segmentation & Targeting Platform
LinkedIn Segmentation & Targeting Platform
 
Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn
 
Sharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data LessonsSharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data Lessons
 
Clickstream data with spark
Clickstream data with sparkClickstream data with spark
Clickstream data with spark
 
LinkedIn Graph Presentation
LinkedIn Graph PresentationLinkedIn Graph Presentation
LinkedIn Graph Presentation
 
Webinar: “ditch Oracle NOW”: Best Practices for Migrating to MongoDB
 Webinar: “ditch Oracle NOW”: Best Practices for Migrating to MongoDB Webinar: “ditch Oracle NOW”: Best Practices for Migrating to MongoDB
Webinar: “ditch Oracle NOW”: Best Practices for Migrating to MongoDB
 
Webinar: An Enterprise Architect’s View of MongoDB
Webinar: An Enterprise Architect’s View of MongoDBWebinar: An Enterprise Architect’s View of MongoDB
Webinar: An Enterprise Architect’s View of MongoDB
 
The year of the graph: do you really need a graph database? How do you choose...
The year of the graph: do you really need a graph database? How do you choose...The year of the graph: do you really need a graph database? How do you choose...
The year of the graph: do you really need a graph database? How do you choose...
 
Introduction: Relational to Graphs
Introduction: Relational to GraphsIntroduction: Relational to Graphs
Introduction: Relational to Graphs
 
Neo4J : Introduction to Graph Database
Neo4J : Introduction to Graph DatabaseNeo4J : Introduction to Graph Database
Neo4J : Introduction to Graph Database
 
An Introduction to NOSQL, Graph Databases and Neo4j
An Introduction to NOSQL, Graph Databases and Neo4jAn Introduction to NOSQL, Graph Databases and Neo4j
An Introduction to NOSQL, Graph Databases and Neo4j
 
Big data & hadoop framework
Big data & hadoop frameworkBig data & hadoop framework
Big data & hadoop framework
 
Data Infrastructure at LinkedIn
Data Infrastructure at LinkedInData Infrastructure at LinkedIn
Data Infrastructure at LinkedIn
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
 
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedInA Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
 
My other computer is a datacentre - 2012 edition
My other computer is a datacentre - 2012 editionMy other computer is a datacentre - 2012 edition
My other computer is a datacentre - 2012 edition
 

Similar to The "Big Data" Ecosystem at LinkedIn

Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataAbhishek M Shivalingaiah
 
ATAGTR2017 HikeRunner: Load Test Framework
ATAGTR2017 HikeRunner: Load Test FrameworkATAGTR2017 HikeRunner: Load Test Framework
ATAGTR2017 HikeRunner: Load Test FrameworkAgile Testing Alliance
 
From discovering to trusting data
From discovering to trusting dataFrom discovering to trusting data
From discovering to trusting datamarkgrover
 
Cassandra & puppet, scaling data at $15 per month
Cassandra & puppet, scaling data at $15 per monthCassandra & puppet, scaling data at $15 per month
Cassandra & puppet, scaling data at $15 per monthdaveconnors
 
How we (Almost) Forgot Lambda Architecture and used Elasticsearch
How we (Almost) Forgot Lambda Architecture and used ElasticsearchHow we (Almost) Forgot Lambda Architecture and used Elasticsearch
How we (Almost) Forgot Lambda Architecture and used ElasticsearchMichael Stockerl
 
2018-10-17 J1 6D - Draw your imagination with Microsoft Graph API - Dipti Chh...
2018-10-17 J1 6D - Draw your imagination with Microsoft Graph API - Dipti Chh...2018-10-17 J1 6D - Draw your imagination with Microsoft Graph API - Dipti Chh...
2018-10-17 J1 6D - Draw your imagination with Microsoft Graph API - Dipti Chh...Modern Workplace Conference Paris
 
Digital analytics with R - Sydney Users of R Forum - May 2015
Digital analytics with R - Sydney Users of R Forum - May 2015Digital analytics with R - Sydney Users of R Forum - May 2015
Digital analytics with R - Sydney Users of R Forum - May 2015Johann de Boer
 
Consuming Data From Many Platforms: The Benefits of OData - St. Louis Day of ...
Consuming Data From Many Platforms: The Benefits of OData - St. Louis Day of ...Consuming Data From Many Platforms: The Benefits of OData - St. Louis Day of ...
Consuming Data From Many Platforms: The Benefits of OData - St. Louis Day of ...Eric D. Boyd
 
Canarie Federated Non Web Signon
Canarie Federated Non Web SignonCanarie Federated Non Web Signon
Canarie Federated Non Web SignonChris Phillips
 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science ChallengeMark Nichols, P.E.
 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupDoug Needham
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" Joshua Bloom
 
Data science and OSS
Data science and OSSData science and OSS
Data science and OSSKevin Crocker
 
Sentiment Analysis in Dynamics CRM using Azure Text Analytics
Sentiment Analysis in Dynamics CRM using Azure Text AnalyticsSentiment Analysis in Dynamics CRM using Azure Text Analytics
Sentiment Analysis in Dynamics CRM using Azure Text AnalyticsLucas Alexander
 
Eagle6 mongo dc revised
Eagle6 mongo dc revisedEagle6 mongo dc revised
Eagle6 mongo dc revisedMongoDB
 
Eagle6 Enterprise Situational Awareness
Eagle6 Enterprise Situational AwarenessEagle6 Enterprise Situational Awareness
Eagle6 Enterprise Situational AwarenessMongoDB
 
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...Connected Data World
 
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon
 
PGQL: A Language for Graphs
PGQL: A Language for GraphsPGQL: A Language for Graphs
PGQL: A Language for GraphsJean Ihm
 
Redshift at Lightspeed: How to continuously optimize and modify Redshift sche...
Redshift at Lightspeed: How to continuously optimize and modify Redshift sche...Redshift at Lightspeed: How to continuously optimize and modify Redshift sche...
Redshift at Lightspeed: How to continuously optimize and modify Redshift sche...Amazon Web Services
 

Similar to The "Big Data" Ecosystem at LinkedIn (20)

Cloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big DataCloudera Movies Data Science Project On Big Data
Cloudera Movies Data Science Project On Big Data
 
ATAGTR2017 HikeRunner: Load Test Framework
ATAGTR2017 HikeRunner: Load Test FrameworkATAGTR2017 HikeRunner: Load Test Framework
ATAGTR2017 HikeRunner: Load Test Framework
 
From discovering to trusting data
From discovering to trusting dataFrom discovering to trusting data
From discovering to trusting data
 
Cassandra & puppet, scaling data at $15 per month
Cassandra & puppet, scaling data at $15 per monthCassandra & puppet, scaling data at $15 per month
Cassandra & puppet, scaling data at $15 per month
 
How we (Almost) Forgot Lambda Architecture and used Elasticsearch
How we (Almost) Forgot Lambda Architecture and used ElasticsearchHow we (Almost) Forgot Lambda Architecture and used Elasticsearch
How we (Almost) Forgot Lambda Architecture and used Elasticsearch
 
2018-10-17 J1 6D - Draw your imagination with Microsoft Graph API - Dipti Chh...
2018-10-17 J1 6D - Draw your imagination with Microsoft Graph API - Dipti Chh...2018-10-17 J1 6D - Draw your imagination with Microsoft Graph API - Dipti Chh...
2018-10-17 J1 6D - Draw your imagination with Microsoft Graph API - Dipti Chh...
 
Digital analytics with R - Sydney Users of R Forum - May 2015
Digital analytics with R - Sydney Users of R Forum - May 2015Digital analytics with R - Sydney Users of R Forum - May 2015
Digital analytics with R - Sydney Users of R Forum - May 2015
 
Consuming Data From Many Platforms: The Benefits of OData - St. Louis Day of ...
Consuming Data From Many Platforms: The Benefits of OData - St. Louis Day of ...Consuming Data From Many Platforms: The Benefits of OData - St. Louis Day of ...
Consuming Data From Many Platforms: The Benefits of OData - St. Louis Day of ...
 
Canarie Federated Non Web Signon
Canarie Federated Non Web SignonCanarie Federated Non Web Signon
Canarie Federated Non Web Signon
 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science Challenge
 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup Group
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning"
 
Data science and OSS
Data science and OSSData science and OSS
Data science and OSS
 
Sentiment Analysis in Dynamics CRM using Azure Text Analytics
Sentiment Analysis in Dynamics CRM using Azure Text AnalyticsSentiment Analysis in Dynamics CRM using Azure Text Analytics
Sentiment Analysis in Dynamics CRM using Azure Text Analytics
 
Eagle6 mongo dc revised
Eagle6 mongo dc revisedEagle6 mongo dc revised
Eagle6 mongo dc revised
 
Eagle6 Enterprise Situational Awareness
Eagle6 Enterprise Situational AwarenessEagle6 Enterprise Situational Awareness
Eagle6 Enterprise Situational Awareness
 
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...
 
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
 
PGQL: A Language for Graphs
PGQL: A Language for GraphsPGQL: A Language for Graphs
PGQL: A Language for Graphs
 
Redshift at Lightspeed: How to continuously optimize and modify Redshift sche...
Redshift at Lightspeed: How to continuously optimize and modify Redshift sche...Redshift at Lightspeed: How to continuously optimize and modify Redshift sche...
Redshift at Lightspeed: How to continuously optimize and modify Redshift sche...
 

Recently uploaded

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 

The "Big Data" Ecosystem at LinkedIn

  • 1. The "Big Data" Ecosystem at LinkedIn. SIGMOD 2013. Roshan Sumbaly, Jay Kreps, & Sam Shah. June 2013.
  • 2. LinkedIn: the professional profile of record. 225M members, 225M member profiles. ©2012 LinkedIn Corporation. All Rights Reserved.
  • 4. Application examples: People You May Know (2 people); Year In Review Email (1 person, 1 month); Skills and Endorsements (2 people); Network Updates Digest (1 person, 3 months); Who's Viewed My Profile (2 people); Collaborative Filtering (1 person); Related Searches (1 person, 3 months); and more…
  • 6. Rich Hadoop-based ecosystem. ©2013 LinkedIn Corporation. All Rights Reserved.
  • 7. “Last mile” problems: Ingress (moving data from online to offline systems); Workflow management (managing offline processes); Egress (moving results from offline to online systems), covering Key/Value, Streams, and OLAP.
  • 8. Application examples: People You May Know (2 people); Year In Review Email (1 person, 1 month); Skills and Endorsements (2 people); Network Updates Digest (1 person, 3 months); Who's Viewed My Profile (2 people); Collaborative Filtering (1 person); Related Searches (1 person, 3 months); and more…
  • 10. People You May Know – Workflow. Inputs: connection stream, impression stream. Steps: perform triangle closing for all members (diagram: Ethan, Jacob, and William, with two "connected" edges closed into a triangle); rank by discounting previously shown recommendations; push recommendations to online service.
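The triangle-closing step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not LinkedIn's Hadoop job: the function name `triangle_closing` and the edge-list input are hypothetical, and the example assumes Jacob is the connection shared by Ethan and William.

```python
from collections import defaultdict
from itertools import combinations


def triangle_closing(edges):
    """Count common connections between members who are not yet connected.

    `edges` is an iterable of undirected (member, member) pairs.
    Returns {member: {candidate: number_of_common_connections}}.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    common = defaultdict(lambda: defaultdict(int))
    # Every pair of members that share a connection forms an open triangle.
    for shared, friends in neighbors.items():
        for a, b in combinations(sorted(friends), 2):
            if b not in neighbors[a]:  # skip pairs that are already connected
                common[a][b] += 1
                common[b][a] += 1
    return {member: dict(cands) for member, cands in common.items()}


# Assumed reading of the slide's diagram: Jacob connects Ethan and William.
edges = [("Ethan", "Jacob"), ("Jacob", "William")]
print(triangle_closing(edges))
```

The counts act as raw scores: the more connections two members share, the stronger the candidate recommendation.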
  • 11. “Last mile” problems: Ingress (moving data from online to offline systems); Workflow management (managing offline processes); Egress (moving results from offline to online systems), covering Key/Value, Streams, and OLAP.
  • 12. Ingress - O(n²) data integration complexity: point to point; fragile, delayed, and potentially lossy; non-standardized.
  • 13. Ingress - O(n) data integration
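The difference between the two slides comes down to counting pipelines. The sketch below is illustrative arithmetic only; the function names are hypothetical, and it assumes n data sources and n destinations.

```python
def point_to_point_pipelines(sources: int, sinks: int) -> int:
    """Every source is wired directly to every sink."""
    return sources * sinks


def central_log_pipelines(sources: int, sinks: int) -> int:
    """Every source and every sink connects once to a shared log."""
    return sources + sinks


for n in (2, 5, 10, 50):
    print(n, point_to_point_pipelines(n, n), central_log_pipelines(n, n))
```

With n systems on each side, direct wiring needs n × n pipelines while a central log needs 2n, which is the O(n²) versus O(n) contrast in the slide titles.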
  • 14. Ingress – Kafka: distributed and elastic (multi-broker system); categorized topics (“PeopleYouMayKnowTopic”, “ConnectionUpdateTopic”).
  • 15. Ingress: standardized schemas (Avro, central repository, programmatic compatibility); audited; ETL to Hadoop. Diagram: People You May Know service, Kafka brokers (dev), Kafka brokers, Hadoop, PeopleYouMayKnowTopic.
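"Programmatic compatibility" can be pictured as an automated check run before a new schema version is accepted. The sketch below is a simplified, hypothetical check on Avro-style record schemas held as plain dicts. It does not use the Avro library and ignores type promotion and nested records. It applies one rule from Avro schema resolution: a field the reader expects but the writer never wrote must carry a default.

```python
def can_read(reader: dict, writer: dict) -> bool:
    """Return True if data written with `writer` can be read with `reader`.

    Simplified rule set (an assumption, not the full Avro specification):
    - a field present in both schemas must have the same type;
    - a field present only in the reader must declare a default;
    - a field present only in the writer is ignored.
    """
    writer_fields = {field["name"]: field for field in writer["fields"]}
    for field in reader["fields"]:
        old = writer_fields.get(field["name"])
        if old is None:
            if "default" not in field:
                return False
        elif old["type"] != field["type"]:
            return False
    return True


# Hypothetical schema versions for a connection-update event.
v1 = {"type": "record", "name": "ConnectionUpdate",
      "fields": [{"name": "member_id", "type": "long"}]}
v2 = {"type": "record", "name": "ConnectionUpdate",
      "fields": [{"name": "member_id", "type": "long"},
                 {"name": "source", "type": "string", "default": "unknown"}]}

print(can_read(v2, v1))  # new readers can still read old data
```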
  • 16. “Last mile” problems: Ingress (moving data from online to offline systems); Workflow management (managing offline processes); Egress (moving results from offline to online systems), covering Key/Value, Streams, and OLAP.
  • 17. People You May Know – Workflow. Inputs: connection stream, impression stream. Steps: perform triangle closing for all members; rank by discounting previously shown recommendations; push recommendations to online service.
  • 18. People You May Know – Workflow (in reality)
  • 19. Workflow Management - Azkaban: dependency management (historical logs); diverse job types (Pig, Hive, Java); scheduling; monitoring; visualization; configuration; retry/restart on failure; resource locking.
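Two of the features listed above, dependency management and retry on failure, can be sketched with the Python standard library. This is not Azkaban's API or job format; `run_flow`, the job names, and the `max_attempts` parameter are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def run_flow(jobs, dependencies, max_attempts=2):
    """Run jobs in dependency order, retrying each failed job.

    `jobs` maps a job name to a zero-argument callable.
    `dependencies` maps a job name to the set of jobs it depends on.
    Returns a log of (job, attempt, status) tuples.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    log = []
    for name in order:
        for attempt in range(1, max_attempts + 1):
            try:
                jobs[name]()
                log.append((name, attempt, "ok"))
                break
            except Exception as exc:
                log.append((name, attempt, f"failed: {exc}"))
        else:
            # No attempt succeeded, so downstream jobs must not run.
            raise RuntimeError(f"{name} failed after {max_attempts} attempts")
    return log


# Hypothetical flow modeled on the People You May Know steps.
jobs = {
    "triangle_closing": lambda: None,
    "rank": lambda: None,
    "push": lambda: None,
}
dependencies = {"rank": {"triangle_closing"}, "push": {"rank"}}
for entry in run_flow(jobs, dependencies):
    print(entry)
```

The returned log is a small stand-in for the historical logs mentioned on the slide.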
  • 20. People You May Know – Workflow. Inputs: connection stream, impression stream. Steps: perform triangle closing for all members; rank by discounting previously shown recommendations; push recommendations to online service. Output: Member Id 1213 => [ Recommended member id 1734, Recommended member id 1523, …, Recommended member id 6332 ].
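The ranking step and the member-to-list output shape can be sketched as follows. The slides do not give the discount formula, so the exponential discount below, the `discount` parameter, and the input scores are assumptions made purely for illustration.

```python
def rank_with_discount(scores, impressions, discount=0.5):
    """Order each member's candidates, penalizing ones already shown.

    `scores` maps member -> {candidate: raw_score}.
    `impressions` maps member -> {candidate: times_previously_shown}.
    Each prior impression multiplies the score by `discount` (assumed rule).
    Returns member -> [candidate ids, best first].
    """
    ranked = {}
    for member, candidates in scores.items():
        seen = impressions.get(member, {})
        adjusted = {
            cand: score * discount ** seen.get(cand, 0)
            for cand, score in candidates.items()
        }
        # Highest adjusted score first; ties broken by candidate id.
        ranked[member] = sorted(adjusted, key=lambda c: (-adjusted[c], c))
    return ranked


# Hypothetical scores; member 6332 was already shown twice to member 1213.
scores = {1213: {1734: 3, 1523: 2, 6332: 4}}
impressions = {1213: {6332: 2}}
print(rank_with_discount(scores, impressions))
```

The result has the same key-to-list shape as the slide's example, which is what makes it straightforward to load into a key/value store.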
  • 21. “Last mile” problems: Ingress (moving data from online to offline systems); Workflow management (managing offline processes); Egress (moving results from offline to online systems), covering Key/Value, Streams, and OLAP.
  • 22. Egress – Key/Value: Voldemort (based on Amazon's Dynamo); distributed and elastic; horizontally scalable; bulk load pipeline from Hadoop; simple to use: store results into 'url' using KeyValue('member_id'). Diagram: Hadoop, batch load, Voldemort, People You May Know service, getRecommendations(member id).
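The bulk-load idea can be sketched as a partitioned, read-only store whose contents are rebuilt offline and then swapped in as a whole. This is a toy model, not Voldemort's implementation or client API: `ReadOnlyStore` and `partition_for` are hypothetical names, and simple modulo hashing stands in for the consistent hashing that Dynamo-style stores use.

```python
import hashlib


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (modulo, for simplicity)."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


class ReadOnlyStore:
    """Toy key/value store that is only updated by whole-dataset bulk loads."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.partitions = [{} for _ in range(num_partitions)]

    def bulk_load(self, records):
        # Build the new version off to the side, then swap it in at once.
        fresh = [{} for _ in range(self.num_partitions)]
        for key, value in records.items():
            fresh[partition_for(key, self.num_partitions)][key] = value
        self.partitions = fresh

    def get(self, key, default=None):
        partition = self.partitions[partition_for(key, self.num_partitions)]
        return partition.get(key, default)


store = ReadOnlyStore(num_partitions=4)
store.bulk_load({1213: [1734, 1523, 6332]})
print(store.get(1213))
```

Here `get` plays the role of the slide's getRecommendations(member id) call: the online service only reads, and each batch run replaces the previous dataset.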
  • 23. People You May Know - Summary. Diagram: People You May Know service, Kafka brokers (mirror), Kafka brokers, Hadoop, PeopleYouMayKnowTopic, Voldemort, front end.
  • 24. Application examples: People You May Know (2 people); Year In Review Email (1 person, 1 month); Skills and Endorsements (2 people); Network Updates Digest (1 person, 3 months); Who's Viewed My Profile (2 people); Collaborative Filtering (1 person); Related Searches (1 person, 3 months); and more…
  • 26. Year In Review Email

      -- Members whose position started within the date window
      memberPosition = LOAD '$latest_positions' USING BinaryJSON;
      memberWithPositionsChangedLastYear = FOREACH (
          FILTER memberPosition BY ((start_date >= $start_date_low) AND (start_date <= $start_date_high))
        ) GENERATE member_id, start_date, end_date;

      -- Connections (source) of each member who changed position (dest)
      allConnections = LOAD '$latest_bidirectional_connections' USING BinaryJSON;
      allConnectionsWithChange_nondistinct = FOREACH (
          JOIN memberWithPositionsChangedLastYear BY member_id, allConnections BY dest
        ) GENERATE allConnections::source AS source, allConnections::dest AS dest;
      allConnectionsWithChange = DISTINCT allConnectionsWithChange_nondistinct;

      -- Keep only job changers with a cropped picture and privacy setting 'N' or 'E'
      memberinfowpics = LOAD '$latest_memberinfowpics' USING BinaryJSON;
      pictures = FOREACH (
          FILTER memberinfowpics BY ((cropped_picture_id is not null) AND
            ((member_picture_privacy == 'N') OR (member_picture_privacy == 'E')))
        ) GENERATE member_id, cropped_picture_id,
            first_name as dest_first_name, last_name as dest_last_name;
      resultPic = JOIN allConnectionsWithChange BY dest, pictures BY member_id;
      connectionsWithChangeWithPic = FOREACH resultPic GENERATE
          allConnectionsWithChange::source AS source_id,
          allConnectionsWithChange::dest AS member_id,
          pictures::cropped_picture_id AS pic_id,
          pictures::dest_first_name AS dest_first_name,
          pictures::dest_last_name AS dest_last_name;

      -- Attach the recipient's own name, locale, and email address
      joinResult = JOIN connectionsWithChangeWithPic BY source_id, memberinfowpics BY member_id;
      withName = FOREACH joinResult GENERATE
          connectionsWithChangeWithPic::source_id AS source_id,
          connectionsWithChangeWithPic::member_id AS member_id,
          connectionsWithChangeWithPic::dest_first_name as first_name,
          connectionsWithChangeWithPic::dest_last_name as last_name,
          connectionsWithChangeWithPic::pic_id AS pic_id,
          memberinfowpics::first_name AS firstName,
          memberinfowpics::last_name AS lastName,
          memberinfowpics::gmt_offset as gmt_offset,
          memberinfowpics::email_locale as email_locale,
          memberinfowpics::email_address as email_address;
      resultGroup = GROUP withName BY (source_id, firstName, lastName, email_address, email_locale, gmt_offset);

      -- Get the count of results per recipient
      resultGroupCount = FOREACH resultGroup GENERATE group, withName as toomany, COUNT_STAR(withName) as num_results;
      resultGroupPre = filter resultGroupCount by num_results > 2;
      resultGroup = FOREACH resultGroupPre {
          withName = LIMIT toomany 64;
          GENERATE group, withName, num_results;
      }
      x_in_review_pre_out = FOREACH resultGroup GENERATE
          FLATTEN(group) as (source_id, firstName, lastName, email_address, email_locale, gmt_offset),
          withName.(member_id, pic_id, first_name, last_name) as jobChanger,
          '2013' as changeYear:chararray,
          num_results as num_results;
      x_in_review = FOREACH x_in_review_pre_out GENERATE
          source_id as recipientID,
          gmt_offset as gmtOffset,
          firstName as first_name,
          lastName as last_name,
          email_address,
          email_locale,
          TOTUPLE(changeYear, source_id, firstName, lastName, num_results, jobChanger) as body;

      rmf $xir;
      STORE x_in_review INTO '$url' USING Kafka();
  • 27. Year In Review Email – Workflow: find users that have changed jobs; join with connections and metadata (pictures); group by connections of these users; push content to email service.
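The first three workflow steps can be mirrored on in-memory data to make the shape of the computation easier to see. This is a rough Python analogue of the Pig script on slide 26, not a replacement for it. The function name, argument names, and tuple layouts are hypothetical, while the thresholds follow the script (more than 2 results per recipient, at most 64 listed).

```python
from collections import defaultdict


def year_in_review(positions, connections, pictures, start, end,
                   min_results=3, max_shown=64):
    """Group job changers by the connection who should hear about them.

    `positions` is an iterable of (member_id, start_date) pairs.
    `connections` is an iterable of (source, dest) pairs.
    `pictures` is the set of member ids with a usable picture.
    Dates are compared as ISO-8601 strings.
    """
    # Step 1: members whose position started inside the window.
    changers = {member for member, started in positions if start <= started <= end}

    # Step 2: join with connections and picture metadata.
    by_recipient = defaultdict(set)
    for source, dest in connections:
        if dest in changers and dest in pictures:
            by_recipient[source].add(dest)

    # Step 3: keep recipients with enough results and cap the list length.
    emails = {}
    for recipient, changed in by_recipient.items():
        if len(changed) >= min_results:
            emails[recipient] = {
                "num_results": len(changed),
                "job_changers": sorted(changed)[:max_shown],
            }
    return emails


positions = [(1, "2013-03-01"), (2, "2013-06-15"), (3, "2013-09-01"), (4, "2011-01-01")]
connections = [(10, 1), (10, 2), (10, 3), (10, 4), (11, 1), (11, 2)]
print(year_in_review(positions, connections, {1, 2, 3, 4}, "2013-01-01", "2013-12-31"))
```

The final step, pushing each record to the email service, corresponds to the script's STORE statement and is left out here.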
  • 28. “Last mile” problems: Ingress (moving data from online to offline systems); Workflow management (managing offline processes); Egress (moving results from offline to online systems), covering Key/Value, Streams, and OLAP.
  • 29. Egress - Streams: service acts as consumer; “EmailContentTopic”; store emails into 'url' using Stream("topic=x"). Diagram (shown twice, once per topic): email service, Kafka brokers (mirror), Kafka brokers, Hadoop, with EmailSentTopic and EmailContentTopic.
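"Service acts as consumer" can be illustrated with an in-memory stand-in for a topic log. This is not the Kafka client API: the `Broker` and `Consumer` classes are hypothetical and model only the idea that messages are appended to a named topic and each consumer tracks its own read position.

```python
from collections import defaultdict


class Broker:
    """Toy broker: each topic is an append-only list of messages."""

    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, message):
        self.topics[topic].append(message)
        return len(self.topics[topic]) - 1  # offset of the new message

    def read(self, topic, offset, max_messages=10):
        return self.topics[topic][offset:offset + max_messages]


class Consumer:
    """Reads one topic and remembers how far it has read."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self.broker.read(self.topic, self.offset, max_messages)
        self.offset += len(batch)
        return batch


broker = Broker()
# The offline job publishes email content; the email service consumes it.
broker.publish("EmailContentTopic", {"recipientID": 10, "num_results": 3})
email_service = Consumer(broker, "EmailContentTopic")
print(email_service.poll())
```

Because the position lives in the consumer, a second service can read the same topic independently without affecting the first.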
  • 30. Conclusion. Hadoop: simple programmatic model, rich developer ecosystem. Primitives for ingress (structured, complete data available; automatically handles data evolution), workflow management (run and operate production processes), and egress (1-line command for exporting data; horizontally scalable, little need for capacity planning). Empowers data scientists to focus on new product ideas, not infrastructure.
  • 31. Future work: models of computation. Topics: graphs, distributed learning, near-line processing. Methods listed: Alternating Direction Method of Multipliers (ADMM); Distributed Conjugate Gradient Descent (DCGD); Distributed L-BFGS; Bayesian Distributed Learning (BDL).