SlideShare a Scribd company logo
1 of 40
Download to read offline
Data Infrastructure at LinkedIn
Shirshanka Das
XLDB 2011




                                  1
Me

 UCLA Ph.D. 2005 (Distributed protocols in content
  delivery networks)
 PayPal (Web frameworks and Session Stores)
 Yahoo! (Serving Infrastructure, Graph Indexing, Real-time
  Bidding in Display Ad Exchanges)
 @ LinkedIn (Distributed Data Systems team): Distributed
  data transport and storage technology (Kafka, Databus,
  Espresso, ...)




                                                          2
Outline


 LinkedIn Products
 Data Ecosystem
 LinkedIn Data Infrastructure Solutions
 Next Play




                                           3
LinkedIn By The Numbers

   120,000,000+ users in August 2011
   2 new user registrations per second
   4 billion People Searches expected in 2011
   2+ million companies with LinkedIn Company Pages
   81+ million unique visitors monthly*
   150K domains feature the LinkedIn Share Button
   7.1 billion page views in Q2 2011
   1M LinkedIn Groups


* Based on comScore, Q2 2011


                                                       4
Member Profiles




                  5
Signal - faceted stream search




                                 6
People You May Know




                      7
Outline


 LinkedIn Products
 Data Ecosystem
 LinkedIn Data Infrastructure Solutions
 Next Play




                                           8
Three Paradigms : Simplifying the Data Continuum




• Member Profiles         • Signal                    • People You May Know

• Company Profiles        • Profile Standardization   • Connection Strength
• Connections             • News                      • News
• Communications          • Recommendations           • Recommendations
                          • Search                    • Next best idea
                          • Communications


 Online                   Nearline                     Offline
Activity that should     Activity that should         Activity that can be
be reflected immediately be reflected soon            reflected later


                                                                              9
Data Infrastructure Toolbox (Online)

           Capabilities                Systems   Analysis
           Key-value access
           Rich structures (e.g.
           indexes)
           Change capture
           capability
                                   {
           Search platform
           Graph engine




                                                            10
Data Infrastructure Toolbox (Nearline)

          Capabilities             Systems   Analysis

          Change capture streams

          Messaging for site
          events, monitoring
          Nearline processing




                                                        11
Data Infrastructure Toolbox (Offline)

          Capabilities         Systems   Analysis

          Machine learning,
          ranking, relevance
          Analytics on
          Social gestures




                                                    12
Laying out the tools




                       13
Outline


 LinkedIn Products
 Data Ecosystem
 LinkedIn Data Infrastructure Solutions
 Next Play




                                           14
Focus on four systems in Online and Nearline

 Data Transport
   – Kafka
   – Databus
 Online Data Stores
   – Voldemort
   – Espresso




                                               15
LinkedIn Data Infrastructure Solutions
Kafka: High-Volume Low-Latency Messaging System




                                                  16
Kafka: Architecture
                                  Broker Tier
    WebTier                                                                      Consumers

            Push          Sequential write            sendfile     Pull
                                                                                            Iterator 1




                                                                               Client Lib
            Event                                                  Events
                                     Topic 1




                                                                                Kafka
            s
            100 MB/sec                                            200 MB/sec
                                     Topic 2
                                                                                            Iterator n
                                     Topic N

                                                                                            Topic  Offset


                                             Topic, Partition                    Offset
                                             Ownership
                                                                 Zookeeper       Management


Scale                            Guarantees
    Billions of Events             At least once delivery
    TBs per day                    Very high throughput
    Inter-colo: few seconds        Low latency
    Typical retention: weeks       Durability

                                                                                                         17
LinkedIn Data Infrastructure Solutions
Databus : Timeline-Consistent Change Data Capture




                                                    18
Databus at LinkedIn
                                                                  Client
                          Relay                                    Consumer 1




                                                     Client Lib
        Capture




                                                     Databus
                                          On-line
  DB    Changes                           Changes
                       Event Win                                   Consumer n
                      On-line
                     Changes

                      Bootstrap                                   Client
                                                                   Consumer 1




                                                     Client Lib
                                                     Databus
                                      Consistent
                                     Snapshot at U                 Consumer n
                           DB

Features                             Guarantees
 Transport independent of data       Transactional semantics
  source: Oracle, MySQL, …            Timeline consistency with the data
                                       source
 Portable change event               Durability (by data source)
  serialization and versioning        At-least-once delivery
 Start consumption from arbitrary    Availability
  point                               Low latency

                                                                                19
LinkedIn Data Infrastructure Solutions
Voldemort: Highly-Available Distributed Data Store




                                                     20
Voldemort: Architecture




Highlights                In production
•   Open source           •   Data products
•   Pluggable components •    Network updates, sharing,
•   Tunable consistency /     page view tracking,
    availability              rate-limiting, more…
•   Key/value model,      •   Future: SSDs,
    server side “views”       multi-tenancy
LinkedIn Data Infrastructure Solutions
Espresso: Indexed Timeline-Consistent Distributed
           Data Store




                                                    22
Espresso: Key Design Points

 Hierarchical data model
   – InMail, Forums, Groups, Companies
 Native Change Data Capture Stream
   – Timeline consistency
   – Read after Write
 Rich functionality within a hierarchy
   – Local Secondary Indexes
   – Transactions
   – Full-text search
 Modular and Pluggable
   – Off-the-shelf: MySQL, Lucene, Avro



                                          23
Application View




                   24
Partitioning




               25
Partition Layout: Master, Slave
3 Storage Engine nodes, 2 way replication

   Database
                       P.1    P.2     P.3    P.5     P.6     P.7
 Partition: P.1
 Node: 1               P.4    P.5     P.6    P.8     P.1     P.2
 …
 Partition: P.12
 Node: 3
                       P.9    P.1            P.11    P.1
                              0                      2
                             Node 1                 Node 2
   Cluster

 Node: 1
 M: P.1 – Active       P.9    P.1     P.11
     …                        0
 S: P.5 – Active
     …                 P.1    P.3     P.4
                       2
                       P.7    P.8                            Master
   Cluster                                                   Slave
   Manager                   Node 3
Espresso: API

 REST over HTTP

 Get Messages for bob
    – GET /MailboxDB/MessageMeta/bob


 Get MsgId 3 for bob
    – GET /MailboxDB/MessageMeta/bob/3


 Get first page of Messages for bob that are unread and in the inbox
    – GET /MailboxDB/MessageMeta/bob/?query=“+isUnread:true
      +isInbox:true”&start=0&count=15




                                                                        27
Espresso: API Transactions

•   Add a message to bob’s mailbox
     •   transactionally update mailbox aggregates, insert into metadata and details.

         POST /MailboxDB/*/bob HTTP/1.1
         Content-Type: multipart/binary; boundary=1299799120
         Accept: application/json
         --1299799120
         Content-Type: application/json
         Content-Location: /MailboxDB/MessageStats/bob
         Content-Length: 50
         {“total”:”+1”, “unread”:”+1”}

         --1299799120
         Content-Type: application/json
         Content-Location: /MailboxDB/MessageMeta/bob
         Content-Length: 332
         {“from”:”…”,”subject”:”…”,…}

         --1299799120
         Content-Type: application/json
         Content-Location: /MailboxDB/MessageDetails/bob
         Content-Length: 542
         {“body”:”…”}

         --1299799120—




                                                                                        28
Espresso: System Components




                              29
Espresso @ LinkedIn

 First applications
   – Company Profiles
   – InMail
 Next
   – Unified Social Content Platform
   – Member Profiles
   – Many more…




                                       30
Espresso: Next steps

   Launched first application Oct 2011
   Open source 2012
   Multi-Datacenter support
   Log-structured storage
   Time-partitioned data




                                          31
Outline


 LinkedIn Products
 Data Ecosystem
 LinkedIn Data Infrastructure Solutions
 Next Play




                                           32
The Specialization Paradox in Distributed Systems


 Good: Build specialized
  systems so you can do each
  thing really well
 Bad: Rebuild distributed
  routing, failover, cluster
  management, monitoring,
  tooling




                                                    33
Generic Cluster Manager: Helix

• Generic Distributed State Model
• Centralized Config Management
• Automatic Load Balancing
• Fault tolerance
• Health monitoring
• Cluster expansion and
  rebalancing
• Open Source 2012
• Espresso, Databus and Search




                                    34
Stay tuned for

 Innovation
   – Nearline processing
   – Espresso eco-system
   – Storage / indexing
   – Analytics engine
   – Search
 Convergence
   – Building blocks for distributed data
     management systems

                                            35
Thanks!




          36
Appendix




           37
Espresso: Routing

 Router is a high-performance HTTP proxy
 Examines URL, extracts partition key
 Per-db routing strategy
   – Hash Based
   – Route To Any (for schema access)
   – Range (future)
 Routing function maps partition key to partition
 Cluster Manager maintains mapping of partition to hosts:
   – Single Master
   – Multiple Slaves




                                                             38
Espresso: Storage Node

 Data Store (MySQL)
   – Stores document as Avro serialized blob
   – Blob indexed by (partition key {, sub-key})
   – Row also contains limited metadata
        Etag, Last modified time, Avro schema version
 Document Schema specifies per-field index constraints
 Lucene index per partition key / resource




                                                          39
Espresso: Replication

 MySQL replication of mastered partitions
 MySQL “Slave” is MySQL instance with custom storage
  engine
   – custom storage engine just publishes to databus
 Per-database commit sequence number
 Replication is Databus
   – Supports existing downstream consumers
 Storage node consumes from Databus to update
  secondary indexes and slave partitions




                                                        40

More Related Content

What's hot

KBK Group Skolkovo Investor Presentation
KBK Group Skolkovo Investor PresentationKBK Group Skolkovo Investor Presentation
KBK Group Skolkovo Investor PresentationImmense Lab
 
InfiniDB 3 - Speeding Big Data Analytics in Amazon EC2
InfiniDB 3 - Speeding Big Data Analytics in Amazon EC2InfiniDB 3 - Speeding Big Data Analytics in Amazon EC2
InfiniDB 3 - Speeding Big Data Analytics in Amazon EC2Calpont Corporation
 
Security, Governance & Integration in a Cloud Connected World
Security, Governance & Integration in a Cloud Connected WorldSecurity, Governance & Integration in a Cloud Connected World
Security, Governance & Integration in a Cloud Connected WorldCA API Management
 
Top 6 Reasons to Use a Distributed Data Grid
Top 6 Reasons to Use a Distributed Data GridTop 6 Reasons to Use a Distributed Data Grid
Top 6 Reasons to Use a Distributed Data GridScaleOut Software
 
IBM Managed Hosting - Linux virtual services
IBM Managed Hosting - Linux virtual servicesIBM Managed Hosting - Linux virtual services
IBM Managed Hosting - Linux virtual serviceswebhostingguy
 
Alliance Access Database Recovery
Alliance Access Database RecoveryAlliance Access Database Recovery
Alliance Access Database RecoverySWIFT
 
Neecom 2010 (Inovis-Dell case study)
Neecom 2010 (Inovis-Dell case study)Neecom 2010 (Inovis-Dell case study)
Neecom 2010 (Inovis-Dell case study)Doug Kern
 
I'll See You On the Write Side of the Web
I'll See You On the Write Side of the WebI'll See You On the Write Side of the Web
I'll See You On the Write Side of the WebStuart Charlton
 
The Art & Sience of Optimization
The Art & Sience of OptimizationThe Art & Sience of Optimization
The Art & Sience of OptimizationHertzel Karbasi
 
How a Cloud Computing Provider Reached the Holy Grail of Visibility
How a Cloud Computing Provider Reached the Holy Grail of VisibilityHow a Cloud Computing Provider Reached the Holy Grail of Visibility
How a Cloud Computing Provider Reached the Holy Grail of Visibilityeladgotfrid
 
Linking Data and Actions on the Web
Linking Data and Actions on the WebLinking Data and Actions on the Web
Linking Data and Actions on the WebStuart Charlton
 
Build A Flexible Application Infrastructure Environment Web Sphere Connectivi...
Build A Flexible Application Infrastructure Environment Web Sphere Connectivi...Build A Flexible Application Infrastructure Environment Web Sphere Connectivi...
Build A Flexible Application Infrastructure Environment Web Sphere Connectivi...Carly Snodgrass
 
Take the spaghetti out of windows azure – an insight for it pro techies part 1
Take the spaghetti out of windows azure – an insight for it pro techies part 1Take the spaghetti out of windows azure – an insight for it pro techies part 1
Take the spaghetti out of windows azure – an insight for it pro techies part 1Microsoft TechNet - Belgium and Luxembourg
 
2012 open storage summit keynote
2012 open storage summit   keynote2012 open storage summit   keynote
2012 open storage summit keynoteRandy Bias
 
Sql azure database under the hood
Sql azure database under the hoodSql azure database under the hood
Sql azure database under the hoodguest2dd056
 
Lenovo: The Cloud Over BYOD
Lenovo: The Cloud Over BYODLenovo: The Cloud Over BYOD
Lenovo: The Cloud Over BYODLenovo Education
 

What's hot (20)

KBK Group Skolkovo Investor Presentation
KBK Group Skolkovo Investor PresentationKBK Group Skolkovo Investor Presentation
KBK Group Skolkovo Investor Presentation
 
InfiniDB 3 - Speeding Big Data Analytics in Amazon EC2
InfiniDB 3 - Speeding Big Data Analytics in Amazon EC2InfiniDB 3 - Speeding Big Data Analytics in Amazon EC2
InfiniDB 3 - Speeding Big Data Analytics in Amazon EC2
 
Security, Governance & Integration in a Cloud Connected World
Security, Governance & Integration in a Cloud Connected WorldSecurity, Governance & Integration in a Cloud Connected World
Security, Governance & Integration in a Cloud Connected World
 
Cliser
CliserCliser
Cliser
 
Top 6 Reasons to Use a Distributed Data Grid
Top 6 Reasons to Use a Distributed Data GridTop 6 Reasons to Use a Distributed Data Grid
Top 6 Reasons to Use a Distributed Data Grid
 
Treasure Data and Heroku
Treasure Data and HerokuTreasure Data and Heroku
Treasure Data and Heroku
 
MySQL Cluster
MySQL ClusterMySQL Cluster
MySQL Cluster
 
IBM Managed Hosting - Linux virtual services
IBM Managed Hosting - Linux virtual servicesIBM Managed Hosting - Linux virtual services
IBM Managed Hosting - Linux virtual services
 
Alliance Access Database Recovery
Alliance Access Database RecoveryAlliance Access Database Recovery
Alliance Access Database Recovery
 
Neecom 2010 (Inovis-Dell case study)
Neecom 2010 (Inovis-Dell case study)Neecom 2010 (Inovis-Dell case study)
Neecom 2010 (Inovis-Dell case study)
 
I'll See You On the Write Side of the Web
I'll See You On the Write Side of the WebI'll See You On the Write Side of the Web
I'll See You On the Write Side of the Web
 
The Art & Sience of Optimization
The Art & Sience of OptimizationThe Art & Sience of Optimization
The Art & Sience of Optimization
 
How a Cloud Computing Provider Reached the Holy Grail of Visibility
How a Cloud Computing Provider Reached the Holy Grail of VisibilityHow a Cloud Computing Provider Reached the Holy Grail of Visibility
How a Cloud Computing Provider Reached the Holy Grail of Visibility
 
Chris millercloud
Chris millercloudChris millercloud
Chris millercloud
 
Linking Data and Actions on the Web
Linking Data and Actions on the WebLinking Data and Actions on the Web
Linking Data and Actions on the Web
 
Build A Flexible Application Infrastructure Environment Web Sphere Connectivi...
Build A Flexible Application Infrastructure Environment Web Sphere Connectivi...Build A Flexible Application Infrastructure Environment Web Sphere Connectivi...
Build A Flexible Application Infrastructure Environment Web Sphere Connectivi...
 
Take the spaghetti out of windows azure – an insight for it pro techies part 1
Take the spaghetti out of windows azure – an insight for it pro techies part 1Take the spaghetti out of windows azure – an insight for it pro techies part 1
Take the spaghetti out of windows azure – an insight for it pro techies part 1
 
2012 open storage summit keynote
2012 open storage summit   keynote2012 open storage summit   keynote
2012 open storage summit keynote
 
Sql azure database under the hood
Sql azure database under the hoodSql azure database under the hood
Sql azure database under the hood
 
Lenovo: The Cloud Over BYOD
Lenovo: The Cloud Over BYODLenovo: The Cloud Over BYOD
Lenovo: The Cloud Over BYOD
 

Viewers also liked (12)

Cell respiration part 2
Cell respiration part 2Cell respiration part 2
Cell respiration part 2
 
Cellresp.
Cellresp.Cellresp.
Cellresp.
 
Pyruvate Reactions
Pyruvate ReactionsPyruvate Reactions
Pyruvate Reactions
 
Glycolysis terms
Glycolysis termsGlycolysis terms
Glycolysis terms
 
Blood Glucose
Blood GlucoseBlood Glucose
Blood Glucose
 
Steps of Glycolysis in a brief
Steps of Glycolysis in a brief Steps of Glycolysis in a brief
Steps of Glycolysis in a brief
 
Mt DNA
Mt DNAMt DNA
Mt DNA
 
Glycolysis ppt
Glycolysis pptGlycolysis ppt
Glycolysis ppt
 
Mitochondrial dna
Mitochondrial dnaMitochondrial dna
Mitochondrial dna
 
Glycolysis (10 Steps) By: Asar Khan
Glycolysis (10 Steps) By: Asar KhanGlycolysis (10 Steps) By: Asar Khan
Glycolysis (10 Steps) By: Asar Khan
 
Glycolysis
GlycolysisGlycolysis
Glycolysis
 
Glycolysis- An over view
Glycolysis- An over viewGlycolysis- An over view
Glycolysis- An over view
 

Similar to Xldb2011 tue 1005_linked_in

Data Infrastructure at LinkedIn
Data Infrastructure at LinkedInData Infrastructure at LinkedIn
Data Infrastructure at LinkedInAmy W. Tang
 
OSS Presentation Keynote by Hal Stern
OSS Presentation Keynote by Hal SternOSS Presentation Keynote by Hal Stern
OSS Presentation Keynote by Hal SternOpenStorageSummit
 
Inaugural address manjusha - Indicthreads cloud computing conference 2011
Inaugural address manjusha -  Indicthreads cloud computing conference 2011Inaugural address manjusha -  Indicthreads cloud computing conference 2011
Inaugural address manjusha - Indicthreads cloud computing conference 2011IndicThreads
 
Mike Taulty MIX10 Silverlight 4 Patterns Frameworks
Mike Taulty MIX10 Silverlight 4 Patterns FrameworksMike Taulty MIX10 Silverlight 4 Patterns Frameworks
Mike Taulty MIX10 Silverlight 4 Patterns Frameworksukdpe
 
Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop WebinarWhy Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop WebinarCloudera, Inc.
 
All Aboard the Databus
All Aboard the DatabusAll Aboard the Databus
All Aboard the DatabusAmy W. Tang
 
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...SL Corporation
 
SnapLogic corporate presentation
SnapLogic corporate presentationSnapLogic corporate presentation
SnapLogic corporate presentationpbridges
 
13h00 p duff-building-applications-with-aws-final
13h00   p duff-building-applications-with-aws-final13h00   p duff-building-applications-with-aws-final
13h00 p duff-building-applications-with-aws-finalLuiz Gustavo Santos
 
Evolution from EDA to Data Mesh: Data in Motion
Evolution from EDA to Data Mesh: Data in MotionEvolution from EDA to Data Mesh: Data in Motion
Evolution from EDA to Data Mesh: Data in Motionconfluent
 
Service Oriented Architecture (SOA) [1/5] : Introduction to SOA
Service Oriented Architecture (SOA) [1/5] : Introduction to SOAService Oriented Architecture (SOA) [1/5] : Introduction to SOA
Service Oriented Architecture (SOA) [1/5] : Introduction to SOAIMC Institute
 
Kuali update v4 - mw
Kuali update   v4 - mwKuali update   v4 - mw
Kuali update v4 - mwsarnoa
 
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012Shirshanka Das
 
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentApache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentHostedbyConfluent
 
Knowledge Base+: a Cloud-Based Community Knowledge Base
Knowledge Base+: a Cloud-Based Community Knowledge BaseKnowledge Base+: a Cloud-Based Community Knowledge Base
Knowledge Base+: a Cloud-Based Community Knowledge Basesherif user group
 
The Enterprise Cloud: Immediate. Urgent. Inevitable.
The Enterprise Cloud: Immediate. Urgent. Inevitable.The Enterprise Cloud: Immediate. Urgent. Inevitable.
The Enterprise Cloud: Immediate. Urgent. Inevitable.Peter Coffee
 
Part 2 OCLC Strategic Presentation Bruce Crocco ACURIL 2011
Part 2 OCLC Strategic Presentation Bruce Crocco ACURIL 2011Part 2 OCLC Strategic Presentation Bruce Crocco ACURIL 2011
Part 2 OCLC Strategic Presentation Bruce Crocco ACURIL 2011Antonio Alba
 

Similar to Xldb2011 tue 1005_linked_in (20)

Data Infrastructure at LinkedIn
Data Infrastructure at LinkedInData Infrastructure at LinkedIn
Data Infrastructure at LinkedIn
 
OSS Presentation Keynote by Hal Stern
OSS Presentation Keynote by Hal SternOSS Presentation Keynote by Hal Stern
OSS Presentation Keynote by Hal Stern
 
Inaugural address manjusha - Indicthreads cloud computing conference 2011
Inaugural address manjusha -  Indicthreads cloud computing conference 2011Inaugural address manjusha -  Indicthreads cloud computing conference 2011
Inaugural address manjusha - Indicthreads cloud computing conference 2011
 
Mike Taulty MIX10 Silverlight 4 Patterns Frameworks
Mike Taulty MIX10 Silverlight 4 Patterns FrameworksMike Taulty MIX10 Silverlight 4 Patterns Frameworks
Mike Taulty MIX10 Silverlight 4 Patterns Frameworks
 
Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop WebinarWhy Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
 
All Aboard the Databus
All Aboard the DatabusAll Aboard the Databus
All Aboard the Databus
 
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
 
Secure Big Data Analytics - Hadoop & Intel
Secure Big Data Analytics - Hadoop & IntelSecure Big Data Analytics - Hadoop & Intel
Secure Big Data Analytics - Hadoop & Intel
 
SnapLogic corporate presentation
SnapLogic corporate presentationSnapLogic corporate presentation
SnapLogic corporate presentation
 
13h00 p duff-building-applications-with-aws-final
13h00   p duff-building-applications-with-aws-final13h00   p duff-building-applications-with-aws-final
13h00 p duff-building-applications-with-aws-final
 
Evolution from EDA to Data Mesh: Data in Motion
Evolution from EDA to Data Mesh: Data in MotionEvolution from EDA to Data Mesh: Data in Motion
Evolution from EDA to Data Mesh: Data in Motion
 
Building Applications with AWS
Building Applications with AWSBuilding Applications with AWS
Building Applications with AWS
 
Service Oriented Architecture (SOA) [1/5] : Introduction to SOA
Service Oriented Architecture (SOA) [1/5] : Introduction to SOAService Oriented Architecture (SOA) [1/5] : Introduction to SOA
Service Oriented Architecture (SOA) [1/5] : Introduction to SOA
 
Kuali update v4 - mw
Kuali update   v4 - mwKuali update   v4 - mw
Kuali update v4 - mw
 
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
 
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentApache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
 
Knowledge Base+: a Cloud-Based Community Knowledge Base
Knowledge Base+: a Cloud-Based Community Knowledge BaseKnowledge Base+: a Cloud-Based Community Knowledge Base
Knowledge Base+: a Cloud-Based Community Knowledge Base
 
The Enterprise Cloud: Immediate. Urgent. Inevitable.
The Enterprise Cloud: Immediate. Urgent. Inevitable.The Enterprise Cloud: Immediate. Urgent. Inevitable.
The Enterprise Cloud: Immediate. Urgent. Inevitable.
 
Part 2 OCLC Strategic Presentation Bruce Crocco ACURIL 2011
Part 2 OCLC Strategic Presentation Bruce Crocco ACURIL 2011Part 2 OCLC Strategic Presentation Bruce Crocco ACURIL 2011
Part 2 OCLC Strategic Presentation Bruce Crocco ACURIL 2011
 
What's new in Exchange 2013?
What's new in Exchange 2013?What's new in Exchange 2013?
What's new in Exchange 2013?
 

More from liqiang xu

浅谈灰度发布在贴吧的应用 支付宝 20130909
浅谈灰度发布在贴吧的应用 支付宝 20130909浅谈灰度发布在贴吧的应用 支付宝 20130909
浅谈灰度发布在贴吧的应用 支付宝 20130909liqiang xu
 
Csrf攻击原理及防御措施
Csrf攻击原理及防御措施Csrf攻击原理及防御措施
Csrf攻击原理及防御措施liqiang xu
 
Xldb2011 wed 1415_andrew_lamb-buildingblocks
Xldb2011 wed 1415_andrew_lamb-buildingblocksXldb2011 wed 1415_andrew_lamb-buildingblocks
Xldb2011 wed 1415_andrew_lamb-buildingblocksliqiang xu
 
Xldb2011 tue 1055_tom_fastner
Xldb2011 tue 1055_tom_fastnerXldb2011 tue 1055_tom_fastner
Xldb2011 tue 1055_tom_fastnerliqiang xu
 
Xldb2011 tue 0940_facebook_realtimeanalytics
Xldb2011 tue 0940_facebook_realtimeanalyticsXldb2011 tue 0940_facebook_realtimeanalytics
Xldb2011 tue 0940_facebook_realtimeanalyticsliqiang xu
 
Xldb2011 tue 1120_youtube_datawarehouse
Xldb2011 tue 1120_youtube_datawarehouseXldb2011 tue 1120_youtube_datawarehouse
Xldb2011 tue 1120_youtube_datawarehouseliqiang xu
 
Selenium私房菜(新手入门教程)
Selenium私房菜(新手入门教程)Selenium私房菜(新手入门教程)
Selenium私房菜(新手入门教程)liqiang xu
 
大话Php之性能
大话Php之性能大话Php之性能
大话Php之性能liqiang xu
 
Nginx internals
Nginx internalsNginx internals
Nginx internalsliqiang xu
 
1.4亿在线背后的故事(2)
1.4亿在线背后的故事(2)1.4亿在线背后的故事(2)
1.4亿在线背后的故事(2)liqiang xu
 
1.4亿在线背后的故事(1)
1.4亿在线背后的故事(1)1.4亿在线背后的故事(1)
1.4亿在线背后的故事(1)liqiang xu
 

More from liqiang xu (12)

浅谈灰度发布在贴吧的应用 支付宝 20130909
浅谈灰度发布在贴吧的应用 支付宝 20130909浅谈灰度发布在贴吧的应用 支付宝 20130909
浅谈灰度发布在贴吧的应用 支付宝 20130909
 
Csrf攻击原理及防御措施
Csrf攻击原理及防御措施Csrf攻击原理及防御措施
Csrf攻击原理及防御措施
 
Hdfs comics
Hdfs comicsHdfs comics
Hdfs comics
 
Xldb2011 wed 1415_andrew_lamb-buildingblocks
Xldb2011 wed 1415_andrew_lamb-buildingblocksXldb2011 wed 1415_andrew_lamb-buildingblocks
Xldb2011 wed 1415_andrew_lamb-buildingblocks
 
Xldb2011 tue 1055_tom_fastner
Xldb2011 tue 1055_tom_fastnerXldb2011 tue 1055_tom_fastner
Xldb2011 tue 1055_tom_fastner
 
Xldb2011 tue 0940_facebook_realtimeanalytics
Xldb2011 tue 0940_facebook_realtimeanalyticsXldb2011 tue 0940_facebook_realtimeanalytics
Xldb2011 tue 0940_facebook_realtimeanalytics
 
Xldb2011 tue 1120_youtube_datawarehouse
Xldb2011 tue 1120_youtube_datawarehouseXldb2011 tue 1120_youtube_datawarehouse
Xldb2011 tue 1120_youtube_datawarehouse
 
Selenium私房菜(新手入门教程)
Selenium私房菜(新手入门教程)Selenium私房菜(新手入门教程)
Selenium私房菜(新手入门教程)
 
大话Php之性能
大话Php之性能大话Php之性能
大话Php之性能
 
Nginx internals
Nginx internalsNginx internals
Nginx internals
 
1.4亿在线背后的故事(2)
1.4亿在线背后的故事(2)1.4亿在线背后的故事(2)
1.4亿在线背后的故事(2)
 
1.4亿在线背后的故事(1)
1.4亿在线背后的故事(1)1.4亿在线背后的故事(1)
1.4亿在线背后的故事(1)
 

Recently uploaded

The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfJamie (Taka) Wang
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 

Recently uploaded (20)

The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 

Xldb2011 tue 1005_linked_in

  • 1. Data Infrastructure at LinkedIn Shirshanka Das XLDB 2011 1
  • 2. Me  UCLA Ph.D. 2005 (Distributed protocols in content delivery networks)  PayPal (Web frameworks and Session Stores)  Yahoo! (Serving Infrastructure, Graph Indexing, Real-time Bidding in Display Ad Exchanges)  @ LinkedIn (Distributed Data Systems team): Distributed data transport and storage technology (Kafka, Databus, Espresso, ...) 2
  • 3. Outline  LinkedIn Products  Data Ecosystem  LinkedIn Data Infrastructure Solutions  Next Play 3
  • 4. LinkedIn By The Numbers  120,000,000+ users in August 2011  2 new user registrations per second  4 billion People Searches expected in 2011  2+ million companies with LinkedIn Company Pages  81+ million unique visitors monthly*  150K domains feature the LinkedIn Share Button  7.1 billion page views in Q2 2011  1M LinkedIn Groups * Based on comScore, Q2 2011 4
  • 6. Signal - faceted stream search 6
  • 7. People You May Know 7
  • 8. Outline  LinkedIn Products  Data Ecosystem  LinkedIn Data Infrastructure Solutions  Next Play 8
  • 9. Three Paradigms : Simplifying the Data Continuum • Member Profiles • Signal • People You May Know • Company Profiles • Profile Standardization • Connection Strength • Connections • News • News • Communications • Recommendations • Recommendations • Search • Next best idea • Communications Online Nearline Offline Activity that should Activity that should Activity that can be be reflected immediately be reflected soon reflected later 9
  • 10. Data Infrastructure Toolbox (Online) Capabilities Systems Analysis Key-value access Rich structures (e.g. indexes) Change capture capability { Search platform Graph engine 10
  • 11. Data Infrastructure Toolbox (Nearline) Capabilities Systems Analysis Change capture streams Messaging for site events, monitoring Nearline processing 11
  • 12. Data Infrastructure Toolbox (Offline) Capabilities Systems Analysis Machine learning, ranking, relevance Analytics on Social gestures 12
  • 13. Laying out the tools 13
  • 14. Outline  LinkedIn Products  Data Ecosystem  LinkedIn Data Infrastructure Solutions  Next Play 14
  • 15. Focus on four systems in Online and Nearline  Data Transport – Kafka – Databus  Online Data Stores – Voldemort – Espresso 15
  • 16. LinkedIn Data Infrastructure Solutions Kafka: High-Volume Low-Latency Messaging System 16
  • 17. Kafka: Architecture Broker Tier WebTier Consumers Push Sequential write sendfile Pull Iterator 1 Client Lib Event Events Topic 1 Kafka s 100 MB/sec 200 MB/sec Topic 2 Iterator n Topic N Topic  Offset Topic, Partition Offset Ownership Zookeeper Management Scale Guarantees  Billions of Events  At least once delivery  TBs per day  Very high throughput  Inter-colo: few seconds  Low latency  Typical retention: weeks  Durability 17
  • 18. LinkedIn Data Infrastructure Solutions Databus : Timeline-Consistent Change Data Capture 18
  • 19. Databus at LinkedIn Client Relay Consumer 1 Client Lib Capture Databus On-line DB Changes Changes Event Win Consumer n On-line Changes Bootstrap Client Consumer 1 Client Lib Databus Consistent Snapshot at U Consumer n DB Features Guarantees  Transport independent of data  Transactional semantics source: Oracle, MySQL, …  Timeline consistency with the data source  Portable change event  Durability (by data source) serialization and versioning  At-least-once delivery  Start consumption from arbitrary  Availability point  Low latency 19
  • 20. LinkedIn Data Infrastructure Solutions Voldemort: Highly-Available Distributed Data Store 20
  • 21. Voldemort: Architecture Highlights In production • Open source • Data products • Pluggable components • Network updates, sharing, • Tunable consistency / page view tracking, availability rate-limiting, more… • Key/value model, • Future: SSDs, server side “views” multi-tenancy
  • 22. LinkedIn Data Infrastructure Solutions Espresso: Indexed Timeline-Consistent Distributed Data Store 22
  • 23. Espresso: Key Design Points  Hierarchical data model – InMail, Forums, Groups, Companies  Native Change Data Capture Stream – Timeline consistency – Read after Write  Rich functionality within a hierarchy – Local Secondary Indexes – Transactions – Full-text search  Modular and Pluggable – Off-the-shelf: MySQL, Lucene, Avro 23
  • 26. Partition Layout: Master, Slave 3 Storage Engine nodes, 2 way replication Database P.1 P.2 P.3 P.5 P.6 P.7 Partition: P.1 Node: 1 P.4 P.5 P.6 P.8 P.1 P.2 … Partition: P.12 Node: 3 P.9 P.1 P.11 P.1 0 2 Node 1 Node 2 Cluster Node: 1 M: P.1 – Active P.9 P.1 P.11 … 0 S: P.5 – Active … P.1 P.3 P.4 2 P.7 P.8 Master Cluster Slave Manager Node 3
  • 27. Espresso: API  REST over HTTP  Get Messages for bob – GET /MailboxDB/MessageMeta/bob  Get MsgId 3 for bob – GET /MailboxDB/MessageMeta/bob/3  Get first page of Messages for bob that are unread and in the inbox – GET /MailboxDB/MessageMeta/bob/?query=“+isUnread:true +isInbox:true”&start=0&count=15 27
  • 28. Espresso: API Transactions • Add a message to bob’s mailbox • transactionally update mailbox aggregates, insert into metadata and details. POST /MailboxDB/*/bob HTTP/1.1 Content-Type: multipart/binary; boundary=1299799120 Accept: application/json --1299799120 Content-Type: application/json Content-Location: /MailboxDB/MessageStats/bob Content-Length: 50 {“total”:”+1”, “unread”:”+1”} --1299799120 Content-Type: application/json Content-Location: /MailboxDB/MessageMeta/bob Content-Length: 332 {“from”:”…”,”subject”:”…”,…} --1299799120 Content-Type: application/json Content-Location: /MailboxDB/MessageDetails/bob Content-Length: 542 {“body”:”…”} --1299799120— 28
  • 30. Espresso @ LinkedIn  First applications – Company Profiles – InMail  Next – Unified Social Content Platform – Member Profiles – Many more… 30
  • 31. Espresso: Next steps  Launched first application Oct 2011  Open source 2012  Multi-Datacenter support  Log-structured storage  Time-partitioned data 31
  • 32. Outline  LinkedIn Products  Data Ecosystem  LinkedIn Data Infrastructure Solutions  Next Play 32
  • 33. The Specialization Paradox in Distributed Systems  Good: Build specialized systems so you can do each thing really well  Bad: Rebuild distributed routing, failover, cluster management, monitoring, tooling 33
  • 34. Generic Cluster Manager: Helix • Generic Distributed State Model • Centralized Config Management • Automatic Load Balancing • Fault tolerance • Health monitoring • Cluster expansion and rebalancing • Open Source 2012 • Espresso, Databus and Search 34
  • 35. Stay tuned for  Innovation – Nearline processing – Espresso eco-system – Storage / indexing – Analytics engine – Search  Convergence – Building blocks for distributed data management systems 35
  • 36. Thanks! 36
  • 37. Appendix 37
  • 38. Espresso: Routing  Router is a high-performance HTTP proxy  Examines URL, extracts partition key  Per-db routing strategy – Hash Based – Route To Any (for schema access) – Range (future)  Routing function maps partition key to partition  Cluster Manager maintains mapping of partition to hosts: – Single Master – Multiple Slaves 38
  • 39. Espresso: Storage Node  Data Store (MySQL) – Stores document as Avro serialized blob – Blob indexed by (partition key {, sub-key}) – Row also contains limited metadata  Etag, Last modified time, Avro schema version  Document Schema specifies per-field index constraints  Lucene index per partition key / resource 39
  • 40. Espresso: Replication  MySQL replication of mastered partitions  MySQL “Slave” is MySQL instance with custom storage engine – custom storage engine just publishes to databus  Per-database commit sequence number  Replication is Databus – Supports existing downstream consumers  Storage node consumes from Databus to update secondary indexes and slave partitions 40