Databus




1/29/2013   Recruiting Solutions
INTRODUCTION


LinkedIn by Numbers
• World’s largest professional network
• 187M+ members world-wide as of Q3 2012
  – Growing at the rate of two per second
• 85 of Fortune 100 companies use Talent Solutions to hire
• > 2.6M company pages
• > 4B search queries
• 75K+ developers leveraging our APIs
• 1.3M unique publishers


The Consequence of Specialization in
           Data Systems
Data Flow is essential
Data Consistency is critical !!!




Solution: Databus



[Diagram: Updates go to the Primary DB. Databus carries Data Change Events from the Primary DB, through Standardization layers, to the Search Index, Graph Index, and Read Replicas.]

Two Ways


• Application code dual writes to database and pub-sub system
  – Easy on the surface
  – Consistent?
• Extract changes from database commit log
  – Tough but possible
  – Consistent!!!



Key Design Decisions : Semantics
• Logical clocks attached to the source
  – Physical offsets could be used for internal
    transport
  – Simplifies data portability
• Pull model
  – Restarts are simple
  – Derived State = f (Source state, Clock)
  – + Idempotence = Timeline Consistent!
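The last two bullets can be illustrated with a minimal sketch. It assumes a hypothetical consumer that keeps its derived state in a map and records the highest source SCN it has applied. The class and method names (`DerivedStateSketch`, `applyWindow`, `checkpoint`) are illustrative and not part of the Databus API, and each transaction window is assumed to carry exactly one SCN.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical consumer-side state; not part of the Databus API. */
final class DerivedStateSketch {
    private final Map<String, String> rows = new HashMap<>();
    private long checkpointScn = -1; // highest source SCN fully applied

    /**
     * Applies one transaction window identified by its source SCN.
     * Returns false (and changes nothing) if the window was already applied.
     */
    boolean applyWindow(long scn, Map<String, String> writes) {
        if (scn <= checkpointScn) {
            return false; // redelivered after a restart: safe to ignore
        }
        rows.putAll(writes);
        checkpointScn = scn;
        return true;
    }

    /** SCN to send with the next pull request ("give me changes after this"). */
    long checkpoint() {
        return checkpointScn;
    }

    String get(String key) {
        return rows.get(key);
    }
}
```

Because the consumer pulls "everything after my checkpoint" and ignores anything at or below it, a restart only needs the stored checkpoint, and at-least-once delivery still yields state that follows the source timeline.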

Key Design Decisions : Systems
• Isolate fast consumers from slow consumers
  – Workload separation between online, catch-up,
    bootstrap
• Isolate sources from consumers
  – Schema changes
  – Physical layout changes
  – Speed mismatch
• Schema-aware
  – Filtering, Projections
  – Typically network-bound → can burn more CPU

Requirements
•   Timeline consistency
•   Guaranteed, at least once delivery
•   Low latency
•   Schema evolution
•   Source independence
•   Scalable consumers
•   Handle slow/new consumers without
    affecting happy ones (look-back requirements)

ARCHITECTURE


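Initial Design (2007)
[Diagram: source clock (SCN) timeline from 0 to 102400, with marks at 70000 and 100000. The Relay's in-memory buffer covers roughly the last 3 hrs of changes; the DB holds the rest. Two Happy Consumers and one Slow Consumer read through the Relay; pull paths are labeled Direct Pull and Proxied Pull.]

Pros:
1. Consumer Scaling
2. Some isolation

Cons:
Slow consumers overwhelming the DB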


Software Architecture
                 Four Logical Components

                  • Fetcher
                      – Fetch from db, relay…
                  • Log Store
                      – Store log snippet
                  • Snapshot Store
                      – Store moving data
                        snapshot
                  • Subscription Client
                      – Orchestrate pull
                        across these


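The Databus System
[Diagram: source clock (SCN) timeline with marks at 0, 30000, 70000, 80000, 90000, 100000, and 102400. The Relay's in-memory buffer covers about 3 hrs, the Log covers 10 days, and the Snapshot has infinite retention. The DB is the source. Happy Consumers and a Slow Consumer are served by the Relay and by the Bootstrap Service, whose Server sits in front of Log Storage and the Snapshot Store.]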

The Relay
•   Change event buffering (~ 2 – 7 days)
•   Low latency (10-15 ms)
•   Filtering, Projection
•   Hundreds of consumers per relay
•   Scale-out, High-availability through
    redundancy
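The buffering bullet can be sketched as a bounded, SCN-ordered buffer from which consumers pull by checkpoint. This is a simplified illustration and not the Databus relay implementation: `RelayBufferSketch`, its methods, and the count-based capacity are assumptions (a real relay would bound by memory or time), and events are assumed to arrive in increasing SCN order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical bounded change-event buffer; not the Databus relay code. */
final class RelayBufferSketch {
    static final class Event {
        final long scn;
        final String payload;
        Event(long scn, String payload) { this.scn = scn; this.payload = payload; }
    }

    private final Deque<Event> buffer = new ArrayDeque<>();
    private final int capacity;
    private long evictedScn = -1; // highest SCN that has fallen out of the buffer

    RelayBufferSketch(int capacity) {
        if (capacity < 1) throw new IllegalArgumentException("capacity must be >= 1");
        this.capacity = capacity;
    }

    /** Appends an event (increasing SCN order), evicting the oldest when full. */
    void append(long scn, String payload) {
        if (buffer.size() == capacity) {
            evictedScn = buffer.removeFirst().scn;
        }
        buffer.addLast(new Event(scn, payload));
    }

    /** True if every event after sinceScn is still buffered. */
    boolean canServe(long sinceScn) {
        return sinceScn >= evictedScn;
    }

    /** Returns up to maxEvents events with scn > sinceScn, oldest first. */
    List<Event> pull(long sinceScn, int maxEvents) {
        if (!canServe(sinceScn)) {
            throw new IllegalStateException("checkpoint too old for relay; bootstrap needed");
        }
        List<Event> out = new ArrayList<>();
        for (Event e : buffer) {
            if (e.scn > sinceScn && out.size() < maxEvents) {
                out.add(e);
            }
        }
        return out;
    }
}
```

A consumer whose checkpoint is older than what the buffer retains cannot be served here, which is the case the bootstrap service exists for.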



Deployment Options




Option 1: Peered Deployment   Option 2: Clustered Deployment

The Bootstrap Service
•   Catch-all for slow / new consumers
•   Isolate source OLTP instance from large scans
•   Log Store + Snapshot Store
•   Optimizations
    – Periodic merge
    – Predicate push-down
    – Catch-up versus full bootstrap
• Guaranteed progress for consumers via chunking
• Implementations
    – Database (MySQL)
    – Raw Files
• Bridges the continuum between stream and batch systems
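The interplay of log store, snapshot store, periodic merge, and the catch-up versus full bootstrap choice can be sketched roughly as follows. This is an in-memory illustration under stated assumptions, not the actual service: `BootstrapStoreSketch` and all of its members are hypothetical names, and each SCN is assumed to carry a single key/value change.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory model of the bootstrap service's two stores. */
final class BootstrapStoreSketch {
    enum Plan { CATCH_UP, FULL_BOOTSTRAP }

    private final TreeMap<Long, String[]> logStore = new TreeMap<>(); // scn -> {key, value}
    private final Map<String, String> snapshotStore = new LinkedHashMap<>();
    private long snapshotScn = -1; // snapshot reflects all events with scn <= snapshotScn
    private long trimmedScn = -1;  // log events with scn <= trimmedScn were discarded

    void appendToLog(long scn, String key, String value) {
        logStore.put(scn, new String[] {key, value});
    }

    /** Periodic merge: fold log events up to scn into the snapshot (latest value per key). */
    void mergeIntoSnapshot(long scn) {
        if (scn <= snapshotScn) {
            return; // nothing new to merge
        }
        for (String[] kv : logStore.subMap(snapshotScn, false, scn, true).values()) {
            snapshotStore.put(kv[0], kv[1]);
        }
        snapshotScn = scn;
    }

    /** Drops old log events, but never ones the snapshot has not absorbed yet. */
    void trimLog(long scn) {
        long upTo = Math.min(scn, snapshotScn);
        logStore.headMap(upTo, true).clear();
        trimmedScn = Math.max(trimmedScn, upTo);
    }

    /** Catch-up replays the log; full bootstrap reads the snapshot, then the log after it. */
    Plan planFor(long consumerScn) {
        return consumerScn >= trimmedScn ? Plan.CATCH_UP : Plan.FULL_BOOTSTRAP;
    }

    int snapshotRows() { return snapshotStore.size(); }

    int logEvents() { return logStore.size(); }
}
```

The snapshot holds one row per key, so a consumer that is far behind reads far fewer rows than it would by replaying every logged change.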

The Consumer Client Library
• Glue between Databus infra and business logic
  in the consumer
• Isolates the consumer from changes in the
  databus layer.
• Switches between relay and bootstrap as
  needed
• API
  – Callback with transactions
  – Iterators over windows

Fetcher Implementations
• Oracle
  – Trigger-based
• MySQL
  – Custom-storage-engine based
• In Labs
  – Alternative implementations for Oracle
  – OpenReplicator integration for MySQL


Meta-data Management
• Event definition, serialization and transport
  – Avro
• Oracle, MySQL
  – Avro definition generated from the table schema
• Schema evolution
  – Only backwards-compatible changes allowed
• Isolation between upgrades on producer and
  consumer

Scaling the consumers
                (Partitioning)
• Server-side filtering
  – Range, mod, hash
  – Allows client to control partitioning function
• Consumer groups
  – Distribute partitions evenly across a group
  – Move partitions to available consumers on failure
  – Minimize re-processing
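As a rough illustration of the three filter types and of even partition distribution, the sketch below expresses each filter as a predicate over a numeric key and assigns partitions round-robin. All names (`PartitionSketch`, `range`, `mod`, `hash`, `assign`) are hypothetical. The hash mixing constant and the round-robin policy are assumptions, not what Databus uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.LongPredicate;

/** Hypothetical partition filters and group assignment; not the Databus API. */
final class PartitionSketch {
    /** Range filter: accepts keys in [lo, hi). */
    static LongPredicate range(long lo, long hi) {
        return key -> key >= lo && key < hi;
    }

    /** Mod filter: accepts keys whose non-negative remainder equals the partition. */
    static LongPredicate mod(int numPartitions, int partition) {
        return key -> Math.floorMod(key, (long) numPartitions) == partition;
    }

    /** Hash filter: scrambles the key first so sequential keys spread out. */
    static LongPredicate hash(int numPartitions, int partition) {
        return key -> Math.floorMod(Long.hashCode(key * 0x9E3779B97F4A7C15L),
                                    numPartitions) == partition;
    }

    /** Distributes partitions 0..numPartitions-1 round-robin across the group. */
    static Map<String, List<Integer>> assign(int numPartitions, List<String> consumers) {
        Map<String, List<Integer>> out = new TreeMap<>();
        for (String c : consumers) {
            out.put(c, new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            out.get(consumers.get(p % consumers.size())).add(p);
        }
        return out;
    }
}
```

Because the filter runs on the server, each consumer receives only the events for its own partitions rather than discarding most of the stream locally.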



A NEW CONSUMER


Development with Databus – Client Library
[Diagram: Consumers implement the Stream Event Callback API and the Bootstrap Event Callback API; the Databus Client uses the Client API; both APIs sit on top of the Databus Client Library.]
• Callback API
  – onDataEvent(DbusEvent, Decoder)
  – …
• Client API
  – register(consumers, sources, filter)
  – start()
  – shutdown()
Databus Consumer Implementation
class MyConsumer
      extends AbstractDatabusStreamConsumer
{
  @Override
  public ConsumerCallbackResult onDataEvent(DbusEvent e,
                                            DbusEventDecoder d){
    //use map-like Avro GenericRecord
    GenericRecord g = d.getGenericRecord(e, null);
    //or use the auto-generated Java class
    //(named "typed" because "e" is already the event parameter)
    MyEvent typed = d.getTypedValue(e, null,
                                    MyEvent.class);
    …
    return ConsumerCallbackResult.SUCCESS;
  }
}

Starting the client
public static void main(String[] args) throws Exception {
  //configure
  DatabusHttpClientImpl.Config clientConfig =
                          new DatabusHttpClientImpl.Config();
  clientConfig.loadFromFile("mydbus", "mdbus.props");
  DatabusHttpClientImpl client =
               new DatabusHttpClientImpl(clientConfig);
  //register callback
  MyConsumer callback = new MyConsumer();
  client.registerDatabusStreamListener(callback,
          null, "com.linkedin.events.member2.MemberProfile");
  //start client library
  client.startAndBlock();
}

Event Callback APIs
PERFORMANCE


Relay Throughput




Consumer Throughput




End-End Latency




Snapshot vs Catchup





Introduction to Databus
