All Aboard the Databus!
LinkedIn’s Change Data Capture Pipeline
                                           ACM SOCC 2012
                                           Oct 16th



Databus Team @ LinkedIn
Shirshanka Das
http://www.linkedin.com/in/shirshankadas
@shirshanka


      Recruiting Solutions
The Consequence of Specialization in Data Systems


Data Flow is essential
Data Consistency is critical !!!
The Timeline Consistent Data Flow problem
Two Ways

• Application code dual-writes to the database and a pub-sub system
   – Easy on the surface
   – Consistent?
• Extract changes from the database commit log
   – Tough but possible
   – Consistent!!!
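
The contrast between the two approaches can be sketched in a few lines. This is a toy model, not Databus code: the names `dual_write`, `log_based_write`, `extract_changes`, and `PubSubDown` are hypothetical, and plain Python containers stand in for the database, the pub-sub topic, and the commit log.

```python
class PubSubDown(Exception):
    """Raised when the (simulated) pub-sub publish step fails."""


def dual_write(db, topic, key, value, publish_ok=True):
    """Approach 1: the application writes to two systems in sequence."""
    db[key] = value                      # step 1: database commit succeeds
    if not publish_ok:                   # step 2 can fail independently
        raise PubSubDown(key)
    topic.append((key, value))


def log_based_write(db, commit_log, key, value):
    """Approach 2: one commit; the log entry and the row change together."""
    commit_log.append((len(commit_log) + 1, key, value))  # (sequence, key, value)
    db[key] = value


def extract_changes(commit_log, since_scn):
    """Downstream systems derive every change from the commit log alone."""
    return [(scn, k, v) for scn, k, v in commit_log if scn > since_scn]
```

With dual writes, a failure between the two steps leaves the database and the topic disagreeing. With log extraction, anything that committed is by construction visible to consumers.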
The Result: Databus

[Figure: Databus overview. Labels: Updates, Primary DB, Data Change Events, Databus, Standardization, Search Index, Graph Index, Read Replicas]
Key Design Decisions: Semantics

• Logical clocks attached to the source
   – Physical offsets are only used for internal transport
   – Simplifies data portability
• Pull model
   – Restarts are simple
   – Derived State = f(Source state, Clock)
   – + Idempotence = Timeline Consistent!
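
A minimal sketch of why pull plus a source clock plus idempotence makes restarts simple. It assumes each change event carries a unique, increasing sequence number from the source. The function name `apply_events` and the `(scn, key, value)` event shape are illustrative choices, not the Databus API.

```python
def apply_events(state, checkpoint_scn, events):
    """Apply (scn, key, value) events in order and return the new checkpoint.

    Events at or below the checkpoint are skipped, so replaying a window
    after a restart leaves the derived state unchanged (idempotence).
    """
    for scn, key, value in events:
        if scn <= checkpoint_scn:
            continue                 # already applied before the restart
        state[key] = value           # upsert: derived state follows the source
        checkpoint_scn = scn
    return checkpoint_scn
```

Because the derived state depends only on the source state and the clock value, a consumer that restarts can re-pull from its last checkpoint, or even from an earlier point, and converge to the same result.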




Key Design Decisions: Systems

• Isolate fast consumers from slow consumers
   – Workload separation between online, catch-up, bootstrap
• Isolate sources from consumers
   – Schema changes
   – Physical layout changes
   – Speed mismatch
• Schema-aware
   – Filtering, Projections
   – Typically network-bound → can burn more CPU




Databus: First attempt (2007)

Issues

• Source database pressure caused by slow consumers
• Brittle serialization
Current Architecture (2011)

Four Logical Components

• Fetcher
   – Fetch from db, relay…
• Log Store
   – Store log snippet
• Snapshot Store
   – Store moving data snapshot
• Subscription Client
   – Orchestrate pull across these
The Relay

• Change event buffering (~ 2 – 7 days)
• Low latency (10-15 ms)
• Filtering, Projection
• Hundreds of consumers per relay
• Scale-out, High-availability through redundancy

[Figure: Option 1: Peered Deployment; Option 2: Clustered Deployment]
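
The buffering behavior can be illustrated with a bounded buffer that consumers pull from by sequence number. This is a sketch under simplifying assumptions: sequence numbers are contiguous integers, retention is by event count rather than days, and the class name `RelayBuffer` and its methods are hypothetical.

```python
from collections import deque


class RelayBuffer:
    """Bounded in-memory buffer of change events ordered by sequence number."""

    def __init__(self, capacity):
        self.events = deque(maxlen=capacity)   # oldest events are evicted first

    def append(self, scn, payload):
        self.events.append((scn, payload))

    def oldest_scn(self):
        return self.events[0][0] if self.events else None

    def pull(self, since_scn, max_events=100):
        """Return events with scn > since_scn, or None if some were evicted."""
        oldest = self.oldest_scn()
        if oldest is not None and since_scn < oldest - 1:
            return None                        # consumer is too far behind
        return [e for e in self.events if e[0] > since_scn][:max_events]
```

Returning `None` instead of a partial result is one way to signal that the consumer cannot be served from the relay and needs another source for the missing history.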
The Bootstrap Service

• Catch-all for slow / new consumers
• Isolate source OLTP instance from large scans
• Log Store + Snapshot Store
• Optimizations
   – Periodic merge
   – Predicate push-down
   – Catch-up versus full bootstrap
• Guaranteed progress for consumers via chunking
• Implementations
   – Database (MySQL)
   – Raw Files
• Bridges the continuum between stream and batch systems
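
How a log store and a snapshot store might cooperate can be sketched as follows. This is an in-memory illustration only: the class `BootstrapStore`, its method names, and the assumption of one unique sequence number per event are all hypothetical, and the real service persists to MySQL or raw files.

```python
class BootstrapStore:
    """Toy log store + snapshot store with a periodic merge."""

    def __init__(self):
        self.log = []            # log store: (scn, key, value) in commit order
        self.snapshot = {}       # snapshot store: key -> (scn, value), latest only
        self.snapshot_scn = 0    # highest scn already folded into the snapshot

    def append(self, scn, key, value):
        self.log.append((scn, key, value))

    def merge(self):
        """Periodic merge: fold unmerged log entries into the snapshot."""
        for scn, key, value in self.log:
            if scn > self.snapshot_scn:
                self.snapshot[key] = (scn, value)
        if self.log:
            self.snapshot_scn = max(self.snapshot_scn, self.log[-1][0])

    def catch_up(self, since_scn):
        """Slow consumer: replay only the log entries it has missed."""
        return [e for e in self.log if e[0] > since_scn]

    def full_bootstrap(self):
        """New consumer: snapshot rows first, then the unmerged log tail."""
        rows = sorted((scn, k, v) for k, (scn, v) in self.snapshot.items())
        return rows + self.catch_up(self.snapshot_scn)
```

A new consumer pays for one row per key rather than the full change history, while a consumer that is only slightly behind replays a short stretch of log.
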
The Consumer Client Library

• Glue between Databus infra and business logic in the consumer
• Switches between relay and bootstrap as needed
• API
   – Callback with transactions
   – Iterators over windows
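
The switching behavior can be sketched as a single pull cycle. The function `run_consumer` and its parameters are hypothetical stand-ins rather than the library's API. The sketch assumes the relay signals "too far behind" by returning `None`, and that events are `(scn, key, value)` tuples.

```python
def run_consumer(checkpoint_scn, pull_from_relay, pull_from_bootstrap, on_event):
    """Run one pull cycle and return (new_checkpoint, source_used)."""
    events = pull_from_relay(checkpoint_scn)
    source = "relay"
    if events is None:                     # relay no longer holds these events
        events = pull_from_bootstrap(checkpoint_scn)
        source = "bootstrap"
    for scn, key, value in events:
        on_event(scn, key, value)          # business-logic callback
        checkpoint_scn = scn               # advance only after the callback returns
    return checkpoint_scn, source
```

The business logic sees the same callback whichever source served the events, which is what lets the library hide the switch from the application.
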
Fetcher Implementations

• Oracle
   – Trigger-based (see paper for details)
• MySQL
   – Custom-storage-engine based (see paper for details)
• In Labs
   – Alternative implementations for Oracle
   – OpenReplicator integration for MySQL
Meta-data Management

• Event definition, serialization and transport
   – Avro
• Oracle, MySQL
   – Table schema generates Avro definition
• Schema evolution
   – Only backwards-compatible changes allowed
• Isolation between upgrades on producer and consumer
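
A simplified check for what "backwards-compatible" can mean for record schemas. Schemas are written as Avro-style JSON records held in plain Python dicts, so no Avro library is used. The function name is hypothetical, and the rule is deliberately narrower than Avro's full schema resolution: it ignores type promotions and nested records.

```python
def is_backward_compatible(old_schema, new_schema):
    """True if a reader using new_schema can decode data written with old_schema.

    Simplified rule: every field added in new_schema needs a default, and
    fields present in both schemas must keep the same type.
    """
    old_fields = {f["name"]: f for f in old_schema["fields"]}
    for field in new_schema["fields"]:
        old = old_fields.get(field["name"])
        if old is None:
            if "default" not in field:
                return False     # added field has no value for old records
        elif old["type"] != field["type"]:
            return False         # this sketch does not model type promotion
    return True
```

Under this rule a producer can add an optional column without breaking consumers that have not yet upgraded, which is the isolation the slide refers to.
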
Partitioning the Stream

• Server-side filtering
   – Range, mod, hash
   – Allows client to control partitioning function
• Consumer groups
   – Distribute partitions evenly across a group
   – Move partitions to available consumers on failure
   – Minimize re-processing
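
The three filter types and the consumer-group behavior can be sketched as below. All function names are hypothetical, keys are assumed to be integers, and `zlib.crc32` is just one stable hash chosen for the illustration. The slides do not say which hash Databus uses.

```python
import zlib


def mod_filter(num_buckets, wanted):
    """Keep keys whose value modulo num_buckets is in the wanted set."""
    return lambda key: key % num_buckets in wanted


def range_filter(range_size, wanted):
    """Keep keys that fall in one of the wanted fixed-size ranges."""
    return lambda key: key // range_size in wanted


def hash_filter(num_buckets, wanted):
    """Keep keys whose stable hash lands in a wanted bucket."""
    return lambda key: zlib.crc32(str(key).encode()) % num_buckets in wanted


def assign_partitions(num_partitions, consumers):
    """Distribute partitions evenly (round-robin) across a consumer group."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


def reassign_on_failure(assignment, failed):
    """Move only the failed consumer's partitions to the least-loaded peers."""
    for p in assignment.pop(failed):
        target = min(assignment, key=lambda c: len(assignment[c]))
        assignment[target].append(p)
    return assignment
```

Surviving consumers keep the partitions they already own, so only the failed consumer's partitions need to be re-read from their last checkpoint.
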
Experience in Production: The Good

• Source isolation: Bootstrap benefits
   – Typically, data extracted from sources just once
   – Bootstrap service routinely used to satisfy new or slow consumers
• Common Data Format
   – Early versions used hand-written Java classes for schema → Too brittle
   – Java classes also meant many different serializations for versions of the classes
   – Avro offers ease-of-use, flexibility & performance improvements (no re-marshaling)
• Rich Subscription Support
   – Example: Search, Relevance
Experience in Production: The Bad

• Oracle Fetcher Performance Bottlenecks
   – Complex joins
   – BLOBs and CLOBs
   – Contention on the trigger table driven by high update rates
• Bootstrap: Snapshot store seeding
   – Consistent snapshot extraction from large sources
   – Complex joins hurt when trying to create exactly the same results
What’s Next?

• Open-source: Q4 2012
• Internal replication tier for Espresso
• Reduce latency further, scale to thousands of consumers per relay
   – Poll → Streaming
• Investigate alternate Oracle implementations
• Externalize joins outside the source
• User-defined functions
• Eventually-consistent systems
Three Takeaways

• Specialization in Data Systems
   – CDC pipeline is a first-class infrastructure citizen, up there with your stores and indexes
• Bootstrap Service
   – Isolates the source from abusive scans
   – Serves both streaming and batch use-cases
• Pull and External clock
   – Makes client application development simple
   – Fewer things can go wrong inside the pipeline






Editor's Notes

  1. Batch systems can consume the raw snapshots directly.