Apache BookKeeper
DISTRIBUTED STORE
a Salesforce Use Case
Venkateswararao Jujjuri (JV)
Cloud Storage Architect
vjujjuri@salesforce.com
jujjuri@gmail.com
@jvjujjuri | Twitter
https://www.linkedin.com/in/jvjujjuri
Agenda
• Salesforce needs and requirements
• Hunt and Selection
• BookKeeper Introduction
• Improvements and Enhancements
• As a Service at Scale @ Salesforce
• Performance
• Community
• Q & A
Salesforce Application Storage Needs
• Store for persistent write-ahead log (WAL), data, and objects
• Low, constant write latencies
  • Transaction log, smaller writes
• Low, constant random-read latencies
• Highly available
• Append-only entries
  • Objects
• Highly consistent for immutable data
• Long-term storage
• Distributed and linearly scalable
• On commodity hardware
• Low operating cost
What Did We Consider?
• Build vs. buy
  • Time to market, resources, cost
• Finalists
  • Ceph
    • A CP system
    • With unreliable reads, the read path can behave like an AP system
    • A lot of effort to get AP behavior on the write path
    • Remember: immutable data
  • BookKeeper
    • Behaves as a CAP system, because data is immutable and append-only
    • Came close to what we want
    • Almost there, but not everything
Apache BookKeeper
• A highly consistent, available, replicated, distributed log service
• Immutable, append-only store
• Thick client; simple and elegant placement policy
  • No central master
  • No complicated hashing or computation for placement
• Low latency on both writes and reads
• Runs on commodity hardware
• Built for the WAL use case, but can be expanded to broader storage needs
• Uses ZooKeeper as consensus service and metadata store
• Awesome community
Enter Apache BookKeeper
Apache BookKeeper
• A system to reliably log streams of records
• Designed to store write-ahead logs for database-like applications
• Inspired by, and designed to solve, HDFS NameNode availability deficiencies
• Open-source chronology
  • 2008: Open-sourced as a contribution to ZooKeeper
  • 2011: Sub-project of ZooKeeper
  • 2012: Production
Terminology
• Journal: Write-ahead log
• Ledger: Log stream
• Entry: A single record in a log stream
• Client: Library, linked with the application
• Bookie: Server
• Ensemble: Set of bookies across which a ledger is striped
• Cluster: All bookies that belong to a given instance of BookKeeper
• Write Quorum Size: Number of replicas
• Ack Quorum Size: Number of responses needed before the client's write is satisfied
• LAC: Last Add Confirmed
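To make the ensemble, write quorum, and ack quorum terms concrete, here is a minimal sketch. It assumes round-robin striping of entries across the ensemble; the function names and bookie names are hypothetical and are not part of the BookKeeper API.

```python
def write_set(entry_id, ensemble, write_quorum):
    """Bookies that should hold a replica of entry_id (round-robin striping, assumed)."""
    n = len(ensemble)
    return [ensemble[(entry_id + i) % n] for i in range(write_quorum)]


def is_acknowledged(responding_bookies, ack_quorum):
    """A write is satisfied once ack_quorum bookies have responded."""
    return len(responding_bookies) >= ack_quorum


# Hypothetical ensemble of 5 bookies, 3 replicas per entry, 2 acks required.
ensemble = ["bookie-1", "bookie-2", "bookie-3", "bookie-4", "bookie-5"]
for entry_id in range(5):
    print(entry_id, write_set(entry_id, ensemble, write_quorum=3))
```

With an ensemble larger than the write quorum, consecutive entries land on different subsets of bookies, which is what "striped" means in the Ensemble definition.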
Major Components
• Thick client: carries heavy weight in the protocol.
• Thin server (bookie): bookies never initiate any interaction with ZooKeeper or fellow bookies.
• ZooKeeper monitors bookies.
• Metadata is stored on ZooKeeper.
• Auditor monitors bookies and identifies under-replicated ledgers.
• Replication workers re-replicate under-replicated ledger copies.
Basic Operations
Create Ledger
• Gets a writer ledger handle
Add an entry to the ledger
• Write to the ledger
Open Ledger
• Gives a read-only ledger handle
• May ask for a non-recovery read handle
Get an entry from the ledger
• Read from the ledger
Close Ledger
Delete Ledger
Salesforce Application with BookKeeper
(Diagram labels: Application; Store Interface with BookKeeper client; User Library; Bookies; ZooKeeper; Server Machine.)
Commitment
Guarantees
• If an entry has been acknowledged, it must be readable.
• If an entry is read once, it must always be readable.
• If the write of entryID 'n' is successful, all entries up to 'n' are successfully committed.
Consistencies
• Last Add Confirmed (LAC) provides consistency among readers.
• Fencing provides consistency among writers.
Enhancements - In the internal branch, working to push upstream
Out-of-order write and in-order ack
• Application has the liberty to pre-allocate entryIDs
• Multiple application threads can write in parallel
User-defined ledger names
• Not restricted to BK-generated ledger names
Explicit LAC updates
• Added ReadLac and WriteLac to the protocol
• Maintain both piggybacked LAC and explicit LAC simultaneously
Enhancements - Future
Conventional namespace
• User-defined names
• Treat LedgerId as an i-node in a file system
Disk scrubbers and repairs
• Actively hunt and repair bit rot and corruption
Scalable metadata store
• Separate and dedicated metadata store
• Not restricted by ZK limitations
Out-of-order write and in-order ack
(Figure: a ledger holding entries 0 to 5; three writers, App A, App B, and App C, add entries 6, 8, and 7 in parallel.)
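The out-of-order write and in-order ack behavior can be sketched as follows. This is a toy model, not BookKeeper client code: the class and method names are hypothetical, and it assumes entry IDs are pre-allocated so that a write for a higher ID may finish before a lower one.

```python
class InOrderAcker:
    """Buffers completed writes and releases acks strictly in entry-ID order."""

    def __init__(self):
        self.completed = set()  # entry IDs written but not yet acknowledged
        self.last_acked = -1    # highest entry ID acknowledged so far

    def write_completed(self, entry_id):
        """Record a finished write; return the entry IDs that can be acked now."""
        self.completed.add(entry_id)
        acked = []
        # Release acks only while the next expected entry ID is present.
        while self.last_acked + 1 in self.completed:
            self.last_acked += 1
            self.completed.remove(self.last_acked)
            acked.append(self.last_acked)
        return acked


acker = InOrderAcker()
for entry_id in range(6):          # entries 0..5 complete in order
    acker.write_completed(entry_id)
print(acker.write_completed(8))    # [] because 6 and 7 are still outstanding
print(acker.write_completed(6))    # [6]
print(acker.write_completed(7))    # [7, 8]
```

Holding back the ack for entry 8 until 6 and 7 complete is what preserves the guarantee that a successful write of entry 'n' implies all earlier entries are committed.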
Last Add Confirmed
(Figure: the same ledger, with entries 0 to 5 followed by entries 6, 8, and 7 from writers App A, App B, and App C; LAC markers on the ledger; a reader, App D; and an X mark.)
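A small sketch of how LAC bounds what a reader may see. The helper name is hypothetical, and the entry IDs are chosen only for illustration: an entry can already be stored on bookies and still be invisible to readers because it lies beyond the LAC.

```python
def readable_entries(stored_entry_ids, lac):
    """Entries a reader may consume: only those at or below Last Add Confirmed."""
    return sorted(e for e in stored_entry_ids if e <= lac)


# Illustrative state: entry 8 is stored, entry 7 is still in flight, LAC is 6.
stored = {0, 1, 2, 3, 4, 5, 6, 8}
print(readable_entries(stored, lac=6))
```

Once entry 7 completes and the LAC advances to 8, the same call with `lac=8` would also return entries 7 and 8.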
Things Do Break
What Can Happen?
Client
• Client restarts
• Client loses connection with ZooKeeper
• Client loses connection with bookies
Bookie
• Bookie goes down
• Disk(s) on bookie go bad; I/O issues
• Bookie gets disconnected from the network
ZooKeeper
• Gets disconnected from the rest of the cluster
Writing Client Crash
(Diagram: three bookies and ZooKeeper.)
What is the last entry?
• Nothing happens until a reader attempts to read.
• The recovery process is initiated when a process opens the ledger for reading.
• Close the ledger on ZooKeeper.
• Identify the last entry of the ledger.
• Update metadata on ZooKeeper with Last Add Confirmed (LAC).
Client gets disconnected from bookies.
Either the bookie is down or the network between the client and the bookie has issues.
Contact ZooKeeper to get the list of available bookies.
Update the ensemble set and register it with ZooKeeper.
Continue with the new set.
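The ensemble change described above can be sketched like this. The function and bookie names are hypothetical; `available` stands in for the list of live bookies the client would obtain from ZooKeeper, and persisting the new ensemble to ZooKeeper is left out.

```python
def replace_failed_bookie(ensemble, failed, available):
    """Return a new ensemble with `failed` swapped for a spare live bookie."""
    candidates = [b for b in available if b not in ensemble]
    if not candidates:
        raise RuntimeError("no spare bookie available")
    replacement = candidates[0]
    # Keep the failed bookie's position so the striping order is unchanged.
    return [replacement if b == failed else b for b in ensemble]


ensemble = ["bookie-1", "bookie-2", "bookie-3"]
available = ["bookie-1", "bookie-3", "bookie-4", "bookie-5"]
new_ensemble = replace_failed_bookie(ensemble, "bookie-2", available)
print(new_ensemble)
```

The writer then continues adding entries against `new_ensemble`.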
Client gets disconnected from ZooKeeper.
Tries to reestablish the connection.
Can continue to read and write to the ledger.
Until the connection is reestablished, no metadata operations can be performed.
• Cannot create a ledger
• Cannot open a ledger
• Cannot close a ledger
Reader opens while writer is active.
Application control
BK guarantees correctness.
Reader initiates the recovery process.
• Fences the ledger on ZooKeeper.
• Informs all bookies in the ensemble that recovery has started.
• After these steps, the writer gets write errors (if actively writing).
• Reader contacts all bookies to learn the last entry.
• Replicates the last entry if it doesn't have enough replicas.
• Updates ZooKeeper with the LAC and closes the ledger.
Fencing and Recovery
Recovery begins when the ledger is opened by the reader in recovery mode.
• Check if the ledger needs recovery (not closed).
• Fence the ledger first and initiate recovery.
• Step 1: Flag that the ledger is in recovery by updating ZooKeeper state.
• Step 2: Fence bookies.
• Step 3: Recover the ledger.
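The three recovery steps can be sketched with a toy in-memory model. This is a simplification under stated assumptions, not the BookKeeper protocol: every bookie holds every entry (ensemble size equals write quorum), the `metadata` dict stands in for ledger state in ZooKeeper, and the class and function names are hypothetical.

```python
class Bookie:
    def __init__(self):
        self.entries = {}    # entry_id -> payload
        self.fenced = False

    def add(self, entry_id, payload):
        if self.fenced:
            raise PermissionError("ledger is fenced")  # writer sees a write error
        self.entries[entry_id] = payload

    def fence(self):
        """Fence the ledger on this bookie and report its highest entry ID."""
        self.fenced = True
        return max(self.entries, default=-1)


def recover(metadata, bookies):
    if metadata["state"] == "CLOSED":
        return metadata["last_entry"]              # nothing to recover
    metadata["state"] = "IN_RECOVERY"              # Step 1: flag recovery in metadata
    last_entry = max(b.fence() for b in bookies)   # Step 2: fence bookies, learn last entry
    if last_entry >= 0:                            # Step 3: re-replicate the last entry
        payload = next(b.entries[last_entry] for b in bookies if last_entry in b.entries)
        for b in bookies:
            b.entries.setdefault(last_entry, payload)  # recovery write bypasses the fence
    metadata["state"] = "CLOSED"
    metadata["last_entry"] = last_entry
    return last_entry


bookies = [Bookie(), Bookie(), Bookie()]
for b in bookies:
    b.add(0, "a")
    b.add(1, "b")
bookies[0].add(2, "c")  # writer crashed before entry 2 reached the other bookies
metadata = {"state": "OPEN", "last_entry": None}
print(recover(metadata, bookies))  # 2
```

After `recover` returns, any further `add` from the old writer raises `PermissionError`, which models fencing as the consistency mechanism among writers.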
Ledger Fencing
(Diagram labels: BookKeeper Distributed Store; Ledger; Write; Non-Recovery Read; Recovery Read; Fence & Recover; Attempt to write.)
Auto Recovery Components
(Diagram: a ZooKeeper cluster and a BookKeeper cluster of Machine-1 through Machine-N. Each machine hosts a bookie (Bookie-1 through Bookie-N), an Auditor, and a Replicator Worker; one Auditor is the lead and the others are followers.)
Bookie Crashes - Auto Recovery
Auditor
• Starts on every bookie machine; a leader is elected through ZooKeeper.
• One active auditor per cluster.
• Watches for bookie failures and manages the list of under-replicated ledgers.
Replication Workers
• Responsible for performing replication to maintain quorum copies.
• Can run on any machine in the cluster; usually run on each bookie machine.
• Work on the under-replicated ledgers list published by the Auditor.
• Pick one ledger at a time, create a lock on ZooKeeper, and replicate to the local bookie.
• If the local bookie is already part of the ensemble, drop the lock and move to the next one in the list.
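One pass of a replication worker over the Auditor's list might look like the sketch below. It is a toy model under assumptions: the `locks` set stands in for per-ledger locks on ZooKeeper, `ensembles` maps each ledger to its surviving bookies, appending the local bookie stands in for copying the data, and all names are hypothetical.

```python
def replication_pass(local_bookie, under_replicated, ensembles, locks):
    """Try to re-replicate each listed ledger onto the local bookie."""
    replicated = []
    for ledger_id in list(under_replicated):
        if ledger_id in locks:
            continue                    # another worker is handling this ledger
        locks.add(ledger_id)            # take the per-ledger lock
        try:
            if local_bookie in ensembles[ledger_id]:
                continue                # already in the ensemble; leave it for another worker
            ensembles[ledger_id].append(local_bookie)  # stands in for copying entries
            under_replicated.remove(ledger_id)
            replicated.append(ledger_id)
        finally:
            locks.discard(ledger_id)    # always drop the lock
    return replicated


ensembles = {1: ["bookie-1", "bookie-2"], 2: ["bookie-2", "bookie-3"]}
under_replicated = [1, 2]
locks = set()
print(replication_pass("bookie-3", under_replicated, ensembles, locks))  # [1]
```

Ledger 2 stays on the list because `bookie-3` already holds a copy; a worker on a different machine would pick it up.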
Heterogeneous Stores and Tiered Architecture
(Diagram: Log Store, Data Store, Archival Store.)
Clusters of storage serving App Instances
(Diagram: Log Store, Data Store, and Archival Store clusters serving multiple App Instances.)
Performance
Community Update
Projects built on BookKeeper
•  Twitter Distributed Log : Manhattan, Pub/Sub, DeferredRPC
•  Yahoo Cloud Messaging
•  Salesforce Distributed Store.
•  Huawei – HDFS NameNode
•  HubSpot – WAL
•  Majordodo – Distributed Resource Manager
Community
•  6 PMC members
•  8 Committers
•  20-25 active members
•  5 Enterprises actively using/contributing
More Info
https://cwiki.apache.org/confluence/display/BOOKKEEPER/BookKeeper+papers+and+presentations

Apache con2016final
