Arbiter volumes in gluster
Ravishankar N.
Gluster developer
@itisravi
2
Agenda
➢ Introduction to gluster
➢ Replicate (AFR) volumes in gluster
➢ Split-brains in replica volumes
➢ Client-quorum to the rescue
➢ Arbiter volumes
  ➢ Volume creation
  ➢ How they work
  ➢ Brick sizing and placement strategies
  ➢ Monitoring and troubleshooting
3
Introduction to Gluster
Image courtesy: goo.gl/pUMrxS
Know thy keywords
• gluster server
• bricks
• peers, trusted storage pool
• volume
• gluster client
• access protocols
• volume options
• translators
• graphs
• gfid
• glusterfsd, glustershd, nfs, glusterd, snapd, bitd
4
A 4x1 plain distribute volume
5
Replicate (AFR) volumes in gluster
● Writes are synchronous, i.e. sent to all bricks of the replica.
● Follows a transaction model under locks – non-blocking locks that degrade to blocking.
● Uses extended attributes to mark failures:
#getfattr -d -m . -e hex /bricks/brick1/file.txt
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/file.txt
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.testvol-client-2=0x000000020000000000000000
trusted.gfid=0xde0d499090a144ffa03841b7b317d052
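The changelog value packs three big-endian 32-bit counters: pending data, metadata and entry operations that this brick believes the named brick still owes. A minimal sketch of decoding it in Python (the helper name and the layout assumption are mine, not part of gluster):

import struct

def decode_afr_changelog(hex_value):
    """Decode a trusted.afr.* changelog value, assuming the usual
    AFR layout of three big-endian 32-bit counters: pending data,
    metadata and entry operations."""
    raw = bytes.fromhex(hex_value.removeprefix("0x"))
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}

# The value above: this brick records 2 pending data operations
# against the brick known as testvol-client-2.
print(decode_afr_changelog("0x000000020000000000000000"))
# -> {'data': 2, 'metadata': 0, 'entry': 0}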
6
(...cont'd) Replicate (AFR) volumes in gluster
● Reads – served from one brick (the 'read-subvolume') of the replica.
● The read-subvolume is a function of hash(file gfid), but is also configurable via policies.
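As a toy illustration of that mapping (the modulo hash below is a stand-in for AFR's real, policy-driven selection, and the function name is mine):

import uuid

def pick_read_subvolume(gfid, replica_count=3):
    """Map a file's gfid onto one brick of the replica, so that reads
    of a given file consistently hit the same brick while different
    files spread across the replica."""
    return uuid.UUID(gfid).int % replica_count

print(pick_read_subvolume("de0d4990-90a1-44ff-a038-41b7b317d052"))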
7
(...cont'd) Replicate (AFR) volumes in gluster
Self-heal
● GFIDs of files that need heal are stored inside the .glusterfs/indices folder of the bricks.
● The self-heal daemon crawls this folder periodically:
 - For each GFID encountered, it fetches the xattrs of the file, determines the source(s) and sink(s), and performs the heal.
● Manual launching of heal – via the gluster CLI:
#gluster volume heal <volname>
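A rough sketch of that crawl, assuming the conventional index location .glusterfs/indices/xattrop where each entry is named after a gfid (the path and helper are illustrative, and the actual heal steps are elided):

import os

def crawl_heal_index(index_dir="/bricks/brick1/.glusterfs/indices/xattrop"):
    """Sketch of the self-heal daemon's periodic index crawl."""
    for name in os.listdir(index_dir):
        if name.startswith("xattrop-"):  # skip the base file, not a gfid
            continue
        # For each gfid: fetch the trusted.afr.* xattrs from every brick,
        # work out the source(s) and sink(s), then heal (elided here).
        print("would heal gfid", name)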
8
Split-brains in replica volumes
● What is split-brain?
 – A difference in file data/metadata across the bricks of a replica.
 – We cannot identify which brick holds the good copy, even when all bricks are available.
 – Each brick accuses the other of needing heal.
 – All modification FOPs fail with Input/Output Error (EIO).
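In terms of the changelog xattrs from slide 5, "each brick accuses the other" looks like this in miniature (a sketch for a 2-brick replica, considering only the data counter; real AFR weighs all three counters):

def classify(b1_blames_b2, b2_blames_b1):
    """Decide heal direction from the pending-data counters each
    brick holds about the other brick."""
    if b1_blames_b2 and b2_blames_b1:
        return "split-brain: mutual accusation, modifications fail with EIO"
    if b1_blames_b2:
        return "heal: brick1 is the source, brick2 the sink"
    if b2_blames_b1:
        return "heal: brick2 is the source, brick1 the sink"
    return "in sync"

print(classify(2, 1))  # -> split-brain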
9
Split-brains in replica volumes
● How does it happen?
 – Split-brain in space:
 Two clients write to different replicas of the file, leading to inconsistencies.
10
(...cont'd) Split-brains in replica volumes
● How does it happen?
 – Split-brain in time:
 The same client writes to different replicas of the file at different times, before healing begins.
11
(...cont'd) Split-brains in replica volumes
● How do you prevent it?
● Theoretically speaking, use robust networks with zero disconnects, packet losses, and hardware issues.
But if wishes were horses...
12
Client-quorum to the rescue
● quorum /ˈkwɔːrəm/ (noun): the smallest number of people who must be present at a meeting in order for decisions to be made.
● In glusterFS parlance: the minimum number of bricks that need to be up to allow modifications.
● What does this mean for replica-2?
 – The client needs to be connected to both bricks at all times.
 – Fault tolerance = no down bricks, i.e. no high availability.
13
(...cont'd) Client-quorum to the rescue
➢ How does replica-3 solve the problem?
 – It uses client quorum in 'auto' mode, i.e. 2 bricks need to be up to allow modifications.
 – In other words, fault tolerance = 1 down brick.
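For replica-3, 'auto' is the default; the mode can also be set explicitly from the CLI (option name as per the gluster documentation):
#gluster volume set <volname> cluster.quorum-type auto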
14
(...cont'd) Client-quorum to the rescue
● Network split-brain – a corner case in replica-3 volumes.
● Three clients simultaneously write to non-overlapping (offset + range) regions of the same file:
 C1 succeeds on B1, B2
 C2 succeeds on B2, B3
 C3 succeeds on B1, B3
● Every write meets quorum, yet each brick has missed one write – no single brick holds a copy that witnessed all three.
15
(...cont'd) Client-quorum to the rescue
➢ Key point to understand – to prevent split-brain, there will always be a case where you need to block FOPs, no matter how many replicas are used, if the only true copy is unavailable.
➢ What extra replicas give you is an increased chance that more than one brick hosts the true copy (i.e. a copy that has witnessed all writes without any failures).
16
Arbiter volumes
➢ A replica-3 volume where the 3rd brick only stores file metadata and
the directory hierarchy.
17
(...cont'd) Arbiter volumes
➢ Consumes much less than 3x space – but *how much* space? We'll see later.
➢ Provides the same level of consistency (though not availability) as replica-3 volumes
 – i.e. it allows a fault tolerance of 1 down brick.
➢ A sweet spot between 2-way and 3-way replication.
18
Creating Arbiter volumes
#gluster volume create testvol replica 3 arbiter 1 host1:/bricks/brick1 host2:/bricks/brick2 host3:/bricks/brick3 host4:/bricks/brick4 host5:/bricks/brick5 host6:/bricks/brick6
#gluster volume info testvol
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: ae6c4162-38c2-4368-ae5d-6bad141a4119
Status: Created
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: host1:/bricks/brick1
Brick2: host2:/bricks/brick2
Brick3: host3:/bricks/brick3 (arbiter)
Brick4: host4:/bricks/brick4
Brick5: host5:/bricks/brick5
Brick6: host6:/bricks/brick6 (arbiter)
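➢ Newer gluster releases can also convert an existing replica-2 volume by adding arbiter bricks (check your version's documentation for the exact syntax):
#gluster volume add-brick <volname> replica 3 arbiter 1 <new-arbiter-bricks>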
19
Arbiter volumes- How do they work?
➢ The arbiter translator runs in (every 3rd) brick process.
➢ The arbitration logic lives in AFR, in the client process.
20
Arbiter volumes- How do they work?
Arbitration logic:
➢ Take full locks for writes.
➢ Decide go/no-go before sending the write to the bricks.
➢ Decide whether to return success or failure after we get the responses from the bricks.
Scenarios:
a) All 3 bricks (including the arbiter) are up → allow writes.
b) 2 data bricks are up, the arbiter is down → allow writes.
c) 1 data brick and the arbiter are up → allow writes IFF the arbiter doesn't blame the up brick.
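The three scenarios condense into a small go/no-go check; a sketch (the brick naming and the blame lookup are illustrative):

def allow_write(data1_up, data2_up, arbiter_up, arbiter_blames_up_brick=False):
    """Sketch of AFR's arbitration decision for a write."""
    up_data_bricks = int(data1_up) + int(data2_up)
    if up_data_bricks == 2:
        return True   # scenarios (a) and (b)
    if up_data_bricks == 1 and arbiter_up:
        # scenario (c): safe only if the arbiter's changelog does not
        # blame the surviving data brick.
        return not arbiter_blames_up_brick
    return False      # quorum lost: block the write

print(allow_write(True, False, True))                                # True
print(allow_write(True, False, True, arbiter_blames_up_brick=True))  # False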
21
Arbiter volumes- How do they work?
➢ Self-heal mechanism
 – The arbiter can serve as a source for metadata heals (but not data heals).
➢ What about performance?
 – Single writer: no penalty.
 – Multiple writers: a little overhead due to full-file locks.
22
Brick sizing and placement
● What does each arbiter brick store?
 – The directory hierarchy + xattrs
 – The .glusterfs folder
● A safer-side estimate is 4KB/file x no. of files (see the worked example below).
 Practical estimate from community users: 1KB/file x no. of files
 https://gist.github.com/pfactum/e8265ca07f7b19f30bb3
● Placement strategies
 – A low-spec machine holding the arbiter bricks of all replicas.
 – Daisy chaining: place the arbiter bricks of different replicas on the different nodes that host the data bricks.
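To make the sizing concrete: a volume expected to hold 10 million files needs roughly 4KB x 10,000,000 ≈ 40GB per arbiter brick by the safer estimate, or about 10GB by the community figure – a small fraction of the data bricks either way.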
23
Monitoring volumes
● ENOTCONN messages in client logs
 These do not necessarily mean a lost connection; they can come from the arbitration logic preventing split-brains.
● #gluster volume heal <volname> info → monitor heals
● #gluster volume heal <volname> info split-brain
 → Must always show zero entries (why? Because arbiter volumes prevent split-brains!)
 → But it doesn't? Report a bug!
24
Epilogue
● More information on arbiter volumes:
http://gluster.readthedocs.io/en/latest/Administrator%20Guide/arbiter-volumes-and-quorum/
● Gluster upstream mailing lists: https://www.gluster.org/mailman/listinfo
● IRC: #gluster-users and #gluster-devel on freenode
25
Questions / comments?
Thank you!
