2. Agenda
▸ Alternating info-dump and hands-on
▹ This is part of the info-dump ;)
▸ Gluster basics
▸ Initial setup
▸ Extra features
▸ Maintenance and troubleshooting
3. Who Am I?
▸ One of three project-wide architects
▸ First Red Hat employee to be seriously involved with Gluster (before acquisition)
▸ Previously worked on NFS (v2..v4), Lustre, PVFS2, others
▸ General distributed-storage blatherer
▹ http://pl.atyp.us / @Obdurodon
5. Some Terminology
▸ A brick is simply a directory on a server
▸ We use translators to combine bricks into more complex subvolumes
▹ For scale, replication, sharding, ...
▸ This forms a translator graph, contained in a volfile (sketched below)
▸ Internal daemons (e.g. self-heal) use the same bricks arranged into slightly different volfiles
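To make "translator graph" and "volfile" concrete, here is a hand-trimmed sketch of a client volfile for a two-way replicated volume. The names are illustrative, and a real volfile generated by glusterd contains many more translators (caching, locks, and so on):

volume fubar-client-0
    type protocol/client            # talks to the brick on serverA
    option remote-host serverA
    option remote-subvolume /brick1
end-volume

volume fubar-client-1
    type protocol/client            # talks to the brick on serverB
    option remote-host serverB
    option remote-subvolume /brick2
end-volume

volume fubar-replicate-0
    type cluster/replicate          # AFR: replicates across its subvolumes
    subvolumes fubar-client-0 fubar-client-1
end-volume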
6. Hands On: Getting Started
1. Use the RHGS test drive
▹ http://bit.ly/glustertestdrive
2. Start a Fedora/CentOS VM
▹ Use yum/dnf to install the gluster packages (commands below):
▹ base, libs, server, fuse, client-xlators, cli
3. Docker Docker Docker
▹ https://github.com/gluster/gluster-containers
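For option 2, a minimal sketch, assuming the stock Fedora/CentOS repos (the packages named above carry a glusterfs- prefix there):

# Install the pieces named above (use yum on older CentOS)
dnf install glusterfs glusterfs-libs glusterfs-server \
    glusterfs-fuse glusterfs-client-xlators glusterfs-cli
# Start the management daemon
systemctl start glusterd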
7. Brick / Translator Example
[Diagram: four servers, one brick each: Server A:/brick1, Server B:/brick2, Server C:/brick3, Server D:/brick4]
8. Brick / Translator Example
[Diagram: Server A:/brick1 and Server B:/brick2 form Replica Set 1 (a subvolume); Server C:/brick3 and Server D:/brick4 form Replica Set 2 (also a subvolume)]
9. Brick / Translator Example
[Diagram: Replica Set 1 (Server A:/brick1, Server B:/brick2) and Replica Set 2 (Server C:/brick3, Server D:/brick4) combine into volume "fubar"]
14. Hands On: Connect Servers
[root@vagrant-testVM glusterfs]# gluster peer probe 192.168.121.66
peer probe: success.
[root@vagrant-testVM glusterfs]# gluster peer status
Number of Peers: 1
Hostname: 192.168.121.66
Uuid: 95aee0b5-c816-445b-8dbc-f88da7e95660
State: Accepted peer request (Connected)
15. Hands On: Server Volume Setup
[root@vagrant-testVM glusterfs]# gluster volume create fubar replica 2 testvm:/d/backends/fubar{0,1} force
volume create: fubar: success: please start the volume to access data
[root@vagrant-testVM glusterfs]# gluster volume info fubar
... (see for yourself)
[root@vagrant-testVM glusterfs]# gluster volume status fubar
Volume fubar is not started
16. Hands On: Server Volume Setup
[root@vagrant-testVM glusterfs]# gluster volume start fubar
volume start: fubar: success
[root@vagrant-testVM glusterfs]# gluster volume status fubar
Status of volume: fubar
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick testvm:/d/backends/fubar0             49152     0          Y       13104
Brick testvm:/d/backends/fubar1             49153     0          Y       13133
Self-heal Daemon on localhost               N/A       N/A        Y       13163
Task Status of Volume fubar
------------------------------------------------------------------------------
There are no active volume tasks
17. Hands On: Client Volume Setup
[root@vagrant-testVM glusterfs]# mount -t glusterfs testvm:fubar /mnt/glusterfs/0
[root@vagrant-testVM glusterfs]# df /mnt/glusterfs/0
Filesystem    1K-blocks  Used Available Use% Mounted on
testvm:fubar    5232640 33280   5199360   1% /mnt/glusterfs/0
[root@vagrant-testVM glusterfs]# ls -a /mnt/glusterfs/0
. ..
[root@vagrant-testVM glusterfs]# ls -a /d/backends/fubar0
. .. .glusterfs
18. Hands On: It’s a Filesystem!
▸ Create some files (as shown below)
▸ Create directories, symlinks, ...
▸ Rename, delete, ...
▸ Test performance
▹ OK, not yet
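A few throwaway commands along those lines, using the mount point from the previous slide:

cd /mnt/glusterfs/0
echo hello > file1          # create a file
mkdir dir1                  # create a directory
ln -s file1 link1           # symlink
mv file1 file2              # rename
rm -f file2 link1 && rmdir dir1   # delete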
19. Distribution and Rebalancing
[Diagram: the hash space 0..0xffffffff is split at 0x7fffffff into Server X's range and Server Y's range, with files shown as dots landing in one range or the other]
▸ Each brick "claims" a range of hash values
▹ The collection of claims is called a layout (inspectable as shown below)
▸ Files are hashed and placed on the brick claiming that range
▸ When bricks are added, claims are adjusted to minimize data motion
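You can inspect a brick's claimed range directly: DHT records the layout in a trusted.glusterfs.dht extended attribute on each directory of each brick. A sketch using the brick path from earlier slides; the hex value shown is illustrative (it encodes the start and end of the brick's hash range):

# Read the layout xattr for the volume root on one brick
getfattr -n trusted.glusterfs.dht -e hex /d/backends/fubar0
# trusted.glusterfs.dht=0x0000000100000000000000007ffffffe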
20. Distribution and Rebalancing
[Diagram, before: Server X claims 0..0x80000000 and Server Y claims 0x80000000..0xffffffff. After adding Server Z: X claims 0..0x55555555, Z claims 0x55555555..0xaaaaaaaa, Y claims 0xaaaaaaaa..0xffffffff, so only files in the reassigned ranges move (X->Z and Y->Z)]
21. Sharding
▸ Divides files into chunks
▸ Each chunk is placed separately according to hash
▸ High probability (not certainty) of chunks being on different subvolumes
▸ Spreads capacity and I/O across subvolumes (enabled as shown below)
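Sharding is off by default and enabled per volume; the option names below are the real ones, the block size is illustrative:

gluster volume set fubar features.shard on
gluster volume set fubar features.shard-block-size 64MB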
23. Hands On: Adding a Brick
[root@vagrant-testVM glusterfs]# gluster volume add-brick xyzzy testvm:/d/backends/xyzzy2
volume add-brick: success
[root@vagrant-testVM glusterfs]# gluster volume rebalance xyzzy fix-layout start
volume rebalance: xyzzy: success: Rebalance on xyzzy has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 88782248-7c12-4ba8-97f6-f5ce6815963
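Note that fix-layout only adjusts the hash ranges; to also migrate existing files into the new layout, run a full rebalance:

gluster volume rebalance xyzzy start     # moves data, not just layout
gluster volume rebalance xyzzy status    # poll until completed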
25. Split Brain (problem definition)
▸ "Split brain" is when we don't have enough information to determine the correct recovery action
▸ Can be caused by node failure or network partition
▸ Every distributed data store has to prevent and/or deal with it
26. How Replication Works
▸ Client sends each operation (e.g. a write) to all replicas directly
▸ Coordination: pre-op, post-op, locking
▹ enables recovery in case of failure (bookkeeping visible as shown below)
▸ Self-heal (repair) usually done by an internal daemon
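The pre-op/post-op bookkeeping shows up as trusted.afr.* extended attributes on each brick's copy of a file; non-zero pending counts mean the other copy may need healing. A sketch against the brick path used earlier (output value illustrative):

getfattr -d -m . -e hex /d/backends/fubar0/best-sf | grep trusted.afr
# trusted.afr.fubar-client-1=0x000000010000000000000000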
27. Split Brain (how it happens)
[Diagram: a network partition separates Server A and Client X from Server B and Client Y, so each client keeps writing to the one replica it can still reach]
28. Split Brain (what it looks like)
[root@vagrant-testVM glusterfs]# ls /mnt/glusterfs/0
ls: cannot access /mnt/glusterfs/0/best-sf: Input/output error
best-sf
[root@vagrant-testVM glusterfs]# cat /mnt/glusterfs/0/best-sf
cat: /mnt/glusterfs/0/best-sf: Input/output error
[root@vagrant-testVM glusterfs]# cat /d/backends/fubar0/best-sf
star trek
[root@vagrant-testVM glusterfs]# cat /d/backends/fubar1/best-sf
star wars
What the...?
29. Split Brain (dealing with it)
▸ Primary mechanism: quorum
▹ server side, client side, or both
▹ arbiters
▸ Secondary: rule-based resolution
▹ e.g. largest file, latest timestamp
▹ Thanks, Facebook!
▸ Last choice: manual repair (see the commands below)
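The rule-based policies are exposed through the heal command, which beats manual xattr surgery; a sketch using the volume and file from the earlier example:

# List files currently in split brain
gluster volume heal fubar info split-brain
# Resolve one file by policy: bigger-file, latest-mtime, or source-brick
gluster volume heal fubar split-brain latest-mtime /best-sf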
30. Server Side Quorum
[Diagram: Bricks A, B, C after a partition; on the majority side, Client X's writes succeed, while the minority side's brick is forced down, leaving Client Y with no servers]
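Server-side quorum is controlled by volume options; a minimal sketch (the 51% ratio is illustrative, and the ratio option is set cluster-wide on "all"):

gluster volume set fubar cluster.server-quorum-type server
gluster volume set all cluster.server-quorum-ratio 51%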
31. Client Side Quorum
[Diagram: Bricks A, B, C after a partition; Client X can reach a majority of replicas and its writes succeed, while Client Y stays up but has its writes rejected locally (EROFS)]
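Client-side quorum is likewise set per volume; "auto" requires a majority of the replica set:

gluster volume set fubar cluster.quorum-type auto
# Or demand a specific number of live replicas:
gluster volume set fubar cluster.quorum-type fixed
gluster volume set fubar cluster.quorum-count 2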
32. Erasure Coding
▸ Encode N input blocks into N+K output blocks, so that the original can be recovered from any N
▸ RAID is erasure coding with K=1 (RAID 5) or K=2 (RAID 6)
▸ Our implementation mostly has the same flow as replication (volume creation sketched below)
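Erasure-coded ("dispersed") volumes are created with the disperse/redundancy keywords; for example N=4, K=2 across six bricks (server names and paths are placeholders):

# 6 bricks total; any 4 suffice to reconstruct the data
gluster volume create ecvol disperse 6 redundancy 2 \
    server{1..6}:/d/backends/ecbrick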
36. Quota
▸ Gluster supports directory-level quota
▸ For nested directories, the lowest applicable limit applies
▸ Soft and hard limits (set as shown below)
▹ Exceeding the soft limit gets logged
▹ Exceeding the hard limit gets EDQUOT
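Setting up the limits used in the hands-on slides that follow would look roughly like this (volume and directory names match those slides):

gluster volume quota xyzzy enable
gluster volume quota xyzzy limit-usage /john 100MB 80%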
37. Quota
▸ Problem: global vs. local limits
▹ quota is global (per volume)
▹ files are pseudo-randomly distributed across bricks
▸ How do we enforce a global limit when each brick sees only its own files?
▸ A quota daemon exists to handle this coordination
39. Hands On: Quota
[root@vagrant-testVM glusterfs]# gluster volume quota xyzzy list
Path     Hard-limit  Soft-limit   Used    Available  Soft-limit exceeded?  Hard-limit exceeded?
------------------------------------------------------------------------------------------------
/john       100.0MB  80%(80.0MB)  0Bytes    100.0MB                    No                    No
40. Hands On: Quota
[root@vagrant-testVM glusterfs]# dd if=/dev/zero of=/mnt/glusterfs/0/john/bigfile bs=1048576 count=85 conv=sync
85+0 records in
85+0 records out
89128960 bytes (89 MB) copied, 1.83037 s, 48.7 MB/s
[root@vagrant-testVM glusterfs]# grep -i john /var/log/glusterfs/bricks/*
/var/log/glusterfs/bricks/d-backends-xyzzy0.log:[2016-11-29 14:31:44.581934] A [MSGID: 120004] [quota.c:4973:quota_log_usage] 0-xyzzy-quota: Usage crossed soft limit: 80.0MB used by /john
41. Hands On: Quota
[root@vagrant-testVM glusterfs]# dd if=/dev/zero of=/mnt/glusterfs/0/john/bigfile2 bs=1048576 count=85 conv=sync
dd: error writing '/mnt/glusterfs/0/john/bigfile2': Disk quota exceeded
[root@vagrant-testVM glusterfs]# gluster volume quota xyzzy list | cut -c 66-
Used    Available  Soft-limit exceeded?  Hard-limit exceeded?
--------------------------------------------------------------
101.9MB    0Bytes                   Yes                   Yes
42. Snapshots
▸ Gluster supports read-only snapshots and writable clones of snapshots
▸ Also, snapshot restores
▸ Support is based on / tied to LVM thin provisioning
▹ originally supposed to be more platform-agnostic
▹ maybe some day it really will be
43. Hands On: Snapshots
[root@vagrant-testVM glusterfs]# fallocate -l $((100*1024*1024)) /tmp/snap-brick0
[root@vagrant-testVM glusterfs]# losetup --show -f /tmp/snap-brick0
/dev/loop3
[root@vagrant-testVM glusterfs]# vgcreate snap-vg0 /dev/loop3
Volume group "snap-vg0" successfully created
44. Hands On: Snapshots
[root@vagrant-testVM glusterfs]# lvcreate -L 50MB -T /dev/snap-vg0/thinpool
Rounding up size to full physical extent 52.00 MiB
Logical volume "thinpool" created.
[root@vagrant-testVM glusterfs]# lvcreate -V 200MB -T /dev/snap-vg0/thinpool -n snap-lv0
Logical volume "snap-lv0" created.
[root@vagrant-testVM glusterfs]# mkfs.xfs /dev/snap-vg0/snap-lv0
...
[root@vagrant-testVM glusterfs]# mount /dev/snap-vg0/snap-lv0 /d/backends/xyzzy0
...
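The deck jumps from setup to restore here; presumably the omitted slides recreate the volume on the thin LV and snapshot it. The snapshot restored on the next slide would have been taken with something like this (gluster appends the GMT timestamp to the name by default):

gluster snapshot create snap1 xyzzy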
48. Hands On: Snapshots
# Unmount and stop clone.
# Stop original volume - but leave snapshot activated!
[root@vagrant-testVM glusterfs]# gluster snapshot restore snap1_GMT-2016.11.29-14.57.11
Restore operation will replace the original volume with the snapshotted volume. Do you still want to continue? (y/n) y
Snapshot restore: snap1_GMT-2016.11.29-14.57.11: Snap restored successfully
[root@vagrant-testVM glusterfs]# gluster volume start xyzzy
volume start: xyzzy: success
[root@vagrant-testVM glusterfs]# ls /mnt/glusterfs/0
file1 file2