2. Agenda
▸ Alternating info-dump and hands-on
▹ This is part of the info-dump ;)
▸ Gluster basics
▸ Initial setup
▸ Extra features
▸ Maintenance and troubleshooting
3. Who Am I?
▸ One of three project-wide architects
▸ First Red Hat employee to be seriously involved with Gluster (before acquisition)
▸ Previously worked on NFS (v2..v4), Lustre, PVFS2, others
▸ General distributed-storage blatherer
▹ http://pl.atyp.us / @Obdurodon
5. Some Terminology
▸ A brick is simply a directory on a server
▸ We use translators to combine bricks into more complex subvolumes
▹ For scale, replication, sharding, ...
▸ This forms a translator graph, contained in a volfile (sketched below)
▸ Internal daemons (e.g. self-heal) use the same bricks arranged into slightly different volfiles
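To make "translator graph" and "volfile" concrete, here is a hand-trimmed sketch of a client volfile for a two-way replicated volume. The names are illustrative, and a real volfile generated by glusterd contains many more translators (caching, locks, and so on):

volume fubar-client-0
    type protocol/client            # talks to the brick on serverA
    option remote-host serverA
    option remote-subvolume /brick1
end-volume

volume fubar-client-1
    type protocol/client            # talks to the brick on serverB
    option remote-host serverB
    option remote-subvolume /brick2
end-volume

volume fubar-replicate-0
    type cluster/replicate          # AFR: replicates across its subvolumes
    subvolumes fubar-client-0 fubar-client-1
end-volume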
6. Hands On: Getting Started
1. Use the RHGS test drive
▹ http://bit.ly/glustertestdrive
2. Start a Fedora/CentOS VM
▹ Use yum/dnf to install the gluster packages (commands below):
▹ base, libs, server, fuse, client-xlators, cli
3. Docker Docker Docker
▹ https://github.com/gluster/gluster-containers
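For option 2, a minimal sketch, assuming the stock Fedora/CentOS repos (the packages named above carry a glusterfs- prefix there):

# Install the pieces named above (use yum on older CentOS)
dnf install glusterfs glusterfs-libs glusterfs-server \
    glusterfs-fuse glusterfs-client-xlators glusterfs-cli
# Start the management daemon
systemctl start glusterd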
7. Brick / Translator Example
[Diagram: four servers, one brick each: Server A:/brick1, Server B:/brick2, Server C:/brick3, Server D:/brick4]
8. Brick / Translator Example
[Diagram: Server A:/brick1 and Server B:/brick2 form Replica Set 1 (a subvolume); Server C:/brick3 and Server D:/brick4 form Replica Set 2 (also a subvolume)]
9. Brick / Translator Example
[Diagram: Replica Set 1 (Server A:/brick1, Server B:/brick2) and Replica Set 2 (Server C:/brick3, Server D:/brick4) combine into volume "fubar"]
14. Hands On: Connect Servers
[root@vagrant-testVM glusterfs]# gluster peer probe 192.168.121.66
peer probe: success.
[root@vagrant-testVM glusterfs]# gluster peer status
Number of Peers: 1
Hostname: 192.168.121.66
Uuid: 95aee0b5-c816-445b-8dbc-f88da7e95660
State: Accepted peer request (Connected)
15. Hands On: Server Volume Setup
[root@vagrant-testVM glusterfs]# gluster volume create fubar replica 2 testvm:/d/backends/fubar{0,1} force
volume create: fubar: success: please start the volume to access data
[root@vagrant-testVM glusterfs]# gluster volume info fubar
... (see for yourself)
[root@vagrant-testVM glusterfs]# gluster volume status fubar
Volume fubar is not started
16. Hands On: Server Volume Setup
[root@vagrant-testVM glusterfs]# gluster volume start fubar
volume start: fubar: success
[root@vagrant-testVM glusterfs]# gluster volume status fubar
Status of volume: fubar
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick testvm:/d/backends/fubar0             49152     0          Y       13104
Brick testvm:/d/backends/fubar1             49153     0          Y       13133
Self-heal Daemon on localhost               N/A       N/A        Y       13163
Task Status of Volume fubar
------------------------------------------------------------------------------
There are no active volume tasks
17. Hands On: Client Volume Setup
[root@vagrant-testVM glusterfs]# mount -t glusterfs testvm:fubar /mnt/glusterfs/0
[root@vagrant-testVM glusterfs]# df /mnt/glusterfs/0
Filesystem    1K-blocks  Used Available Use% Mounted on
testvm:fubar    5232640 33280   5199360   1% /mnt/glusterfs/0
[root@vagrant-testVM glusterfs]# ls -a /mnt/glusterfs/0
. ..
[root@vagrant-testVM glusterfs]# ls -a /d/backends/fubar0
. .. .glusterfs
18. Hands On: It’s a Filesystem!
▸ Create some files (as shown below)
▸ Create directories, symlinks, ...
▸ Rename, delete, ...
▸ Test performance
▹ OK, not yet
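A few throwaway commands along those lines, using the mount point from the previous slide:

cd /mnt/glusterfs/0
echo hello > file1          # create a file
mkdir dir1                  # create a directory
ln -s file1 link1           # symlink
mv file1 file2              # rename
rm -f file2 link1 && rmdir dir1   # delete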
19. Distribution and Rebalancing
[Diagram: the hash space 0..0xffffffff is split at 0x7fffffff into Server X's range and Server Y's range, with files shown as dots landing in one range or the other]
▸ Each brick "claims" a range of hash values
▹ The collection of claims is called a layout (inspectable as shown below)
▸ Files are hashed and placed on the brick claiming that range
▸ When bricks are added, claims are adjusted to minimize data motion
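You can inspect a brick's claimed range directly: DHT records the layout in a trusted.glusterfs.dht extended attribute on each directory of each brick. A sketch using the brick path from earlier slides; the hex value shown is illustrative (it encodes the start and end of the brick's hash range):

# Read the layout xattr for the volume root on one brick
getfattr -n trusted.glusterfs.dht -e hex /d/backends/fubar0
# trusted.glusterfs.dht=0x0000000100000000000000007ffffffe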
20. Distribution and Rebalancing
[Diagram, before: Server X claims 0..0x80000000 and Server Y claims 0x80000000..0xffffffff. After adding Server Z: X claims 0..0x55555555, Z claims 0x55555555..0xaaaaaaaa, Y claims 0xaaaaaaaa..0xffffffff, so only files in the reassigned ranges move (X->Z and Y->Z)]
21. Sharding
▸ Divides files into chunks
▸ Each chunk is placed separately according to hash
▸ High probability (not certainty) of chunks being on different subvolumes
▸ Spreads capacity and I/O across subvolumes (enabled as shown below)
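Sharding is off by default and enabled per volume; the option names below are the real ones, the block size is illustrative:

gluster volume set fubar features.shard on
gluster volume set fubar features.shard-block-size 64MB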
23. Hands On: Adding a Brick
[root@vagrant-testVM glusterfs]# gluster volume add-brick xyzzy testvm:/d/backends/xyzzy2
volume add-brick: success
[root@vagrant-testVM glusterfs]# gluster volume rebalance xyzzy fix-layout start
volume rebalance: xyzzy: success: Rebalance on xyzzy has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 88782248-7c12-4ba8-97f6-f5ce6815963
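Note that fix-layout only adjusts the hash ranges; to also migrate existing files into the new layout, run a full rebalance:

gluster volume rebalance xyzzy start     # moves data, not just layout
gluster volume rebalance xyzzy status    # poll until completed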
25. Split Brain (problem definition)
▸ "Split brain" is when we don't have enough information to determine the correct recovery action
▸ Can be caused by node failure or network partition
▸ Every distributed data store has to prevent and/or deal with it
26. How Replication Works
▸ Client sends each operation (e.g. a write) to all replicas directly
▸ Coordination: pre-op, post-op, locking
▹ enables recovery in case of failure (bookkeeping visible as shown below)
▸ Self-heal (repair) usually done by an internal daemon
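The pre-op/post-op bookkeeping shows up as trusted.afr.* extended attributes on each brick's copy of a file; non-zero pending counts mean the other copy may need healing. A sketch against the brick path used earlier (output value illustrative):

getfattr -d -m . -e hex /d/backends/fubar0/best-sf | grep trusted.afr
# trusted.afr.fubar-client-1=0x000000010000000000000000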
27. Split Brain (how it happens)
[Diagram: a network partition separates Server A and Client X from Server B and Client Y, so each client keeps writing to the one replica it can still reach]
28. Split Brain (what it looks like)
[root@vagrant-testVM glusterfs]# ls /mnt/glusterfs/0
ls: cannot access /mnt/glusterfs/0/best-sf: Input/output error
best-sf
[root@vagrant-testVM glusterfs]# cat /mnt/glusterfs/0/best-sf
cat: /mnt/glusterfs/0/best-sf: Input/output error
[root@vagrant-testVM glusterfs]# cat /d/backends/fubar0/best-sf
star trek
[root@vagrant-testVM glusterfs]# cat /d/backends/fubar1/best-sf
star wars
What the...?
29. Split Brain (dealing with it)
▸ Primary mechanism: quorum
▹ server side, client side, or both
▹ arbiters
▸ Secondary: rule-based resolution
▹ e.g. largest file, latest timestamp
▹ Thanks, Facebook!
▸ Last choice: manual repair (see the commands below)
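The rule-based policies are exposed through the heal command, which beats manual xattr surgery; a sketch using the volume and file from the earlier example:

# List files currently in split brain
gluster volume heal fubar info split-brain
# Resolve one file by policy: bigger-file, latest-mtime, or source-brick
gluster volume heal fubar split-brain latest-mtime /best-sf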
30. Server Side Quorum
[Diagram: Bricks A, B, C after a partition; on the majority side, Client X's writes succeed, while the minority side's brick is forced down, leaving Client Y with no servers]
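Server-side quorum is controlled by volume options; a minimal sketch (the 51% ratio is illustrative, and the ratio option is set cluster-wide on "all"):

gluster volume set fubar cluster.server-quorum-type server
gluster volume set all cluster.server-quorum-ratio 51%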
31. Client Side Quorum
[Diagram: Bricks A, B, C after a partition; Client X can reach a majority of replicas and its writes succeed, while Client Y stays up but has its writes rejected locally (EROFS)]
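Client-side quorum is likewise set per volume; "auto" requires a majority of the replica set:

gluster volume set fubar cluster.quorum-type auto
# Or demand a specific number of live replicas:
gluster volume set fubar cluster.quorum-type fixed
gluster volume set fubar cluster.quorum-count 2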
32. Erasure Coding
▸ Encode N input blocks into N+K output blocks, so that the original can be recovered from any N
▸ RAID is erasure coding with K=1 (RAID 5) or K=2 (RAID 6)
▸ Our implementation mostly has the same flow as replication (volume creation sketched below)
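Erasure-coded ("dispersed") volumes are created with the disperse/redundancy keywords; for example N=4, K=2 across six bricks (server names and paths are placeholders):

# 6 bricks total; any 4 suffice to reconstruct the data
gluster volume create ecvol disperse 6 redundancy 2 \
    server{1..6}:/d/backends/ecbrick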
36. Quota
▸ Gluster supports directory-level quota
▸ For nested directories, the lowest applicable limit applies
▸ Soft and hard limits (set as shown below)
▹ Exceeding the soft limit gets logged
▹ Exceeding the hard limit gets EDQUOT
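Setting up the limits used in the hands-on slides that follow would look roughly like this (volume and directory names match those slides):

gluster volume quota xyzzy enable
gluster volume quota xyzzy limit-usage /john 100MB 80%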
37. Quota
▸ Problem: global vs. local limits
▹ quota is global (per volume)
▹ files are pseudo-randomly distributed across bricks
▸ How do we enforce a global limit when each brick sees only its own files?
▸ A quota daemon exists to handle this coordination
39. Hands On: Quota
[root@vagrant-testVM glusterfs]# gluster volume quota xyzzy list
Path     Hard-limit  Soft-limit   Used    Available  Soft-limit exceeded?  Hard-limit exceeded?
------------------------------------------------------------------------------------------------
/john       100.0MB  80%(80.0MB)  0Bytes    100.0MB                    No                    No
40. Hands On: Quota
[root@vagrant-testVM glusterfs]# dd if=/dev/zero of=/mnt/glusterfs/0/john/bigfile bs=1048576 count=85 conv=sync
85+0 records in
85+0 records out
89128960 bytes (89 MB) copied, 1.83037 s, 48.7 MB/s
[root@vagrant-testVM glusterfs]# grep -i john /var/log/glusterfs/bricks/*
/var/log/glusterfs/bricks/d-backends-xyzzy0.log:[2016-11-29 14:31:44.581934] A [MSGID: 120004] [quota.c:4973:quota_log_usage] 0-xyzzy-quota: Usage crossed soft limit: 80.0MB used by /john
41. Hands On: Quota
[root@vagrant-testVM glusterfs]# dd if=/dev/zero of=/mnt/glusterfs/0/john/bigfile2 bs=1048576 count=85 conv=sync
dd: error writing '/mnt/glusterfs/0/john/bigfile2': Disk quota exceeded
[root@vagrant-testVM glusterfs]# gluster volume quota xyzzy list | cut -c 66-
Used    Available  Soft-limit exceeded?  Hard-limit exceeded?
--------------------------------------------------------------
101.9MB    0Bytes                   Yes                   Yes
42. Snapshots
▸ Gluster supports read-only snapshots and writable clones of snapshots
▸ Also, snapshot restores
▸ Support is based on / tied to LVM thin provisioning
▹ originally supposed to be more platform-agnostic
▹ maybe some day it really will be
43. Hands On: Snapshots
[root@vagrant-testVM glusterfs]# fallocate -l $((100*1024*1024)) /tmp/snap-brick0
[root@vagrant-testVM glusterfs]# losetup --show -f /tmp/snap-brick0
/dev/loop3
[root@vagrant-testVM glusterfs]# vgcreate snap-vg0 /dev/loop3
Volume group "snap-vg0" successfully created
44. Hands On: Snapshots
[root@vagrant-testVM glusterfs]# lvcreate -L 50MB -T /dev/snap-vg0/thinpool
Rounding up size to full physical extent 52.00 MiB
Logical volume "thinpool" created.
[root@vagrant-testVM glusterfs]# lvcreate -V 200MB -T /dev/snap-vg0/thinpool -n snap-lv0
Logical volume "snap-lv0" created.
[root@vagrant-testVM glusterfs]# mkfs.xfs /dev/snap-vg0/snap-lv0
...
[root@vagrant-testVM glusterfs]# mount /dev/snap-vg0/snap-lv0 /d/backends/xyzzy0
...
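The deck jumps from setup to restore here; presumably the omitted slides recreate the volume on the thin LV and snapshot it. The snapshot restored on the next slide would have been taken with something like this (gluster appends the GMT timestamp to the name by default):

gluster snapshot create snap1 xyzzy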
48. Hands On: Snapshots
# Unmount and stop clone.
# Stop original volume - but leave snapshot activated!
[root@vagrant-testVM glusterfs]# gluster snapshot restore snap1_GMT-2016.11.29-14.57.11
Restore operation will replace the original volume with the snapshotted volume. Do you still want to continue? (y/n) y
Snapshot restore: snap1_GMT-2016.11.29-14.57.11: Snap restored successfully
[root@vagrant-testVM glusterfs]# gluster volume start xyzzy
volume start: xyzzy: success
[root@vagrant-testVM glusterfs]# ls /mnt/glusterfs/0
file1 file2