2. Scaling Ceph at CERN
Dan van der Ster (daniel.vanderster@cern.ch)
Data and Storage Service Group | CERN IT Department
3. CERN’s Mission and Tools
● CERN studies the fundamental laws of nature
○ Why do particles have mass?
○ What is our universe made of?
○ Why is there no antimatter left?
○ What was matter like right after the “Big Bang”?
○ …
● The Large Hadron Collider (LHC)
○ Built in a 27km long tunnel, ~200m underground
○ Dipole magnets operated at -271°C (1.9K)
○ Particles do ~11’000 turns/sec, 600 million collisions/sec
○ …
● Detectors
○ Four main experiments, each the size of a cathedral
○ DAQ systems processing petabytes/sec
4. Big Data at CERN
Physics Data on CASTOR/EOS
● LHC experiments produce ~10GB/s, 25PB/year
User Data on OpenAFS & DFS
● Home directories for 30k users
● Physics analysis development
● Project spaces for applications
Service Data on AFS/NFS
● Databases, admin applications
Tape archival with CASTOR/TSM
● RAW physics outputs
● Desktop/Server backups
Service Size Files
OpenAFS 290TB 2.3B
CASTOR 89.0PB 325M
EOS 20.1PB 160M
5. IT Evolution at CERN
Cloudifying CERN’s IT infrastructure ...
● Centrally-managed and uniform hardware
○ No more service-specific storage boxes
● OpenStack VMs for most services
○ Building for 100k nodes (mostly for batch processing)
● Attractive desktop storage services
○ Huge demand for a local Dropbox, Google Drive …
● Remote data centre in Budapest
○ More rack space and power, plus disaster recovery
… brings new storage requirements
● Block storage for OpenStack VMs
○ Images and volumes
● Backend storage for existing and new services
○ AFS, NFS, OwnCloud, Data Preservation, ...
● Regional storage
○ Use of our new data centre in Hungary
● Failure tolerance, data checksumming, easy to operate, security, ...
7. 12 racks of disk server quads
(Slide credit: Wiebalck / van der Ster, “Building an organic block storage service at CERN with Ceph”)
8. Our 3PB Ceph Cluster
5 monitors:
○ Dual Intel Xeon L5640 (24 threads incl. HT)
○ Dual 1Gig-E NICs (only one connected)
○ 2x 2TB Hitachi system disks (RAID-1 mirror)
○ 1x 240GB OCZ Deneva 2 SSD for /var/lib/ceph/mon
○ 48GB RAM
47 disk servers / 1128 OSDs:
○ Dual Intel Xeon E5-2650 (32 threads incl. HT)
○ Dual 10Gig-E NICs (only one connected)
○ 24x 3TB Hitachi disks (Eco drive, ~5900 RPM)
○ 3x 2TB Hitachi system disks (triple mirror)
○ 64GB RAM
# df -h /mnt/ceph
Filesystem   Size  Used  Avail  Use%  Mounted on
xxx:6789:/   3.1P  173T   2.9P    6%  /mnt/ceph
9. Use-Cases Being Evaluated
1. Images and Volumes for OpenStack
2. S3 Storage for Data Preservation / Public
Dissemination
3. Physics data storage for archival and/or
analysis
#1 is moving into production. #2 and #3 are more
exploratory at the moment.
10. OpenStack Volumes & Images
• Glance: using RBD for ~3 months now.
• Only issue was to increase ulimit -n above 1024 (10k is good); a sketch is shown at the end of this slide.
• Cinder: testing with close colleagues.
• 126 Cinder Volumes attached today – 56TB used
[Charts: growing # of volumes/images with current usage; usual traffic is ~50-100MB/s (~idle)]
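A minimal sketch of raising the open-files limit for the RBD clients (the file name, user name, and the 10240 value are illustrative examples, not our actual Puppet config):

    # /etc/security/limits.d/91-rbd-clients.conf
    glance   soft  nofile  10240
    glance   hard  nofile  10240
    # or, in the init/wrapper script of the daemon itself:
    ulimit -n 10240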
11. RBD for OpenStack Volumes
• Before general availability, we need to test and enable qemu iops/bps throttling (a possible recipe is sketched at the end of this slide)
• Otherwise VMs with many IOs can disrupt other
users.
• One ongoing issue is that a few clients are
getting an (infrequent) segfault of qemu during
a VM reboot.
• Happens on VMs with many attached RBDs.
• Difficult to get a complete (16GB) core dump.
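One way to get such throttling, sketched under the assumption that Cinder QoS specs (consumer=front-end, enforced by qemu/libvirt on the hypervisor) are available in the OpenStack release in use; all names and limits below are made-up examples:

    cinder qos-create rbd-throttle consumer=front-end \
        read_iops_sec=200 write_iops_sec=200 \
        read_bytes_sec=52428800 write_bytes_sec=52428800
    cinder qos-associate <qos-spec-id> <volume-type-id>
    # equivalent per-VM limits can be set directly with libvirt:
    virsh blkdeviotune <domain> vdb --total-iops-sec 200 --total-bytes-sec 52428800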
12. CASTOR & XRootD/EOS
• Exploring a RADOS backend for these two HEP-developed
file systems
• Gateway model, similar to S3 via RADOSGW
• CASTOR needs raw throughput performance (to feed
many tape drives at 250MBps each).
• Striped RWs across many OSDs are important.
• XRootD/EOS may benefit from the highly scalable
namespace to store O(billion) objects
• Bonus: XRootD also offers http/webdav with X509/kerberos,
possibly even fuse mountable.
• Developments are in early stages.
14. Configuration and Deployment
• Dumpling 0.67.7
• Fully Puppet-ized
• Automated server deployment,
automated OSD replacement
• Very few custom ceph.conf options (shown below)
• Experimenting with the filestore wbthrottle
• We find that disabling it completely gives better IOps performance
• But don’t do this!!!
mon osd down out interval = 900
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 1024
osd pool default pgp num = 1024
osd pool default flag hashpspool = true
osd max backfills = 1
osd recovery max active = 1
16. General Comments…
• In these ~7 months of running the cluster, there have been very
few problems
• No outages
• No data losses/corruptions
• No unfixable performance issues
• Behaves well during stress tests
• But now we’re starting to get real/varied/creative users, and this
brings up many interesting issues...
• “No amount of stress testing can prepare you for real users”
- Unknown
• (point being, don’t take the next slides to be too negative – I’m
just trying to give helpful advice ;)
17. Latency & Slow Requests
• Best latency we can achieve is 20-40ms
• Slow SATA disks, no SSDs: hard to justify SSDs in a multi-PB cluster,
but could in a smaller limited use-case cluster (e.g. for Cinder-only)
• Latency can increase dramatically with heavy usage
• Don’t mix latency-bound and throughput-bound users on the same
OSDs
• Local processes scanning the disks can hurt performance
• Add /var/lib/ceph to the updatedb PRUNEPATH
• If you have slow disks like us, you need to understand your disk IO
scheduler – e.g. deadline prefers reads over writes: writes are given a
5 second deadline vs. 500ms for reads!
• Scrubbing!
• Kernel tuning: vm.* sysctl, dirty page flushing, memory
reclaiming…
• “Something is flushing the buffers, blocking the OSD processes”
• Slow requests: monitor them, eliminate them.
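A sketch of the kinds of tweaks referred to above; the device name and all values are illustrative examples, not our production settings:

    # add /var/lib/ceph to the PRUNEPATHS line in /etc/updatedb.conf
    # (exact PRUNEPATHS syntax depends on the distro's mlocate config)
    sed -i 's|^PRUNEPATHS = "|PRUNEPATHS = "/var/lib/ceph |' /etc/updatedb.conf
    # deadline scheduler: default write_expire is 5000ms vs. 500ms read_expire
    echo 1000 > /sys/block/sdb/queue/iosched/write_expire
    # start flushing dirty pages earlier so writeback comes in smaller bursts
    sysctl -w vm.dirty_background_ratio=5 vm.dirty_ratio=10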
18. Life with 250 million objects
• Recently, a user decided to write 250 million 1kB objects
• Not so unreasonable: 250M * 4MB = 1PB, so this simulates the cluster
being full of RBD images, at least in terms of # objects
• It worked – no big problems from holding this many objects.
• Tested single OSD failure: ~7 hours to backfill, including a
double-backfill glitch that we’re trying to understand.
• But now we want to clean up, and it is not trivial to remove 250M objects!
• rados rmpool generated quite a load when we rm’d a 3 million object
pool (some OSDs were temporarily marked down).
• Probably due to a mistake in our wbthrottle tuning
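For reference, a minimal sketch of the pool removal involved (the pool name is an example); depending on the version, the rados CLI asks for the name twice plus a confirmation flag:

    rados rmpool testpool testpool --yes-i-really-really-mean-it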
19. Other backfilling issues
• During a backfilling event (draining a whole server),
we started observing repeated monitor elections
• Caused by the mons’ LevelDBs being so active that the
local SATA disks couldn’t keep up.
• When a mon falls behind, it calls an election
• Could be due to LevelDB compaction…
• We moved /var/lib/ceph/mon to SSDs – no more
elections during backfilling
• Avoid double backfilling when taking an OSD out of
service:
• Start with ceph osd crush rm <osd id> !! (full removal sequence sketched below)
• If you mark the OSD out first, then crush rm it, you will
compute a new CRUSH map twice, i.e. backfill twice.
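A minimal sketch of that order of operations, assuming a sysvinit-managed OSD and using osd.42 as a made-up example ID:

    ceph osd crush rm osd.42    # remove from CRUSH first: triggers the single backfill
    # wait for the cluster to return to HEALTH_OK, then retire the daemon
    service ceph stop osd.42
    ceph osd rm 42
    ceph auth del osd.42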
20. Fun with CRUSH
• CRUSH is simple yet powerful, so it is tempting to
play with the cluster layout
• But once you have non-zero amounts of data, significant
CRUSH changes will lead to massive data movements,
which create extra disk load and may disrupt users.
• Early CRUSH planning is crucial!
• A network switch is a failure domain, so we should
configure CRUSH to replicate across switches,
right?
• But (assuming we don’t have a private cluster network)
that would send all replication traffic via the switch uplinks
– bottleneck!
• Unclear tradeoff between uptime and performance.
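For illustration only, a CRUSH rule replicating across switches would look roughly like the following decompiled-crushmap snippet, with a "rack" bucket standing in for a switch (rule name, ruleset number, and sizes are placeholders):

    rule rbd_across_switches {
        ruleset 3
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
    }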
21. CRUSH & Data distribution
• CRUSH may give your cluster
an uneven data distribution
• An OSD’s used space will
scale with the number of PGs
assigned to it
• After you have designed your
cluster, created your pools,
started adding data, check the
PG and volume distributions
• reweight-by-utilization
is useful to iron out an uneven
PG distribution
• The hashpspool flag is also
important if you have many
active pools
[Chart: number of OSDs having N PGs (for pool = volumes); axes: nOSDs vs. N PGs]
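A hedged sketch of how the distribution can be inspected and evened out (the 120% threshold is the usual default and just an example here):

    # per-OSD PG counts and fullness (look for outliers)
    ceph pg dump osds
    # reweight OSDs that are more than 20% above the mean utilisation
    ceph osd reweight-by-utilization 120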
22. RBD Reliability with 3 Replicas
• RBD devices are chunked across thousands of objects:
• A full 1TB volume is composed of 250,000 4MB objects
• If any single object is lost, the whole RBD can be considered to be corrupted
(obviously, it depends which blocks are lost!)
• If you lose an entire PG, you can consider all RBDs to be lost / corrupted.
• Our incorrect & irrational fears:
• Any simultaneous triple disk failure in the cluster would lead to objects being
lost – and somehow all RBDs would be corrupted.
• As we add OSDs to the cluster, the data gets spread wider, and the chances of
RBD data loss increase.
• But this is wrong!!
• The only triple disk failures that can lead to data loss are those combinations
actively used by PGs – so having e.g. 4096 PGs for RBDs means that only
4096 combinations out of the 10^9 possible combinations matter.
• ≈ N_PGs × (P_diskfailure)^3 / 3!
• We use 4 replicas for the RBD volumes, but this is probably overkill.
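Purely illustrative numbers for the estimate above: with 4096 PGs and an assumed probability of 10^-4 that a given disk fails within a recovery window, the estimate is ≈ 4096 × (10^-4)^3 / 3! ≈ 7 × 10^-10; the count that matters is the number of PGs, not the raw number of triple-disk combinations in the cluster.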
23. Trust your clients
• There is no server-side per-client throttling
• A few nasty clients can overwhelm an OSD, leading to slow requests
for everyone.
• When you have a high load / slow requests, it is not always
trivial to identify and blacklist/firewall the misbehaving client
• Could use some help in the monitoring: per-client perf stats?
• One of our creative users found a way to make the mons generate 5×40 MBps of outbound network traffic
• Could saturate the mon network, lead to disruptions
• RADOS is not for end-users. A cephx keyring is for trusted
persons only, not for Joe Random User.
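When a keyring does have to be handed out, the cephx caps can at least be scoped to a single pool; a minimal sketch with made-up client and pool names:

    ceph auth get-or-create client.volumes \
        mon 'allow r' \
        osd 'allow rwx pool=volumes' \
        -o /etc/ceph/ceph.client.volumes.keyring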
24. Fat fingers
• A healthy cluster is always vulnerable to human errors
• We’ve thus far avoided any big mistakes
• Used PG splitting to grow a pool from 8 to 2048 PGs
• Leads to unresponsive OSDs which get marked down → degraded objects
• Safer (and now enforced) to grow in 2x or 4x steps; see the sketch at the end of this slide
• ulimits, ulimits, ulimits
• With a large number of OSDs (say, more than 500), you will hit num
file and num process limits everywhere:
• Glance, qemu, radosgw, ceph/rados CLI, …
• If you use XFS, don’t put your OSD journal as a file on the disk
• Use a separate partition, the first partition!
• We still need to reinstall our whole cluster to re-partition the OSDs
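A hedged sketch of the step-wise PG growth (pool name and numbers are examples); pgp_num has to follow pg_num before data actually rebalances:

    ceph osd pool set volumes pg_num 16
    ceph osd pool set volumes pgp_num 16
    # wait for the cluster to settle, then repeat with 32, 64, ... up to the target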
25. Scale up and out
• Scale up: we are demonstrating the viability of a
3PB cluster with O(1000) OSDs.
• What about 10,000 or 100,000 OSDs?
• What about 10,000 or 100,000 clients?
• Running many Ceph instances is always an option, but not ideal
• Scale out: our growing data centre in Budapest
brings many options:
• Replicate over the WAN (though, 30ms RTT)
• Tiering / Caching pools (new feature, need to get
experience…)
• Data locality – direct IOs to nearby replica or caching pool
27. Summary
• CERN IT infrastructure is undergoing a private
cloud revolution, and Ceph is providing the
underlying storage.
• Our CASTOR and XRootD physics data use-
cases may exploit RADOS for improved
performance/scalability.
• In seven months with a 3PB cluster, we’ve not
had any disasters. Actually it’s working quite
well.
• Presented some lessons learned; I hope they prove useful in your Ceph explorations.