1. CernVM File System Workshop
Steve Traylen, steve.traylen@cern.ch
CERN, IT-PS-PES.
EGEE User Forum
28th March 2012
2. Outline
• CvmFS Description
– Design
– Security
• CvmFS at Sites
– Clients, Squids
• CvmFS at Stratum Ones and Zero
– CvmFS for repository maintainers.
• State of CvmFS within WLCG.
• CvmFS Future Work
Steve Traylen, CERN-IT. 2
3. CvmFS Motivation
• CvmFS is a Network File System
– Designed with software distribution in
mind.
• Lots of small files.
• Additions are not constant but arrive every few hours or days.
• Minimize Distribution Delay
• Files are typically accessed more than once.
– Write in one location
• Repository node.
– Read in 100,000s of locations.
4. CvmFS Design
• Indistinguishable from a real filesystem
– Easy for the end user.
• Security
– File integrity is checked by every client on
every file.
• Standard Protocols and Software
– Uses plain http (not https) everywhere as it is easy to deploy and cache.
– Works with apache httpd, squid, or any other standard web server or proxy.
– Standard Linux fuse at the client.
5. CvmFS Deployed
• Stratum 0
– /cvmfs/repo/MyFile lives in the shadow tree: the one write location.
– cvmfs_sync operates on all new files in the repo, e.g. MyFile, producing hashed, compressed files such as /repo/A345....de43b in the public tree.
– The stratum 0 web server is only ever contacted by stratum 1s.
• Stratum 1s
– Stratum ones copy all new data with “cvmfs_replicate”.
– Each holds a full copy; they are geo-separated and fully redundant.
• Sites
– Site squids hold a partial, on-demand cache in front of the batch workers (Site A, Site B).
6. Day in The Life of a File
• Scenario:
– Repository maintainer wants a file on all
batch workers.
• Steps
– File publication, which happens on the repository maintainer node.
– File retrieval, happens on all batch workers.
7. File Publication
• Maintainer copies or creates file at
stratum 0:
– This is in the eventual correct path, e.g.
• /cvmfs/biomed.example.org/MyFile
• Maintainer tests new file system
– /cvmfs/biomed.example.org
• Maintainer “commits” all new files -
cvmfs_sync
– Files are compressed and renamed to their
sha1.
– An SQLite db gets a new record for MyFile added.
8. File Publication (2)
/cvmfs/biomed.example.org/MyFile
cvmfs_sync produces:
• http://example.org/biomed/12a..edf2 - the actual compressed MyFile.
• http://example.org/biomed/23f..ad22C - the SQLite database, containing one record per file, e.g. | /MyFile | 12a..edf2 |.
• http://example.org/biomed/.cvmfspublished - a pointer to the catalog: a simple text file with the catalog file name, 23f..ad22C.
• The .cvmfspublished file has a TTL of 15 minutes.
• All other files have a TTL of 3 days.
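The publication step above can be sketched in shell. This is illustrative only: file names, paths, and the flat-text catalog stand in for the real cvmfs_sync internals, which differ in detail.

```shell
# Illustrative sketch: content is compressed and stored under its sha1,
# with one catalog record per path. All names here are made up.
rm -rf /tmp/pub && mkdir -p /tmp/pub
printf 'demo payload\n' > /tmp/pub/MyFile.src
hash=$(sha1sum /tmp/pub/MyFile.src | cut -d' ' -f1)   # file named by its sha1
gzip -c /tmp/pub/MyFile.src > "/tmp/pub/$hash"        # stored compressed
printf '/MyFile %s\n' "$hash" > /tmp/pub/catalog.txt  # one catalog record per file
printf 'catalog.txt\n' > /tmp/pub/.cvmfspublished     # pointer to the catalog
```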
9. File Retrieval via fuse.
• CvmFS clients are a plugin to fuse.
– fuse intercepts all filesystem requests,
• e.g. those made by stat, ls, cat, gcc, open, ....
• cvmfs handles all file retrieval and presents file
normally to the application.
– a local area of disk is configured as disk
cache.
10. File Retrieval (2)
• Batch job wants the file
– /cvmfs/biomed.example.org/MyFile
• cvmfs performs the following
– Client downloads .cvmfspublished
• This provides the file name “23f...ad22C” of the sqlite catalog of the required file paths.
– Client downloads sqlite catalog
• This provides the real on disk file name of
‘MyFile’, i.e 12a..edf2
– Client downloads data file ‘12a..edf2’
• fuse presents ‘12a..edf2’ as MyFile to the batch job.
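The three-step lookup above can be simulated locally. All file names and hashes are illustrative, and a flat text file stands in for the sqlite catalog:

```shell
# Local simulation of: .cvmfspublished -> catalog -> data file.
rm -rf /tmp/repo && mkdir -p /tmp/repo
printf 'demo data\n' | gzip -c > /tmp/repo/12aedf2       # compressed data file
printf '/MyFile 12aedf2\n'       > /tmp/repo/23fad22C    # "catalog": path -> hash
printf '23fad22C\n'              > /tmp/repo/.cvmfspublished
cat=$(cat /tmp/repo/.cvmfspublished)                     # 1. locate the catalog
blob=$(awk '$1=="/MyFile" {print $2}' "/tmp/repo/$cat")  # 2. path -> on-disk name
gzip -dc "/tmp/repo/$blob" > /tmp/repo/MyFile            # 3. fetch and decompress
```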
11. What was the Point of That?
• Why bother with all that complication
– Why not serve files as is.
• File system layout in sqlite database.
– Operations like ls, stat, find . -type f are very
quick.
– The data is only downloaded as files are opened.
• De-duplication, e.g MyFile and MySameFile.
– All files are saved with name of their sha1.
– The duplicates are just extra rows in sqlite db.
• No point having two files the same in cache or
downloading same file twice.
– Cache slots never need to be overwritten with new version
of file.
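The de-duplication point can be demonstrated with a short sketch: two identical files map to a single stored object plus two catalog rows (all names illustrative):

```shell
# De-duplication via content addressing: same bytes -> same sha1 -> one object.
rm -rf /tmp/store && mkdir -p /tmp/store/data
printf 'same bytes\n' > /tmp/store/MyFile
cp /tmp/store/MyFile /tmp/store/MySameFile
: > /tmp/store/catalog.txt
for f in MyFile MySameFile; do
  h=$(sha1sum "/tmp/store/$f" | cut -d' ' -f1)
  cp "/tmp/store/$f" "/tmp/store/data/$h"            # identical content, same name
  printf '/%s %s\n' "$f" "$h" >> /tmp/store/catalog.txt  # duplicates = extra rows
done
```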
12. File Security/Integrity
• Main risks
– Files are being delivered via http.
– Files may pass through 3rd party squids, ...
• files from CERN to CERN sometimes go via BNL.
• x509 keys and certs are generated.
– public certificate is delivered in advance to all
sites.
– the release machine signs the first file, .cvmfspublished, at cvmfs_sync time.
• All files opened after this are located by
sha1 name only and the sha1 is verified
for each file.
• This is a simplified version of what actually happens.
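The per-file integrity check can be sketched as follows: a downloaded blob is trusted only if the sha1 of its content matches the name it was requested under (names illustrative):

```shell
# Content named by its sha1; the client recomputes the hash after download.
rm -rf /tmp/sec && mkdir -p /tmp/sec
printf 'payload\n' > /tmp/sec/tmp.blob
name=$(sha1sum /tmp/sec/tmp.blob | cut -d' ' -f1)
mv /tmp/sec/tmp.blob "/tmp/sec/$name"
# what the client does after fetching the blob:
got=$(sha1sum "/tmp/sec/$name" | cut -d' ' -f1)
[ "$got" = "$name" ] && echo OK > /tmp/sec/result    # reject on mismatch
```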
13. CvmFS at Sites - Squid
• CvmFS clients should not connect
directly to stratum one servers.
– A squid or other http proxy should be
installed.
• Can be a squid for a batch farm.
• A university level squid.
• A squid shared with another site.
– Setting up two squids in redundant fashion
is easy.
• Client supports random and/or ordered lists of
squids.
• CvmFS clients are not, however, technically blocked from connecting directly to stratum ones.
14. Squid Setup
• A standard squid from OS vendor is
perfectly good enough. A few
configurations are important.
– maximum_object_size - specifies max file
size to cache.
• default is 4MB , recommended 4GB.
– cache_dir - specifies size of disk cache.
• default is 100MB , recommended 50GB
minimum.
• Both values depend greatly on the total active data volume and on individual file sizes in the repository.
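A minimal squid.conf fragment matching the recommendations above; the directive names are standard squid options, and the values are the slide's suggestions, not universal defaults:

```
# /etc/squid/squid.conf (fragment)
maximum_object_size 4096 MB                    # default 4 MB is far too small
cache_dir ufs /var/spool/squid 51200 16 256    # ~50 GB disk cache
```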
15. Squid Setup (2)
• Site squids are contacted by all batch
workers:
• The following configs are for large clusters.
– max_filedesc - Increase maximum open
sockets.
• Default 1024, increase to 8192
• Verify usage with: squidclient mgr:info
– Maximum number of file descriptors: 8192
Largest file desc currently in use: 2839
Number of file desc currently in use: 2753
– net.ipv4.neigh.default.gc_thresh* - increase kernel ARP table limits.
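The ARP table limits can be raised via sysctl; a possible fragment, with illustrative values to be tuned per site:

```
# /etc/sysctl.conf (fragment) - ARP table sizing for large clusters
net.ipv4.neigh.default.gc_thresh1 = 4096
net.ipv4.neigh.default.gc_thresh2 = 8192
net.ipv4.neigh.default.gc_thresh3 = 16384
```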
16. Squid Setup (3)
• CvmFS clients support a list of squid
servers.
– Random list “SquidA|SquidB”
• One site with two squid servers.
– Ordered list
‘SquidSiteMine;SquidSiteOther’
• One site using its own squid in preference to
another site’s squid server.
• CvmFS clients move to next squid if
files cannot be downloaded correctly.
– Files are always checksummed after download.
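The two list syntaxes can be sketched as a config fragment; the hostnames are placeholders:

```shell
# /etc/cvmfs/default.local (fragment) - proxy list syntax
# "|" separates proxies tried in random order; ";" separates ordered groups.
CVMFS_HTTP_PROXY="http://squid-a.example.org:3128|http://squid-b.example.org:3128"
# Prefer your own squid, falling back to another site's:
# CVMFS_HTTP_PROXY="http://squid.mysite.org:3128;http://squid.othersite.org:3128"
```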
17. Squid and Cache Digests
• Cache digests allow a cluster of squids
to work together.
– A pair (or more) site squids or stratum one
squids can benefit.
• Squids peer with one another.
– i.e. a site with 3 site squid servers will only download each file once; after that, each squid fetches it first from an adjacent squid rather than going to the higher level.
• http://wiki.squid-cache.org/SquidFaq/
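Sibling peering is configured with squid's cache_peer directive; a sketch for one of a pair of site squids (hostname illustrative, 3128/3130 the usual http/ICP ports):

```
# /etc/squid/squid.conf (fragment) - peer with the other site squid as a sibling
cache_peer squid-b.example.org sibling 3128 3130 proxy-only
```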
18. CvmFS at Sites - Client
• Install CvmFS packages via http://cernvm.cern.ch/portal/
filesystem
– Install guide present.
– RHEL 5 and 6 packages, debian has been built
from source.
• Configure either with script (cvmfs_config
setup) or by hand:
– /etc/fuse.conf # Fuse Configuration
• Allow other people to read fuse mount
– /etc/auto.master # AutoFS configuration
• Enable the /etc/auto.cvmfs
• chkconfig cvmfs on && service cvmfs start
• CvmFS clients default to enabling repositories under /cvmfs/.
19. CvmFS Client
• CvmFS uses a default file and override
configuration method.
– /etc/cvmfs/default.conf is in the package
– /etc/cvmfs/default.local is custom
overrides.
• Minimal changes to make:
– Sites should specify a squid service for
their site.
• CVMFS_HTTP_PROXY=http://yoursquid:2138
– Sites should specify an ordered list of stratum ones.
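Putting the minimal changes together, a default.local along these lines; the hostname and repository list are placeholders:

```shell
# /etc/cvmfs/default.local - site-local overrides (placeholder values)
CVMFS_HTTP_PROXY="http://yoursquid:3128"          # your site squid
CVMFS_REPOSITORIES=atlas.cern.ch,lhcb.cern.ch     # repositories to enable
CVMFS_QUOTA_LIMIT=10000                           # cache limit in MB
```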
20. CvmFS Client
• Cache location and size.
– CVMFS_QUOTA_LIMIT=10000 (MB)
– CVMFS_CACHE_BASE=/var/cache/
• Note the cache is exclusive to each
repository.
– A future version of CvmFS will share a
cache across all repositories.
21. CvmFS Client
• Per domain/repository overrides are also
possible:
– /etc/cvmfs/default.conf
• global configuration from package.
– /etc/cvmfs/default.local
• global configuration from site admin.
– /etc/cvmfs/domain.d/example.org.conf
• configuration for *.example.org repos from package
– /etc/cvmfs/domain.d/example.org.local
• configuration for *.example.org repos from site admin
– /etc/cvmfs/config.d/biomed.example.org.conf
• configuration for biomed.example.org from
package.
– /etc/cvmfs/config.d/biomed.example.org.local
• configuration for biomed.example.org from site admin.
22. CvmFS Client
• This rich configuration layout allows for specials per repository. Use cases:
– Repository A requires more cache space
than default.
• Currently 4GB is enough for the LHC VOs, but LHCb requires 6GB.
– Repository B is not supported on all or
different stratum one services.
• Currently ams.cern.ch is only on CERN stratum
one.
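Both use cases map onto per-repository override files; a sketch, where the quota value and the stratum one URL are illustrative:

```shell
# /etc/cvmfs/config.d/lhcb.cern.ch.local - larger cache for one repository
CVMFS_QUOTA_LIMIT=6000

# /etc/cvmfs/config.d/ams.cern.ch.local - repo served only by the CERN stratum 1
CVMFS_SERVER_URL="http://cvmfs-stratum-one.cern.ch/opt/ams"
```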
23. Debugging Clients
• Dump resulting configuration, all those
files make it complicated.
– cvmfs_config showconfig
• Enable lots of verbosity to a log file:
– CVMFS_DEBUGLOG=/tmp/cvmfs.log
• The file grows quickly, so switch it off afterwards.
• Mount outside the auto mounter
– mkdir /tmp/mnt
mount -t cvmfs biomed /tmp/mnt
• Check syslog
– cvmfs dumps a stack trace on crash.
24. Interrogating Clients
• When CvmFS file system is mounted it
can be spoken to via a socket as root,
e.g
– cvmfs-talk -i atlas host info - determine
which stratum one is being used.
• Active host 0: http://cvmfs1.example.ch/opt/
biomed
– The local cache can be inspected.
• What space is pinned or can be purged.
– The active site squid server can be found.
• Are all my hosts using that remote squid server
and not mine?
25. CvmFS at Stratum 1
• The stratum one level provides all the
redundancy for the clients.
• There should be several stratum ones at
different sites.
• WLCG has 5 stratum ones. 2 or 3 (or even one) can easily handle the current load of 70,000 clients, provided site squids are used.
– CERN’s stratum one peaks around 40
megabits.
• Stratum ones update once per hour
from stratum zero.
26. Stratum One Architecture
• Stratum 0 → Stratum 1 backend: the stratum one replicates all files from stratum 0. It uses CvmFS metadata, i.e. the SQLite files, to download only new files.
• Stratum 1 frontends are reverse proxies, i.e. web servers that fetch and cache files from the backend node.
• Sites A, B and C connect to the frontends.
• The number of sites cannot impact replication from stratum 0 to stratum 1.
• Stratum 1 can be scaled up with more
front-ends.
27. Stratum 1 downloads, February 2012
• Spike on 7th February caused by one
batch cluster connecting directly with a
bug.
– More than trebled sum of all other traffic.
– Site contacted, they changed their
configuration.
• Stratum 1 is vulnerable to this, but plenty of capacity is available to absorb such spikes.
28. CvmFS at Stratum Zero
• The stratum 0 is the one write location.
• Typically a stratum zero is made up of
– A large NFS or similar diskspace with two
areas:
• shadow tree /cvmfs/biomed.example.org
– The write version of the repository
• public tree /pub
– The processed tree served via a web server.
– One small virtual machine per repository.
• Each repository must have its own dedicated
node.
• Write access to the repository controlled with
login access to the node.
29. CvmFS Stratum Zero
• Repository maintainer writes files to
– /cvmfs/biomed.example.org
• A log of all file operations is kept.
– This is done with a 3rd party kernel module
- redirfs
• The repository maintainer can now validate the installation and decide whether to publish.
– Provides a window of opportunity to
uncover mistakes, bad software, ....
30. Stratum Zero Advice
• The stratum zero is the point where bad releases may have to be rolled back.
– Once a bad release has been published it will be visible at all sites, possibly rendering your whole infrastructure unusable.
• Within WLCG stratum zero, filesystem
snapshots are in place to allow a
rollback.
– Various mechanisms have been used, e.g
• Netapp, LVM and ZFS snapshots have all been used.
31. Stratum Zero Failure
• The stratum ones continue to serve all
their existing files.
• Clients will not notice in any way that the stratum zero is missing.
• During failure new writes to the
repository can not be made.
32. Stratum Zero Security
• Two x509 key pairs are involved:
– Repository managers key.
• Private key lives on repository manager machine
• It is used to sign the .cvmfspublished file
during a release of biomed.example.org.
• Clients do not trust this signature in advance of
release.
– Stratum Zero managers key.
• Private key lives offline , e.g on crypto card.
• Public certificate is deployed to every single
CvmFS client.
– CvmFS clients trust this service managers key
completely.
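The sign-then-verify flow can be sketched with openssl; the real cvmfs key handling and file formats differ, and all names here are made up:

```shell
# Release machine signs .cvmfspublished; a client verifies with the
# pre-distributed public key (illustrative sketch only).
rm -rf /tmp/cvsig && mkdir -p /tmp/cvsig && cd /tmp/cvsig
openssl genrsa -out release.key 2048 2>/dev/null       # release manager key pair
openssl rsa -in release.key -pubout -out release.pub 2>/dev/null
printf 'catalog 23fad22C\n' > cvmfspublished           # the file to be signed
openssl dgst -sha1 -sign release.key -out sig cvmfspublished
# what a client does with the pre-distributed public key:
openssl dgst -sha1 -verify release.pub -signature sig cvmfspublished
```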
33. Stratum Zero Security(2)
• Once per month a file (.cvmfswhitelist)
is injected into biomed repository by
the Stratum 0 manager.
– The whitelist file is signed by the Stratum 0
manager and contains a list of repository
manager identities.
• The file states to the client:
– Given you trust me please also trust these
release manager machines for the next
month.
• The client checks the whitelist first to establish which release manager keys to trust.
34. ATLAS Comments on CvmFS
• Currently used for
– Software both stable and nightly builds.
– Conditions data
– Around 0.5TB of files are served.
• While CvmFS is recommended for sites
it is not universally used yet.
– Some sites unwilling/unable to install fuse
clients.
• policy, diskless, only nfs space or similar
weirdness.
– To use CvmFS at these sites they require
both:
35. CvmFS Current/Future
• Migration from automake to cmake.
• MacOS client - available but no official
release.
• Shared cache on client between
repositories.
• A cvmfs plugin for parrot, i.e. running in user space.
• Server side to use AUFS for release
changes.
• AUFS = Advanced multilayered unification
filesystem.
36. Support
• A mailing list hosted at http://cern.ch/egroups
– cvmfs-talk@cern.ch
• Bug tracker:
– https://savannah.cern.ch/projects/cernvm/
• Source code migrating now.
– Current Release - cern svn.
– Devel - http://github.com/cvmfs
• Release and documentation:
– http://cernvm.cern.ch/portal/filesystem
37. Conclusions
• CvmFS solves well the problem of file
distribution to 100,000s of clients in a
fast, efficient and secure way.
• CvmFS is mission critical today for
ATLAS, LHCb and shortly CMS.
• It is easy to set up the client so long as
fuse is acceptable.
• The server side has been setup for
other VOs outside WLCG in particular at
SLAC and OSG. INFN and SARA have