Slide 3: Definition
(Deck outline: Introduction, Classification, Lustre, Conclusion)

• A distributed file system allows access to files located on a remote host.
• Access is transparent: the client works as though it were actually on the host.
• Typically, clients do not have access to the underlying block storage; they interact with the server over the network using a protocol.
Anant Narayanan
Distributed File Systems
Cluster and Grid Computing Vrije Universiteit
Slide 4: Why do we need them?
• Distributed applications usually require a common data store.
• A shared store makes it easier to keep data consistent.
• Access control is possible on both the server and the client, depending on how the protocol is designed.
• Allows for the implementation of replication and fault tolerance.
Slide 6: Storage (Record Oriented)
• Were used on mainframes and minicomputers.
• Fetch and put whole records; seeks land on record boundaries.
• Have a lot in common with today's databases.
• Examples: Files-11, Virtual Storage Access Method (VSAM).
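The fixed-length flavor of record-oriented access can be sketched as follows. This is a toy illustration, not the Files-11 or VSAM API; the record size and helper names are assumptions made for the example.

```python
import io

RECORD_SIZE = 32  # hypothetical fixed record length in bytes


def put_record(f, index, payload):
    """Put one whole record; writes always start on a record boundary."""
    assert len(payload) <= RECORD_SIZE
    f.seek(index * RECORD_SIZE)               # seek to a record boundary
    f.write(payload.ljust(RECORD_SIZE, b"\x00"))  # pad to the fixed size


def get_record(f, index):
    """Fetch one whole record; partial reads are not part of the model."""
    f.seek(index * RECORD_SIZE)
    return f.read(RECORD_SIZE).rstrip(b"\x00")


store = io.BytesIO()  # stands in for a record-oriented volume
put_record(store, 0, b"alice")
put_record(store, 1, b"bob")
print(get_record(store, 1))  # b'bob'
```

The database-like feel comes from the unit of I/O being the record, not the byte: the application never addresses arbitrary offsets.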
Slide 7: Storage (Object Oriented)
• Splits file metadata from file data.
• File data is further split into objects.
• Objects are stored on object storage servers.
• May or may not have a block-oriented file system at the lowest layer.
• Examples: Lustre, XtreemFS.
Slide 11: Overview
• Lustre = Linux + Cluster.
• A distributed, parallel, fault-tolerant, object-based file system.
• Scales to tens of thousands of nodes, petabytes of storage capacity, and hundreds of gigabytes per second of throughput.
• Does so without compromising on speed or security.
• Deployed on everything from small workgroup clusters to large-scale, multi-site clusters and supercomputers.
Slide 12: Lustre Clusters
Lustre clusters contain three kinds of systems:
• File system clients, which can be used to access the file system (1-100,000 per cluster)
• Object storage servers (OSS), which provide file I/O service (1-1,000s per cluster)
• Metadata servers (MDS), which manage the names and directories in the file system (a pool of 1-100 clustered MDS servers, typically deployed as an active/standby pair)

Figure 1. Systems in a Lustre cluster. The diagram shows MDS disk storage containing metadata targets (MDT), OSS storage with object storage targets (OST), shared storage enabling OSS failover, routers, and simultaneous support of multiple network types (Elan, Myrinet, InfiniBand, GigE) in front of enterprise-class storage arrays and a SAN fabric.
Slide 13: Architecture (Components)
• File system clients: used to access the file system.
• Object storage servers (OSS): provide file I/O service and deal with block storage.
• Metadata servers (MDS): manage the names and directories in the file system and deal with authentication.
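The division of labor above can be sketched as a toy protocol. All class and method names here are illustrative, not Lustre's actual RPC interface; the point is that the client contacts the MDS once for metadata, then performs all file I/O directly against the OSS nodes.

```python
class MDS:
    """Metadata server: maps path names to the objects backing a file."""

    def __init__(self):
        self.namespace = {}  # path -> list of (oss_id, object_id)

    def create(self, path, layout):
        self.namespace[path] = layout

    def open(self, path):
        return self.namespace[path]  # returns the layout, never file data


class OSS:
    """Object storage server: serves object I/O, never sees path names."""

    def __init__(self):
        self.objects = {}  # object_id -> bytes

    def write(self, obj_id, data):
        self.objects[obj_id] = data

    def read(self, obj_id):
        return self.objects[obj_id]


# wiring: one MDS, two OSS nodes, one file split across them
mds = MDS()
oss = {0: OSS(), 1: OSS()}
mds.create("/data/a.txt", [(0, "obj1"), (1, "obj2")])
oss[0].write("obj1", b"hello ")
oss[1].write("obj2", b"world")

# client side: a single metadata operation, then direct object I/O
layout = mds.open("/data/a.txt")
data = b"".join(oss[sid].read(oid) for sid, oid in layout)
print(data)  # b'hello world'
```

Keeping the MDS out of the data path is what lets aggregate I/O bandwidth scale with the number of OSS nodes.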
Slide 14: Architecture (Characteristics)
The following table (from the Sun Microsystems Lustre documentation) shows the characteristics associated with each of the three types of systems.

Clients
  Typical number of systems: 1-100,000
  Performance: 1 GB/sec I/O, 1,000 metadata ops
  Required attached storage: none
  Desirable hardware characteristics: none

OSS
  Typical number of systems: 1-1,000
  Performance: 500 MB/sec to 2.5 GB/sec
  Required attached storage: file system capacity / OSS count
  Desirable hardware characteristics: good bus bandwidth

MDS
  Typical number of systems: 2 (in the future, 2-100)
  Performance: 3,000-15,000 metadata operations/sec
  Required attached storage: 1-2% of file system capacity
  Desirable hardware characteristics: adequate CPU power, plenty of memory

The storage attached to the servers is partitioned, optionally organized with logical volume management (LVM), and formatted as file systems. The Lustre OSS and MDS servers read, write, and modify data in the format imposed by these file systems.
Slide 15: Architecture (Heterogeneous?)
• MDS and OSS may store actual data on ext3 or ZFS block file systems.
• InfiniBand, TCP/IP over Ethernet, and Myrinet are supported network types.
• Multiple CPU architectures are supported: x86, x86_64, PPC.
• Requires a patched Linux kernel.
Slide 17: Implementation (Networking)
• LNET abstracts over the multiple supported network types.
• Provides the communication infrastructure required by Lustre.
• Takes care of abstracting over fail-over servers and load balancing.
• Provides support for Remote Direct Memory Access (RDMA).
• Delivers an end-to-end throughput of 100 MB/sec on Gigabit Ethernet networks, and up to 1.5 GB/sec on InfiniBand.
Slide 18: Implementation (Where Are the Files?)
Traditional UNIX disk file systems use inodes, which contain lists of block numbers where the file data for the inode is stored. Similarly, one inode exists on the MDT for each file in the Lustre file system. However, in the Lustre file system, the inode on the MDT does not point to data blocks, but instead points to one or more objects associated with the file. This is illustrated in Figure 5.

Figure 5. MDS inodes point to objects; ext3 inodes point to data. The figure contrasts a file on the MDT, whose inode's extended attributes list objects (obj1, obj2, obj3) held on object storage servers (oss1, oss2, oss3), with an ordinary ext3 file, whose inode's block pointers (direct, indirect, and double-indirect) reference data blocks.

These objects are implemented as files on the OST file systems and contain file data. Figure 6 shows how a file open operation transfers the object pointers from the MDS to the client.
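The contrast drawn in Figure 5 can be sketched with two toy inode structures. The field names are illustrative, not the real on-disk ext3 or Lustre formats; the point is what each inode's pointers lead to.

```python
from dataclasses import dataclass, field


@dataclass
class Ext3Inode:
    """An ordinary local-FS inode: block pointers lead straight to data."""
    block_ptrs: list  # block numbers on the local disk


@dataclass
class MDTInode:
    """A Lustre MDT inode: holds no data blocks at all; its extended
    attributes record which objects (on which servers) hold the data."""
    extended_attrs: dict = field(default_factory=dict)


mdt_inode = MDTInode(extended_attrs={
    "objects": [("oss1", "obj1"), ("oss2", "obj2"), ("oss3", "obj3")],
})

# an open hands the object list to the client; no file data moves yet
layout = mdt_inode.extended_attrs["objects"]
print(layout[0])  # ('oss1', 'obj1')
```

This is why an MDS needs only 1-2% of the file system's capacity (slide 14): it stores pointers to objects, never the data itself.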
Slide 19: Implementation (Striping and Replication)
• One object per MDS inode implies "unstriped" data.
• Multiple objects per MDS inode imply that the file has been split across them, similar to RAID 0.
• These stripes may be duplicated across several OSS servers.
• This provides fault tolerance and high availability.
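The RAID 0-style arithmetic behind striping can be sketched as follows. The stripe size and function name are assumptions made for the example; real Lustre layouts carry more detail than this.

```python
def locate(offset, stripe_size, stripe_count):
    """Map a logical file offset to (object index, offset within object),
    striping round-robin over the file's objects, RAID 0-style."""
    stripe_index = offset // stripe_size        # which stripe overall
    obj_index = stripe_index % stripe_count     # round-robin over objects
    obj_offset = (stripe_index // stripe_count) * stripe_size \
        + offset % stripe_size                  # how deep into that object
    return obj_index, obj_offset


# 1 MiB stripes over 4 objects: byte 5 MiB falls in the 6th stripe,
# which lands on object 1, in that object's second stripe slot
MiB = 1 << 20
print(locate(5 * MiB, MiB, 4))  # (1, 1048576)
```

Because consecutive stripes live on different objects (and hence typically different OSS nodes), a large sequential read fans out across servers in parallel.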
Slide 20: Conclusion
• A reliable, scalable, and performant file system.
• Open architecture and protocols.
• BUT:
• Does not handle dynamic addition/removal of servers.
• Does not provide the kind of access control and security that a virtual organization (VO) might need.
Slide 22: References
You may be required to register on the Sun website to access these documents.
• Datasheet: http://www.sun.com/software/products/lustre/datasheet.pdf
• Scalable Cluster Filesystem Whitepaper: http://www.sun.com/offers/docs/LustreFileSystem.pdf
• LNET: http://www.sun.com/offers/docs/lustre_networking.pdf
• Lustre Documentation Index: http://manual.lustre.org/index.php?title=Main_Page
• Lustre Publications Index: http://wiki.lustre.org/index.php?title=Lustre_Publications