Slide 3: Definition
(Deck outline: Introduction, Classification, Lustre, Conclusion)

• A distributed file system allows access to files located on a remote host.
• Access is transparent: the client works as though it were actually on the host.
• Typically, clients do not have access to the underlying block storage; they interact with the server over the network using a protocol.
Anant Narayanan
Distributed File Systems
Cluster and Grid Computing Vrije Universiteit
Slide 4: Why do we need them?
• Distributed applications usually require a common data store.
• A shared store makes it easier to keep data consistent.
• Access control is possible on both the server and the client, depending on how the protocol is designed.
• Allows for the implementation of replication and fault tolerance.
Slide 6: Storage (Record Oriented)
• Were used on mainframes and minicomputers.
• Fetch and put whole records; seeks land on record boundaries.
• Have a lot in common with today's databases.
• Examples: Files-11, Virtual Storage Access Method (VSAM).
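The fixed-length flavor of record-oriented access can be sketched as follows. This is a toy illustration, not the Files-11 or VSAM API; the record size and helper names are assumptions made for the example.

```python
import io

RECORD_SIZE = 32  # hypothetical fixed record length in bytes


def put_record(f, index, payload):
    """Put one whole record; writes always start on a record boundary."""
    assert len(payload) <= RECORD_SIZE
    f.seek(index * RECORD_SIZE)               # seek to a record boundary
    f.write(payload.ljust(RECORD_SIZE, b"\x00"))  # pad to the fixed size


def get_record(f, index):
    """Fetch one whole record; partial reads are not part of the model."""
    f.seek(index * RECORD_SIZE)
    return f.read(RECORD_SIZE).rstrip(b"\x00")


store = io.BytesIO()  # stands in for a record-oriented volume
put_record(store, 0, b"alice")
put_record(store, 1, b"bob")
print(get_record(store, 1))  # b'bob'
```

The database-like feel comes from the unit of I/O being the record, not the byte: the application never addresses arbitrary offsets.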
Slide 7: Storage (Object Oriented)
• Splits file metadata from file data.
• File data is further split into objects.
• Objects are stored on object storage servers.
• May or may not have a block-oriented file system at the lowest layer.
• Examples: Lustre, XtreemFS.
Slide 11: Overview
• Lustre = Linux + Cluster.
• A distributed, parallel, fault-tolerant, object-based file system.
• Scales to tens of thousands of nodes, petabytes of storage capacity, and hundreds of gigabytes per second of throughput.
• Does so without compromising on speed or security.
• Deployed on everything from small workgroup clusters to large-scale, multi-site clusters and supercomputers.
Slide 12: Lustre Clusters
Lustre clusters contain three kinds of systems:
• File system clients, which can be used to access the file system (1-100,000 per cluster)
• Object storage servers (OSS), which provide file I/O service (1-1,000s per cluster)
• Metadata servers (MDS), which manage the names and directories in the file system (a pool of 1-100 clustered MDS servers, typically deployed as an active/standby pair)

Figure 1. Systems in a Lustre cluster. The diagram shows MDS disk storage containing metadata targets (MDT), OSS storage with object storage targets (OST), shared storage enabling OSS failover, routers, and simultaneous support of multiple network types (Elan, Myrinet, InfiniBand, GigE) in front of enterprise-class storage arrays and a SAN fabric.
Slide 13: Architecture (Components)
• File system clients: used to access the file system.
• Object storage servers (OSS): provide file I/O service and deal with block storage.
• Metadata servers (MDS): manage the names and directories in the file system and deal with authentication.
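The division of labor above can be sketched as a toy protocol. All class and method names here are illustrative, not Lustre's actual RPC interface; the point is that the client contacts the MDS once for metadata, then performs all file I/O directly against the OSS nodes.

```python
class MDS:
    """Metadata server: maps path names to the objects backing a file."""

    def __init__(self):
        self.namespace = {}  # path -> list of (oss_id, object_id)

    def create(self, path, layout):
        self.namespace[path] = layout

    def open(self, path):
        return self.namespace[path]  # returns the layout, never file data


class OSS:
    """Object storage server: serves object I/O, never sees path names."""

    def __init__(self):
        self.objects = {}  # object_id -> bytes

    def write(self, obj_id, data):
        self.objects[obj_id] = data

    def read(self, obj_id):
        return self.objects[obj_id]


# wiring: one MDS, two OSS nodes, one file split across them
mds = MDS()
oss = {0: OSS(), 1: OSS()}
mds.create("/data/a.txt", [(0, "obj1"), (1, "obj2")])
oss[0].write("obj1", b"hello ")
oss[1].write("obj2", b"world")

# client side: a single metadata operation, then direct object I/O
layout = mds.open("/data/a.txt")
data = b"".join(oss[sid].read(oid) for sid, oid in layout)
print(data)  # b'hello world'
```

Keeping the MDS out of the data path is what lets aggregate I/O bandwidth scale with the number of OSS nodes.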
Slide 14: Architecture (Characteristics)
The following table (from the Sun Microsystems Lustre documentation) shows the characteristics associated with each of the three types of systems.

Clients
  Typical number of systems: 1-100,000
  Performance: 1 GB/sec I/O, 1,000 metadata ops
  Required attached storage: none
  Desirable hardware characteristics: none

OSS
  Typical number of systems: 1-1,000
  Performance: 500 MB/sec to 2.5 GB/sec
  Required attached storage: file system capacity / OSS count
  Desirable hardware characteristics: good bus bandwidth

MDS
  Typical number of systems: 2 (in the future, 2-100)
  Performance: 3,000-15,000 metadata operations/sec
  Required attached storage: 1-2% of file system capacity
  Desirable hardware characteristics: adequate CPU power, plenty of memory

The storage attached to the servers is partitioned, optionally organized with logical volume management (LVM), and formatted as file systems. The Lustre OSS and MDS servers read, write, and modify data in the format imposed by these file systems.
Slide 15: Architecture (Heterogeneous?)
• MDS and OSS may store actual data on ext3 or ZFS block file systems.
• InfiniBand, TCP/IP over Ethernet, and Myrinet are supported network types.
• Multiple CPU architectures are supported: x86, x86_64, PPC.
• Requires a patched Linux kernel.
Slide 17: Implementation (Networking)
• LNET abstracts over the multiple supported network types.
• Provides the communication infrastructure required by Lustre.
• Takes care of abstracting over fail-over servers and load balancing.
• Provides support for Remote Direct Memory Access (RDMA).
• Delivers an end-to-end throughput of 100 MB/sec on Gigabit Ethernet networks, and up to 1.5 GB/sec on InfiniBand.
Slide 18: Implementation (Where Are the Files?)
Traditional UNIX disk file systems use inodes, which contain lists of block numbers where the file data for the inode is stored. Similarly, one inode exists on the MDT for each file in the Lustre file system. However, in the Lustre file system, the inode on the MDT does not point to data blocks, but instead points to one or more objects associated with the file. This is illustrated in Figure 5.

Figure 5. MDS inodes point to objects; ext3 inodes point to data. The figure contrasts a file on the MDT, whose inode's extended attributes list objects (obj1, obj2, obj3) held on object storage servers (oss1, oss2, oss3), with an ordinary ext3 file, whose inode's block pointers (direct, indirect, and double-indirect) reference data blocks.

These objects are implemented as files on the OST file systems and contain file data. Figure 6 shows how a file open operation transfers the object pointers from the MDS to the client.
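The contrast drawn in Figure 5 can be sketched with two toy inode structures. The field names are illustrative, not the real on-disk ext3 or Lustre formats; the point is what each inode's pointers lead to.

```python
from dataclasses import dataclass, field


@dataclass
class Ext3Inode:
    """An ordinary local-FS inode: block pointers lead straight to data."""
    block_ptrs: list  # block numbers on the local disk


@dataclass
class MDTInode:
    """A Lustre MDT inode: holds no data blocks at all; its extended
    attributes record which objects (on which servers) hold the data."""
    extended_attrs: dict = field(default_factory=dict)


mdt_inode = MDTInode(extended_attrs={
    "objects": [("oss1", "obj1"), ("oss2", "obj2"), ("oss3", "obj3")],
})

# an open hands the object list to the client; no file data moves yet
layout = mdt_inode.extended_attrs["objects"]
print(layout[0])  # ('oss1', 'obj1')
```

This is why an MDS needs only 1-2% of the file system's capacity (slide 14): it stores pointers to objects, never the data itself.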
Slide 19: Implementation (Striping and Replication)
• One object per MDS inode implies "unstriped" data.
• Multiple objects per MDS inode imply that the file has been split across them, similar to RAID 0.
• These stripes may be duplicated across several OSS servers.
• This provides fault tolerance and high availability.
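The RAID 0-style arithmetic behind striping can be sketched as follows. The stripe size and function name are assumptions made for the example; real Lustre layouts carry more detail than this.

```python
def locate(offset, stripe_size, stripe_count):
    """Map a logical file offset to (object index, offset within object),
    striping round-robin over the file's objects, RAID 0-style."""
    stripe_index = offset // stripe_size        # which stripe overall
    obj_index = stripe_index % stripe_count     # round-robin over objects
    obj_offset = (stripe_index // stripe_count) * stripe_size \
        + offset % stripe_size                  # how deep into that object
    return obj_index, obj_offset


# 1 MiB stripes over 4 objects: byte 5 MiB falls in the 6th stripe,
# which lands on object 1, in that object's second stripe slot
MiB = 1 << 20
print(locate(5 * MiB, MiB, 4))  # (1, 1048576)
```

Because consecutive stripes live on different objects (and hence typically different OSS nodes), a large sequential read fans out across servers in parallel.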
Slide 20: Conclusion
• A reliable, scalable, and performant file system.
• Open architecture and protocols.
• BUT:
• Does not handle dynamic addition/removal of servers.
• Does not provide the kind of access control and security that a virtual organization (VO) might need.
Slide 22: References
You may be required to register on the Sun website to access these documents.
• Datasheet: http://www.sun.com/software/products/lustre/datasheet.pdf
• Scalable Cluster Filesystem Whitepaper: http://www.sun.com/offers/docs/LustreFileSystem.pdf
• LNET: http://www.sun.com/offers/docs/lustre_networking.pdf
• Lustre Documentation Index: http://manual.lustre.org/index.php?title=Main_Page
• Lustre Publications Index: http://wiki.lustre.org/index.php?title=Lustre_Publications