The Evolution of Storage on Linux - FrOSCon - 2015-08-22

Linux and Open Source Software have always played a crucial role in data centers to provide storage in various ways. In this talk, Lenz will give an overview of how storage on Linux has evolved over the years, from local file systems to scalable file systems, logical volume managers and cluster file systems to today's modern file systems and distributed, parallel and fault-tolerant file systems.

  1. 1. The Evolution of Storage on Linux Lenz Grimmer FrOSCon 2015, Sankt Augustin 22 August 2015
  2. 2. 2 Agenda  A trip down memory lane (pun intended)  Overview of how storage on Linux has evolved  Local file systems and related concepts/technologies  Network Services  Distributed / Cluster filesystems
  3. 3. 3 Introduction  40+ file systems in /fs/  Focus on the most popular/widely used systems  Primary focus on the software side  High-level Descriptions only
  4. 4. 4 Noteworthy Observations / Conclusions  The role of today  Distribution kernels vs. mainline Linux  Honorable mention: Christoph Hellwig  Don't miss his talk about the Linux Storage Stack tomorrow (14:00, HS6)  Big thanks to: LWN, Thorsten Leemhuis (Heise) and Wikipedia
  5. 5. The early days
  6. 6. 6 MINIX file system  While developing Linux in 1991, Linus required some form of persistent storage  A Minix-compatible file system was the canonical choice:  Well-documented, robust  Exchange data with the host OS (and vice versa)  Severely limited  Max. file/filesystem size: 64MB (16bit block addresses)  14 char file names  Only one time stamp (mtime)
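
The 64 MB limit on the slide follows directly from the 16-bit block addresses. The short calculation below assumes a block size of 1 KiB, which the slide does not state; the variable names are illustrative only.

```python
# Worked arithmetic for the MINIX size limit.
BLOCK_ADDRESS_BITS = 16    # from the slide: 16-bit block addresses
BLOCK_SIZE_BYTES = 1024    # assumption: 1 KiB blocks (not stated on the slide)

max_blocks = 2 ** BLOCK_ADDRESS_BITS          # number of addressable blocks
max_bytes = max_blocks * BLOCK_SIZE_BYTES     # largest addressable volume
max_mib = max_bytes // (1024 * 1024)

print(f"{max_blocks} blocks x {BLOCK_SIZE_BYTES} bytes = {max_mib} MiB")
```
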
  7. 7. 7 Virtual File System Switch (VFS)  Abstraction / indirection layer to route file oriented system calls to necessary functions in the physical filesystem code to do the I/O  Eased the addition of new file systems  Initially written by Chris Provenzano  Integrated into Linux 0.96  Defines a set of functions that every filesystem has to implement  Three kinds of objects: filesystems, inodes, and open files
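
The indirection described on this slide can be pictured roughly as follows. This is a toy model in Python, not the kernel's interface: `FileSystem`, `MemFS` and `VFS` are hypothetical names, and the real VFS defines far more operations than the two shown here.

```python
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Hypothetical set of functions every filesystem has to implement."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...


class MemFS(FileSystem):
    """Trivial in-memory filesystem standing in for ext2, MINIX, etc."""

    def __init__(self):
        self._files = {}

    def read(self, path):
        return self._files[path]

    def write(self, path, data):
        self._files[path] = bytes(data)


class VFS:
    """Routes file-oriented calls to whichever filesystem owns the path."""

    def __init__(self):
        self._mounts = {}

    def mount(self, mount_point, fs):
        self._mounts[mount_point.rstrip("/") or "/"] = fs

    def _resolve(self, path):
        # Pick the longest mount point that is a prefix of the path.
        best = None
        for mp in self._mounts:
            prefix = mp if mp == "/" else mp + "/"
            if path == mp or path.startswith(prefix):
                if best is None or len(mp) > len(best):
                    best = mp
        if best is None:
            raise FileNotFoundError(path)
        rel = path if best == "/" else path[len(best):]
        return self._mounts[best], rel or "/"

    def read(self, path):
        fs, rel = self._resolve(path)
        return fs.read(rel)

    def write(self, path, data):
        fs, rel = self._resolve(path)
        fs.write(rel, data)
```

Adding a new filesystem then only means writing another `FileSystem` subclass and mounting it; callers of `VFS.read` and `VFS.write` do not change.
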
  8. 8. 8 Extended File System (ext)  Designed by Rémy Card  Max. file/filesystem size: 2 GB, max. file name size was 255 chars  Metadata structure inspired by the traditional Unix File System (UFS)  Added to Linux 0.96c in April 1992  Issues remained (bad performance, missing time stamps, fragmentation)
  9. 9. 9 Second Extended File System (ext2)  Also implemented by Rémy Card  Introduced in Linux Kernel 0.99 (January 1993)  Designed with extensibility in mind  Adopted advanced ideas from other file systems (e.g. BSD Fast File System), such as mtime/ctime/atime, file attributes, BSD/SysV semantics, different block sizes, immutable/append-only files  Initially supported file and file system sizes up to 2 TB (limitation of the block device layer)  Kernel version 2.6.17 (March 2006) extended max. file system size to 32 TB (using 8 kB blocks)
  10. 10. 10 FAT/MSDOS  Added to Linux in 1992/1993 by Werner Almesberger  VFAT support was later developed by Gordon Chaffee  VFAT filesystem is compatible with Windows 95/NT long filenames on the FAT filesystem  Initially called xmsdos  Patches for Linux 1.2.x and 1.3.x.  As of Linux 1.3.60, the vfat filesystem is part of the Linux kernel distribution  Mtools as a userland-only alternative
  11. 11. 11 NTFS  NTFS driver for Linux by Martin von Löwis (started around 1996)  Legato Systems sponsored Anton Altaparmakov to further develop NTFS on Linux from June 2001  Read-only mode only, with no fault-tolerance supported  NTFS-TNG replaced old NTFS driver in Linux 2.5.11 (April 29th, 2002)  NTFS-3G (FUSE-based) by Tuxera (read-write support)
  12. 12. The Age of Journaling Filesystems
  13. 13. 13 Fsck vs. Journaling  Unclean unmounts, too many mount counts, or remounts after a long time period triggered file system checks  Disk drives got bigger  A Journaling file system keeps track of changes not yet committed to the file system's main part in a Journal  Keep track of just metadata changes or data as well  Several file systems were developed in parallel, to alleviate this shortcoming of ext2, namely ext3, XFS, JFS and ReiserFS.
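
The idea of recording changes in a journal before they reach the main part of the file system can be sketched as below. This is a deliberately simplified model, not how JBD or any real file system is implemented; `JournaledStore` and its method names are made up for illustration.

```python
class JournaledStore:
    """Toy model of journaling: log a change first, apply it later."""

    def __init__(self):
        self.main = {}      # stands in for the file system's main part
        self.journal = []   # committed transactions not yet applied to main

    def commit(self, changes):
        """Record a transaction in the journal only."""
        self.journal.append(dict(changes))

    def checkpoint(self):
        """Apply all journaled transactions to the main part, then clear."""
        for txn in self.journal:
            self.main.update(txn)
        self.journal.clear()

    def recover(self):
        """After an unclean shutdown, replay the journal.

        Only the journal is read, so recovery time depends on the number
        of pending transactions rather than on the size of the disk.
        """
        self.checkpoint()
```

In this model a crash between `commit` and `checkpoint` loses nothing, because `recover` replays whatever is still in the journal; a full file system check is not needed.
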
  14. 14. 14 Journaling Block Device layer (JBD)  JBD established as a filesystem-independent service, to be used by any file system  First incarnation of JBD developed by Stephen C. Tweedie together with the ext3 file system  OCFS2 and later ext4 also used JBD and its successor JBD2
  15. 15. 15 Third extended filesystem (ext3)  Originally released in September 1999  Written by Stephen Tweedie for the 2.2 branch  Ported to 2.4 kernels by Peter Braam, Andreas Dilger, Andrew Morton, Alexander Viro, Ted Ts'o and Stephen Tweedie  Merged with the mainline Linux kernel 2.4.15 (November 2001)  Basically ext2 with journaling capabilities, easy conversion  Max filesystem size: 8TB, Max 32k subdirs/directory
  16. 16. 16 IBM JFS  Rooted in AIX and OS/2 Warp Server (new design in 1995)  Port to Linux started in December 1999 (Dave Kleikamp, Steve Best)  Uses own journaling implementation (metadata only)  Max volume size: 32 PB, Max file size: 4 PB  Later ported to AIX 5L as JFS2 (April 2001)  JFS 0.0.1 released in Feb. 2000, 0.1.0 (Beta) in August 2000  Version 1.0.0 was released in June 2001  Kernel module since 2.4.18pre9-ac4, Version 1.1.0 was included by Marcelo Tosatti in Linux 2.4.20.
  17. 17. 17 ReiserFS  Supported early on by SuSE, introduced in Linux 2.4.1 (2001)  The first journaling file system to be included in mainline  Max volume size: 16 TB  Based on B+ trees  Metadata-only journaling (block journaling since 2.6.8)  Online resizing  Tail packing block suballocation  Reiser4 still under active development (Edward Shishkin)
  18. 18. 18 SGI XFS  64-bit journaling file system created by Silicon Graphics  SGI IRIX since 1994, GPLed in 2000  Version 1.0 for Linux in May 2001 as Patch against 2.4.2  Merged in 2.6.x and 2.4.25 (Feb 2004)  Steve Lord, Russell Cattelan, Nathan Scott, Jim Mostek  Advanced features, high performance  Max volume size: 16EB
  19. 19. Volume Management
  20. 20. 20 The need for Logical Volume Management  Initially, Linux could only address disks/partitions  Changes to the layout required downtime and shuffling of data  Logical Volume Management abstracts physical disk drives  First incarnation of Linux LVM was introduced in Kernel version 2.4  Heinz Mauelshagen wrote the original LVM code in 1998, inspired by HP-UX's volume manager.
  21. 21. 21 Device Mapper (DM)  A kernel framework for mapping physical block devices onto higher-level virtual block devices  Added in Linux 2.6  Passes data from a virtual block device, which is provided by the device mapper itself, to another block device  Pluggable design  Data can also be modified in transit  Forms the foundation of LVM2/EVMS, RAID and dm-crypt disk encryption and many other useful features
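
The mapping from a virtual block device onto other block devices can be illustrated with a simple linear mapping table. The sketch below is a user-space toy, not kernel code; `LinearTarget`, `map_sector` and the device names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearTarget:
    """One row of a toy mapping table."""
    start: int    # first sector of this range on the virtual device
    length: int   # number of sectors covered by this range
    device: str   # underlying block device (hypothetical name)
    offset: int   # first sector used on the underlying device


def map_sector(table, sector):
    """Translate a virtual sector into (underlying device, sector)."""
    for target in table:
        if target.start <= sector < target.start + target.length:
            return target.device, target.offset + (sector - target.start)
    raise ValueError(f"sector {sector} is not mapped")


# A virtual device that spans two underlying devices.
table = [
    LinearTarget(start=0, length=1000, device="sda1", offset=0),
    LinearTarget(start=1000, length=500, device="sdb1", offset=2048),
]
print(map_sector(table, 1200))
```

A volume manager built on such a table can grow a volume by appending a row, which is one way to picture how spanning several physical devices works.
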
  22. 22. 22 DM Multipath (DM-MPIO)  Consists of kernel components and user-space components  Provides input-output (I/O) fail-over and load-balancing within Linux for block devices  Handles the rerouting of block I/O to an alternate path in the event of a path failure  Can also balance the I/O load across all of the available paths in Fibre Channel (FC) or iSCSI SAN environments  Started as part of a patchset created by Joe Thornber, later maintained by Alasdair G Kergon at Red Hat. Christophe Varoqui maintains the userland multipath tools
  23. 23. 23 DM-Cache  Allows a fast device (e.g. an SSD) to be used as a cache for a slower device (e.g. a rotating disk)  Different policy plugins can be used to change the algorithms used to select which blocks are promoted, demoted, cleaned etc.  Supports writeback and writethrough modes  Requires three physical storage devices to separately store actual data, cache data and required metadata  Joe Thornber, Heinz Mauelshagen and Mike Snitzer  Inclusion into the Linux mainline kernel version 3.9, released on April 28, 2013
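
The difference between the writethrough and writeback modes mentioned on this slide can be sketched as follows. This is a simplified model with made-up names (`CachedDevice`, `flush`); in particular it caches every block it reads, whereas dm-cache leaves promotion decisions to its policy plugins.

```python
class CachedDevice:
    """Toy model of a fast cache in front of a slow origin device."""

    def __init__(self, mode):
        if mode not in ("writethrough", "writeback"):
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        self.origin = {}    # slow device (e.g. a rotating disk)
        self.cache = {}     # fast device (e.g. an SSD)
        self.dirty = set()  # blocks newer in the cache than on the origin

    def write(self, block, data):
        self.cache[block] = data
        if self.mode == "writethrough":
            # The origin is updated as part of the same write.
            self.origin[block] = data
        else:
            # Writeback: the origin is updated later, by flush().
            self.dirty.add(block)

    def read(self, block):
        if block in self.cache:
            return self.cache[block]
        data = self.origin[block]
        self.cache[block] = data  # simplification: always cache on read
        return data

    def flush(self):
        """Copy all dirty blocks from the cache to the origin."""
        for block in sorted(self.dirty):
            self.origin[block] = self.cache[block]
        self.dirty.clear()
```

The trade-off is visible in `write`: writeback returns without touching the slow device, at the cost of dirty blocks existing only on the cache device until they are flushed.
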
  24. 24. 24 LVM2  Based on DM  Flexible storage management  Add/remove disks  Resize/move logical volumes  Move LVs between PVs  Span volumes across multiple physical devices  RAID  Thin provisioning  Cluster Volume Manager
  25. 25. 25 IBM EVMS  IBM-sponsored effort to provide volume management services for Linux  A single, unified system for handling all storage management tasks  Despite many of the features and GUI management tools found in EVMS, LVM2 was preferred  As a result, IBM dropped their kernel driver and reworked their tools to work with LVM2 instead  Development stopped in 2006
  26. 26. Storage Services
  27. 27. 27 NFS  Rick Sladkey was the original author of the NFS client and also ported the NFS server and the RPC library code. Doug Quale helped extend the kernel to support networked file systems  NFS Version 2 since the 1.2 kernel series  Kernel 2.2.18 was a major milestone: mixing Linux NFS with other operating systems' NFS, reliable file locking over NFS, and NFS Version 3.  NFS Versions 2, 3, and 4 are supported on 2.6 and later kernels. Version 4.1 (client) requires at least kernel 2.6.31  NFSv4 for Linux has been under development at CITI and NetApp since 2001
  28. 28. 28 Samba  A free-software re-implementation of the SMB/CIFS networking protocol  Andrew Tridgell started development of Samba in 1992, Jeremy Allison joined early on  Volker Lendecke founded SerNet in 1997, to provide commercial support  Version 3 (2003) provides file and print services for Microsoft Windows clients and can integrate with a Windows NT 4.0 server domain, either as a Primary Domain Controller (PDC) or as a domain member  Samba4 installations can act as an Active Directory domain controller or member server, at Windows 2008 domain and forest functional levels.
  29. 29. 29 SMB vs. CIFS  SMB "server message block" and CIFS "common internet file system" are protocols. CIFS is an extension of the SMB protocol  "smbfs" was an older file system originating from the Samba project, heavily coupled with the Samba tools (smb.conf, smbmount, etc.). Removed in Linux 2.6.27  CIFS VFS was added to mainline Linux kernels in 2.5.42. Supports advanced network file system features such as locking, Unicode (advanced internationalization), hardlinks, dfs (hierarchical, replicated name space), distributed caching and uses native TCP names. All key network functions implemented in kernel
  30. 30. Current Filesystems
  31. 31. 31 Fourth Extended Filesystem (ext4)  Advanced version of ext3, led by Ted Ts'o et al  Incorporated scalability and reliability enhancements for supporting large filesystems up to 1 EB.  First experimental support for ext4 was merged into Linux 2.6.19, which was released on 29 November 2006.  Ext4 was marked as experimental until Linux 2.6.27  Starting with 2.6.28 (December 2008), ext4 was marked as stable  New extent format reduced metadata overhead (RAM, IO for access, transactions)
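
Why extents reduce metadata overhead can be seen by comparing a per-block list with (start, length) ranges. The function below is an illustration only: `blocks_to_extents` is a hypothetical helper, and real ext4 extents carry more fields than the two shown here.

```python
def blocks_to_extents(blocks):
    """Collapse an ordered list of block numbers into (start, length) runs."""
    extents = []
    for block in blocks:
        if extents and extents[-1][0] + extents[-1][1] == block:
            # The block directly follows the previous run: extend it.
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            # Otherwise a new run begins here.
            extents.append((block, 1))
    return extents


# A file stored in 1000 contiguous blocks plus 2 blocks elsewhere.
blocks = list(range(100, 1100)) + [5000, 5001]
extents = blocks_to_extents(blocks)
print(len(blocks), "block entries vs", len(extents), "extents")
```

For a mostly contiguous file, the number of mapping entries drops from one per block to one per contiguous run, which is the saving the slide refers to.
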
  32. 32. 32 Btrfs  Chris Mason (Oracle) in 2007  COW (Snapshots)  Checksums, Compression  RAID, Volume management  Conversion of ext3/4 file systems  Merged into mainline Linux 2.6.29 (March 2009)  Florian Winkler talks about Btrfs today (11:15, HS7)
  33. 33. 33 ZFS  Filesystem and logical volume manager combined  Designed and implemented at Sun Microsystems (Jeff Bonwick, Matthew Ahrens)  Development started in 2001, officially announced in 2004  128-bit, COW, Snapshots, Deduplication, RAID  OpenSolaris (CDDL)  Early port based on FUSE  Kernel modules based on OpenZFS (2013)  Not included in mainline Linux due to license incompatibilities
  34. 34. Network Storage
  35. 35. 35 Network Block Device (NBD)  Remotely access a block device attached to another system  Userspace Server/Client, Client kernel module  Issues arise if network goes down or server crashes  Markus Pargmann talks about NBD on Sunday (16:30, HS6)
  36. 36. 36 Distributed Replicated Block Device (DRBD)  A shared-nothing, synchronously replicated block device  “RAID1 over Network”  Writes to the primary node are transferred to the lower-level block device and simultaneously propagated to the secondary node  The secondary node then transfers data to its corresponding lower-level block device. All read I/O is performed locally  Fail-over capabilities (Secondary/Primary)  Lars Ellenberg and Philipp Reisner originally submitted code in July 2007  DRBD was merged on 8 December 2009 during the "merge window" for Linux kernel version 2.6.33
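
The "RAID1 over Network" behaviour described on this slide can be modelled in a few lines. This is a toy, in-process sketch with hypothetical names (`Node`, `ReplicatedDevice`); it ignores the network, acknowledgements and failure detection that a real replicated block device has to handle.

```python
class Node:
    """A cluster node with its own lower-level block device."""

    def __init__(self, name):
        self.name = name
        self.disk = {}


class ReplicatedDevice:
    """Toy model of a synchronously replicated block device."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def write(self, block, data):
        # The write reaches both nodes before this method returns.
        self.primary.disk[block] = data
        self.secondary.disk[block] = data

    def read(self, block):
        # All reads are served from the local (primary) disk.
        return self.primary.disk[block]

    def failover(self):
        """Swap roles so the former secondary becomes primary."""
        self.primary, self.secondary = self.secondary, self.primary
```

Because every write is on both disks before it completes, the former secondary already holds current data when `failover` promotes it.
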
  37. 37. Cluster Filesystems
  38. 38. 38 OCFS/OCFS2  Shared disk file system by Oracle  Main focus of OCFS was to accommodate Oracle clustered databases, not POSIX-compliant  OCFS2 designed as a Linux filesystem from scratch  On-disk filesystem implementation heavily inspired by ext3, uses JBD for journaling  OCFS2 integrated into version 2.6.16 of mainline Linux  Max Volume/File Size 4PB (currently limited to 16TB)  Trivia question: what feature do OCFS2 and Btrfs have in common?
  39. 39. 39 GFS/GFS2  Shared disk filesystem, allows concurrent access to the same block storage  Development of GFS began in 1995, led by University of Minnesota professor Matthew O'Keefe and a group of students  Originally for SGI IRIX, ported to Linux in 1998  Acquired by Sistina in 2000, turned into proprietary product  OpenGFS fork  Red Hat acquired Sistina in 2003 and released GFS2 under GPL in June 2004  GFS2 and the DLM merged into Linux 2.6.19 (29 November 2006)
  40. 40. 40 Storage Requirements and Challenges  Amount of data to be stored grows exponentially  Today, Storage has to be:  Fault tolerant, reliable  Scalable without limitations or service interruptions  Distributable  Easy to manage / automate  Previous approaches do not address these requirements
  41. 41. Distributed Filesystems
  42. 42. 42 GlusterFS  Aggregates various storage servers over Ethernet or Infiniband RDMA interconnect into one large parallel network file system  Storage bricks export local file systems as volumes  GlusterFS clients create composite virtual volumes from multiple remote servers using stackable "translators"  Translators provide Mirroring, Replication, Striping, etc.  Final volume mounted by client host using its own native protocol via FUSE, or using the NFS v3 protocol (via built-in server translator)  Originally developed by Gluster, Inc., which was acquired by Red Hat in 2011
  43. 43. 43 Ceph  Initially created by Sage Weil, who founded Inktank in 2012  First release in July 2012  Object, block, and file storage from a single distributed computer cluster  Reliable autonomic distributed object store (RADOS)  RADOS Block Device (RBD), Snapshots  RadosGW provides REST API (Amazon S3/OpenStack Swift)  Completely distributed without a single point of failure  Replicates data for fault tolerance (CRUSH)  Ceph client code was merged into mainline Linux version 2.6.34  Red Hat acquired Inktank in April 2014
  44. 44. 44 Lustre  Parallel distributed file system, generally used for large-scale cluster computing  Widely used in TOP500 supercomputers  Max. volume size: 100 PB (production), over 16 EB (theoretical)  Max. file size: 2.5 PB (ext4), 16 EB (ZFS)  Started as a research project in 1999 by Peter Braam at CMU, who founded Cluster Filesystems Inc. in 2001 to work on Intermezzo, Coda and Lustre  First installed in March 2003 on the MCR Linux Cluster (Lawrence Livermore National Laboratory). Lustre 1.0.0 was released in December 2003.  Acquired by Sun Microsystems in 2007  Oracle acquired Sun in 2010 and discontinued the development  Whamcloud->Intel, Open Scalable File Systems, Inc. (OpenSFS), Xyratex Inc.
  45. 45. 45 Shameless plug: openATTIC  Unified Storage: manage XFS, ZFS, Btrfs, NFS, Samba  Modern GUI (AngularJS/Bootstrap)  REST API  Built-in Monitoring  Clustering (Pacemaker/Corosync, DRBD)  Find us in the exhibition hall
  46. 46. 46 PHP DEVELOPER (M/F) with Linux know-how  Are you passionate about development and at home in the open source world? Then we should get to know each other! These tasks await you with us... • Development of our system monitoring tool openITCOCKPIT for frontend and/or backend • Design and implementation of projects as part of a team • Testing of the developed applications • Maintenance and expansion of the existing development and test environment  For more information, see: Wanted: PHP developer (m/f) with Linux know-how
  47. 47. Thank you!