Use Distributed Filesystem as a Storage Tier


Storage is one of the most important parts of a data center, and the complexity of designing, building and delivering an always-on, highly available service continues to increase every year. One of the best answers to these problems is a distributed filesystem (DFS). This talk describes the basic architectures of DFS and compares different free software solutions in order to show what makes a DFS suitable for large-scale distributed environments. We explain how to use and deploy each solution, its advantages and disadvantages, performance and layout. We also introduce case studies of implementations based on OpenAFS, GlusterFS and Hadoop, aimed at building your own cloud storage.



  1. Use Distributed File System as a Storage Tier (Fabrizio Manfred Furuholmen)
  2. Agenda: Introduction; Next Generation Data Center; Distributed File Systems (OpenAFS, GlusterFS, HDFS, Ceph); Case Studies; Conclusion
  3. Class Exam: What do you know about DFS? How can you create a petabyte storage? How can you build a centralized system log? How can you allocate space for your users or systems when you have thousands of them? How can you retrieve data from everywhere?
  4. Introduction. Next Generation Data Center: the "FABRIC". Key categories: continuous data protection and disaster recovery; file and block data migration across heterogeneous environments; server and storage virtualization; encryption for data in-flight and at-rest. In other words: the cloud data center.
  5. Introduction. Storage tier in the "FABRIC": high performance, scalability, simplified management, security, high availability. Solutions: Storage Area Network, Network Attached Storage, distributed file system.
  6. Introduction. What is a distributed file system? "A distributed file system takes advantage of the interconnected nature of the network by storing files on more than one computer in the network and making them accessible to all of them."
  7. Introduction. What do you expect from a distributed file system? Uniform access: globally valid file names. Security: global authentication/authorization. Reliability: elimination of every single point of failure. Availability: administrators can perform routine maintenance while the file server is in operation, without disrupting users' routines. Scalability: handles terabytes of data. Standards conformance: IEEE POSIX file system semantics. Performance: high performance.
  8. Part II: Implementations. How many DFS do you know?
  9. OpenAFS: introduction. OpenAFS is the open source implementation of IBM's Andrew File System. Key ideas: make clients do work whenever possible; cache whenever possible; exploit file usage properties and understand them (one third of Unix files are temporary); minimize system-wide knowledge and change; do not hardwire locations; trust the fewest possible entities; do not trust workstations; batch operations where possible.
  10. OpenAFS: design
  11. OpenAFS: components. Cell: a cell is a collection of file servers and workstations; the directories under /afs are cells, forming a unique tree; a file server contains volumes. Volumes: volumes are "containers", i.e. sets of related files and directories; they have a size limit; there are three types: rw, ro and backup. Mount point: access to a volume is provided through a mount point, which looks just like a static directory. (Diagram: a directory tree whose branches map to volumes on Server A, Server C and Server A+B.)
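To make the cell / volume / mount point relationship concrete, here is a small, purely illustrative Python sketch (not OpenAFS code) in which a path such as /afs/example.com/home/alice is resolved through mount points to a volume hosted on a file server. The cell, volume and server names are invented; in real AFS the cache manager performs this resolution against the volume location database.

```python
# Illustrative model of an AFS-style namespace (not real OpenAFS code).
# A cell groups file servers; volumes live on servers; mount points
# graft volumes into the directory tree under /afs/<cell>/...
from dataclasses import dataclass, field

@dataclass
class Volume:
    name: str          # e.g. "home.alice"
    server: str        # file server hosting the volume
    vol_type: str      # "rw", "ro" or "backup"

@dataclass
class Cell:
    name: str                                    # e.g. "example.com" (hypothetical)
    volumes: dict = field(default_factory=dict)  # volume name -> Volume
    mounts: dict = field(default_factory=dict)   # path inside the cell -> volume name

    def resolve(self, path: str) -> Volume:
        """Find the volume serving `path` via the longest matching mount point."""
        best = max((m for m in self.mounts if path.startswith(m)), key=len)
        return self.volumes[self.mounts[best]]

cell = Cell("example.com")
cell.volumes["root.cell"] = Volume("root.cell", "server-a", "ro")
cell.volumes["home.alice"] = Volume("home.alice", "server-c", "rw")
cell.mounts["/"] = "root.cell"
cell.mounts["/home/alice"] = "home.alice"

vol = cell.resolve("/home/alice/thesis.tex")
print(f"/afs/{cell.name}/home/alice/thesis.tex -> volume {vol.name} on {vol.server}")
```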
  12. OpenAFS: performances (charts comparing read and write throughput, in kB, of OpenAFS and OpenAFS OSD with two servers across block and file sizes)
  13. OpenAFS: features. Uniform name space: same path on all workstations. Security: based on krb4/krb5, extended ACLs, traffic encryption. Reliability: read-only replication, HA database, read/write replicas in the OSD version. Availability: maintenance tasks without stopping the service. Scalability: server aggregation. Administration: delegation of administration. Performance: client-side disk-based persistent cache, high client-per-server ratio.
  14. OpenAFS: who uses it? Morgan Stanley IT: internal usage, 450 TB (ro) + 15 TB (rw), 22,000 clients. Pictage, Inc: online picture albums, 265 TB (planned growth to 425 TB in twelve months), 800,000 volumes, 200,000,000 files. Embian: Internet shared folders, 500 TB, 200 storage servers, 300 application servers. RZH: internal usage, 210 TB.
  15. OpenAFS: good for ... Good: wide area networks; heterogeneous systems; read operations > write operations; large numbers of clients/systems; direct usage by end users; federation. Bad: locking; databases; Unicode; large files; some limitations on ..
  16. GlusterFS. "Gluster can manage data in a single global namespace on commodity hardware." Keys: Lower storage cost: open source software runs on commodity hardware. Scalability: linearly scales to hundreds of petabytes. Performance: no metadata server means no bottlenecks. High availability: data mirroring and real-time self-healing. Virtual storage for virtual servers: simplifies storage and keeps VMs always on. Simplicity: complete web-based management suite.
  17. GlusterFS: design
  18. GlusterFS: components. Volume: the volume is the basic element for data storage; volumes can be stacked for extension. Capabilities: specific options (features) can be enabled on each volume (cache, prefetch, etc.), and custom extensions are simple to create through the API interface. Services: access to a volume is provided through services such as TCP, Unix sockets or InfiniBand. Example volume file:
      volume posix1
        type storage/posix
        option directory /home/export1
      end-volume
      volume brick1
        type features/posix-locks
        option mandatory
        subvolumes posix1
      end-volume
      volume server
        type protocol/server
        option transport-type tcp
        option transport.socket.listen-port 6996
        subvolumes brick1
        option auth.addr.brick1.allow *
      end-volume
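The stacking shown in the volume file above can be pictured as a chain of translators, each wrapping the next. The Python sketch below is only an analogy (GlusterFS translators are native shared objects loaded by the daemon, not Python classes); it mimics the brick1 -> posix1 relationship, with a locking layer wrapping a storage layer, and the export directory is a made-up path.

```python
# Analogy for GlusterFS translator stacking (not real GlusterFS code):
# each layer exposes the same interface and delegates to its subvolume.
import os
import threading

class PosixStorage:
    """Bottom of the stack: maps file operations onto a local directory."""
    def __init__(self, directory):
        self.directory = directory
    def write(self, name, data):
        with open(os.path.join(self.directory, name), "wb") as f:
            f.write(data)

class PosixLocks:
    """Feature translator: serialises writes before passing them down."""
    def __init__(self, subvolume):
        self.subvolume = subvolume
        self.lock = threading.Lock()
    def write(self, name, data):
        with self.lock:                      # crude stand-in for POSIX locking
            self.subvolume.write(name, data)

# Build the stack the same way the volfile does: server -> brick1 -> posix1
export_dir = "/tmp/export1"                  # hypothetical export directory
os.makedirs(export_dir, exist_ok=True)
posix1 = PosixStorage(export_dir)
brick1 = PosixLocks(posix1)
brick1.write("hello.txt", b"stacked translators\n")
```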
  19. GlusterFS: components
  20. GlusterFS: performance
  21. GlusterFS: characteristics. Uniform name space: same path on all workstations. Reliability: RAID-1 (mirrored) replication, asynchronous replication for disaster recovery. Availability: no system downtime for maintenance (better in the next release). Scalability: truly linear scalability. Administration: self-healing, centralized logging and reporting, appliance version. Performance: stripe files across dozens of storage bricks, automatic load balancing, per-volume I/O tuning.
  22. GlusterFS: who uses it? Avail TVN (USA): 400 TB for video on demand and video storage. Fido Film (Sweden): visual FX and animation studio. University of Minnesota (USA): 142 TB, supercomputing. Partners Healthcare (USA): 336 TB, integrated health system. Origo (Switzerland): open source software development and collaboration platform.
  23. GlusterFS: good for ... Good: large amounts of data; access with different protocols; direct access from applications (API layer); disaster recovery (better in the next release); SAN replacement, VM storage. Bad: user-space; low granularity in security settings; high volumes of operations on the same file.
  24. Implementations. Old way: metadata and data in the same place; a single stream per file. New way: multiple streams are parallel channels through which data can flow; files are striped across a set of nodes in order to facilitate parallel access; OSD: separation of file metadata management (MDS) from the storage of file data.
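A rough Python sketch of the "new way": a small metadata map records how each file is striped, while the stripes themselves are spread round-robin across storage nodes. The node names and stripe size are invented for illustration; real systems (HDFS, Ceph, Lustre) each have their own placement and metadata protocols.

```python
# Toy illustration of striping data across nodes with separate metadata.
STRIPE_SIZE = 4                               # bytes per stripe, unrealistically small
NODES = ["node1", "node2", "node3"]           # hypothetical storage nodes

storage = {n: {} for n in NODES}              # node -> {stripe_id: bytes}
metadata = {}                                 # file -> list of (node, stripe_id)

def write_file(name, data):
    placement = []
    for i in range(0, len(data), STRIPE_SIZE):
        node = NODES[(i // STRIPE_SIZE) % len(NODES)]   # round-robin placement
        stripe_id = f"{name}#{i // STRIPE_SIZE}"
        storage[node][stripe_id] = data[i:i + STRIPE_SIZE]
        placement.append((node, stripe_id))
    metadata[name] = placement                # only the metadata server needs this map

def read_file(name):
    # Stripes could be fetched from the nodes in parallel; here we just concatenate.
    return b"".join(storage[node][sid] for node, sid in metadata[name])

write_file("report.txt", b"parallel streams of data")
assert read_file("report.txt") == b"parallel streams of data"
print(metadata["report.txt"])
```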
  25. HDFS: Hadoop. HDFS is part of the Apache Hadoop project, which develops open-source software for reliable, scalable, distributed computing. Hadoop was inspired by Google's MapReduce and the Google File System.
  26. HDFS: Google File System. "Design of a file system for a different environment, where the assumptions of a general purpose file system do not hold; it is interesting to see how new assumptions lead to a different type of system." Key ideas: component failures are the norm; huge files (not just the occasional one); appending rather than overwriting is typical; co-design of application and file system API, i.e. specialization (for example, consistency can be relaxed).
  27. HDFS: MapReduce. "Moving computation is cheaper than moving data." Map: the input is split and mapped into key-value pairs. Combine: for efficiency, the combiner works directly on the map outputs. Reduce: the intermediate files are then merged, sorted and reduced.
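The map / combine / reduce flow can be illustrated with a tiny word-count example. This is plain Python that mimics the phases in a single process; a real Hadoop job would implement Mapper and Reducer classes and let the framework handle splitting, shuffling and sorting across the cluster.

```python
# Word count following the map -> combine -> shuffle/sort -> reduce phases.
from collections import defaultdict
from itertools import groupby

splits = ["moving computation is cheaper",
          "than moving data"]                 # two input splits

def map_phase(split):                         # emit (word, 1) pairs
    return [(word, 1) for word in split.split()]

def combine(pairs):                           # local aggregation per split
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

# Shuffle/sort: merge all combined outputs and group them by key.
intermediate = sorted(pair for s in splits for pair in combine(map_phase(s)))

def reduce_phase(grouped):
    return {word: sum(n for _, n in pairs)
            for word, pairs in ((w, list(g)) for w, g in grouped)}

result = reduce_phase(groupby(intermediate, key=lambda kv: kv[0]))
print(result)  # {'cheaper': 1, 'computation': 1, 'data': 1, 'is': 1, 'moving': 2, 'than': 1}
```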
  28. HDFS: goals. Scalable: can reliably store and process petabytes. Economical: distributes the data and processing across clusters of commonly available computers. Efficient: can process data in parallel on the nodes where the data is located. Reliable: automatically maintains multiple copies of data and automatically redeploys computing tasks on failures.
  29. HDFS: design
  30. HDFS: components. NameNode: an HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. DataNodes: each DataNode manages the storage attached to the system it runs on and applies the map step of MapReduce. Blocks: a file is split into one or more blocks, and these blocks are stored in a set of DataNodes.
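As a client-side illustration, recent Hadoop releases also expose the NameNode over HTTP via WebHDFS, so listings and reads can be scripted without the Java API. The host, port, path and user below are placeholders; WebHDFS must be enabled in the cluster configuration, and the snippet assumes the third-party requests package.

```python
# Minimal WebHDFS directory listing over HTTP (hypothetical host/port/path).
import requests

NAMENODE = "http://namenode.example.org:50070"   # placeholder NameNode address
PATH = "/user/alice"                             # placeholder HDFS path
USER = "alice"                                   # placeholder user name

resp = requests.get(f"{NAMENODE}/webhdfs/v1{PATH}",
                    params={"op": "LISTSTATUS", "user.name": USER},
                    timeout=10)
resp.raise_for_status()

for status in resp.json()["FileStatuses"]["FileStatus"]:
    print(status["type"], status["length"], status["pathSuffix"])

# Reading a file works the same way: op=OPEN on the NameNode, which redirects
# the client to a DataNode holding the blocks (requests follows the redirect).
```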
  31. HDFS: features. Uniform name space: same path on all workstations. Reliability: read/write replication, re-balancing, copies in different locations. Availability: hot deployment. Scalability: server aggregation. Administration: HOD (Hadoop on Demand). Performance: "grid" computation, parallel transfers.
  32. HDFS: who uses it? Major players: Yahoo!, A9.com, AOL, Booz Allen Hamilton, eHarmony, Facebook, Freebase, Fox Interactive Media, IBM, ImageShack, ISI, Joost, Last.fm, LinkedIn, Metaweb, Meebo, Ning, Powerset (now part of Microsoft), Proteus Technologies, The New York Times, Rackspace, Veoh, Twitter, …
  33. HDFS: good for ... Good: task distribution (basic grid infrastructure); distribution of content (high throughput of data access); archiving; heterogeneous environments. Bad: not a general-purpose file system; not POSIX compliant; low granularity in security settings; Java.
  34. Ceph. "Ceph is designed to handle workloads in which tens of thousands of clients or more simultaneously access the same file or write to the same directory: usage scenarios that bring typical enterprise storage systems to their knees." Keys: Seamless scaling: the file system can be seamlessly expanded by simply adding storage nodes (OSDs); unlike most existing file systems, Ceph proactively migrates data onto new devices in order to maintain a balanced distribution of data. Strong reliability and fast recovery: all data is replicated across multiple OSDs; if any OSD fails, data is automatically re-replicated to other devices. Adaptive MDS: the Ceph metadata server (MDS) dynamically adapts its behavior to the current workload.
  35. Ceph: design (diagram): Client, Metadata (MDS) Cluster, Object Storage (OSD) Cluster
  36. Ceph: features. Dynamic distributed metadata: metadata storage, dynamic subtree partitioning, traffic control. Reliable autonomic distributed object storage: data distribution, replication, data safety, failure detection, recovery and cluster updates.
  37. Ceph: features. Pseudo-random data distribution function (CRUSH); reliable object storage service (RADOS); extent and B-tree based object file system (today btrfs).
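The key property of CRUSH is that clients compute where an object lives instead of looking it up in a table. The sketch below conveys that idea with simple rendezvous (highest-random-weight) hashing; this is not the actual CRUSH algorithm, which additionally understands the cluster hierarchy (hosts, racks) and per-device weights.

```python
# Placement by computation: every client derives the same replica set for an
# object name from nothing but the OSD list. (Rendezvous hashing is used here
# for illustration; real CRUSH is more sophisticated.)
import hashlib

OSDS = [f"osd.{i}" for i in range(8)]        # hypothetical object storage daemons
REPLICAS = 3

def score(osd, obj):
    return hashlib.sha1(f"{osd}:{obj}".encode()).hexdigest()

def place(obj, osds=OSDS, replicas=REPLICAS):
    """Return the `replicas` OSDs with the highest hash score for this object."""
    return sorted(osds, key=lambda osd: score(osd, obj), reverse=True)[:replicas]

print(place("rbd_data.1234"))   # same answer on every client, no lookup table
# Removing one OSD only remaps the objects that had a replica on it:
print(place("rbd_data.1234", [o for o in OSDS if o != "osd.3"]))
```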
  38. Ceph: features. Splay replication: only after the data has been safely committed to disk is a final commit notification sent to the client.
  39. Ceph: good for ... Good: scientific applications, high throughput of data access; heavy read/write workloads; it is the most advanced distributed file system. Bad: young (Linux 2.6.34); Linux only; complex.
  40. Others: Lustre, PVFS, MooseFS, CloudStore (Kosmos), pNFS, XtreemFS, Tahoe-LAFS, … (search Wikipedia).
  41. Part III: Case Studies
  42. Class Exam: What can a DFS do for you? How can you create a petabyte storage? How can you build a centralized system log? How can you allocate space for your users or systems when you have thousands of them? How can you retrieve data from everywhere?
  43. File sharing. Problem: share documents across a wide area network; share home folders across different terminal servers. Solution: OpenAFS, Samba. Results: single ID (Kerberos/LDAP), single file system. Usage: 800 users, 15 branch offices, file sharing and /home directories.
  44. Web service. Problem: big storage on a little budget. Solution: Gluster. Results: highly available data storage, low price. Usage: 100 TB image archive, multimedia content for a web site.
  45. Internet disk: myS3. Problems: data accessible from everywhere, disaster recovery. Solution: myS3 on Hadoop / OpenAFS. Results: high availability, access through the HTTP protocol (REST interface), disaster recovery. Usage: user backups, application backend, 200 users, 6 TB.
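The talk does not show myS3's code, so purely to illustrate the "access through HTTP (REST interface)" idea, here is a minimal Python handler that accepts PUT and GET on object keys and writes them into a local directory, which in a setup like the one described could be a mount point backed by OpenAFS or HDFS. The port and storage path are invented.

```python
# Minimal REST-style object store: PUT /key stores a blob, GET /key returns it.
# The storage directory is a stand-in for a distributed-filesystem mount point.
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

STORE = "/tmp/mys3-demo"          # hypothetical path; could be an AFS/HDFS mount
os.makedirs(STORE, exist_ok=True)

class ObjectHandler(BaseHTTPRequestHandler):
    def _path(self):
        return os.path.join(STORE, self.path.strip("/").replace("/", "_"))

    def do_PUT(self):
        length = int(self.headers.get("Content-Length", 0))
        with open(self._path(), "wb") as f:
            f.write(self.rfile.read(length))
        self.send_response(201)
        self.end_headers()

    def do_GET(self):
        try:
            with open(self._path(), "rb") as f:
                data = f.read()
        except FileNotFoundError:
            self.send_response(404)
            self.end_headers()
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(data)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), ObjectHandler).serve_forever()
```

A client can then store and retrieve objects from anywhere with plain HTTP requests, which is what makes the "data from everywhere" requirement easy to meet.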
  46. Log concentrator. Problem: log concentration. Solution: Hadoop cluster, syslog-ng. Results: high availability, fast search, "storage without limits". Usage: security audit and access control.
  47. Private cloud. Problems: low-cost VM storage, VM self-provisioning. Solution: GlusterFS, OpenAFS, custom provisioning. Results: auto-provisioning, low cost, flexible solution. Usage: development and production environments.
  48. Conclusion: problems. Do you have enough bandwidth? Failure: for 10 PB of storage, you will have an average of 22 consumer-grade SATA drives failing per day. Read/write time: each 2 TB drive takes, in the best case, approximately 24,390 seconds to be read and written over the network. Data replication: replication multiplies the number of disk drives, plus the difference.
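These back-of-the-envelope figures are worth recomputing for your own hardware. The sketch below shows the arithmetic with explicitly assumed parameters (replication factor, drive size, annual failure rate, usable throughput); every constant is a placeholder, and the slide's own numbers correspond to a different set of assumptions.

```python
# Back-of-the-envelope sizing for a large DFS; all constants are assumptions.
TARGET_PB = 10                 # usable capacity target, in petabytes
REPLICAS = 3                   # copies of each byte (replication factor)
DRIVE_TB = 2                   # per-drive capacity, TB
ANNUAL_FAILURE_RATE = 0.05     # assumed AFR for consumer-grade SATA drives
NET_MB_S = 80                  # assumed sustained throughput per drive, MB/s

drives = TARGET_PB * 1000 * REPLICAS / DRIVE_TB
failures_per_day = drives * ANNUAL_FAILURE_RATE / 365

# Time to stream one full drive over the network at the assumed rate.
seconds_per_drive = DRIVE_TB * 1_000_000 / NET_MB_S

print(f"{drives:.0f} drives, ~{failures_per_day:.1f} failures/day, "
      f"{seconds_per_drive:.0f} s (~{seconds_per_drive / 3600:.1f} h) to copy one drive")
```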
  49. Conclusion. Environment analysis: there is no truly generic DFS; it is not simple to move 800 TB between different solutions. Dimensioning: start with the right size; the number of servers depends on the speed needed and on the number of clients; plan the network for replication. Divide the system into classes of service: different disk types, different computer types. System management: monitoring tools, system/software deployment tools.
  50. Conclusion: next step
  51. Links. OpenAFS: www.openafs.org, www.beolink.org. Gluster: www.gluster.org. Hadoop: hadoop.apache.org, Isabel Drost et … Ceph: ceph.newdream.n…, publications, mailing list.
  52. I look forward to meeting you… XVII European AFS Meeting 2010, Pilsen, Czech Republic, September 13-15. Who should attend: everyone interested in deploying a globally accessible file system; everyone interested in learning more about real-world usage of Kerberos authentication in single-realm and federated single sign-on environments; everyone who wants to share their knowledge and experience with other members of the AFS and Kerberos communities; everyone who wants to find out the latest developments affecting AFS and Kerberos. More info: http://afs2010.civ.zcu.cz/
  53. Thank you. manfred@zeropiu.com
