Architecture of a Next-Generation Parallel File System

Great Wide Open - Day 1
Boyd Wilson - Omnibond
4:30 PM - Emerging


  1. Architecture of a Next-Generation Parallel File System
  2. Agenda
     • Introduction
     • What's in the code now
     • Futures
  3. An Introduction
  4. What is OrangeFS?
     • OrangeFS is a next-generation parallel file system
     • Based on PVFS
     • Distributes file data across multiple file servers, leveraging any block-level file system
     • Distributes metadata across 1 to all storage servers
     • Supports simultaneous access by multiple clients, including Windows, using the PVFS protocol directly
     • Works with standard kernel releases and does not require custom kernel patches
     • Easy to install and maintain
  5. Why a Parallel File System?
     HPC, data intensive — parallel (PVFS) protocol:
     • Large datasets
     • Checkpointing
     • Visualization
     • Video
     • Big Data
     Unstructured data silos — interfaces to match problems:
     • Unify dispersed file systems
     • Simplify storage leveling
     • Multidimensional arrays
     • Typed data
     • Portable formats
  6. Original PVFS Design Goals
     • Scalable
       • Configurable file striping
       • Non-contiguous I/O patterns
       • Eliminates bottlenecks in the I/O path
       • Does not need locks for metadata ops
       • Does not need locks for non-conflicting applications
     • Usability
       • Very easy to install, small VFS kernel driver
       • Modular design for disk, network, etc.
       • Easy to extend -> hundreds of research projects have used it, including dissertations, theses, etc.
  7. OrangeFS Philosophy
     • Focus on a broader set of applications
     • Customer & community focused (>300-member-strong community & growing)
     • Open source
     • Commercially viable
     • Enable research
  8. Configurability • Performance • Consistency • Reliability
  9. System Architecture
     • OrangeFS servers manage objects
       • Objects map to a specific server
       • Objects store data or metadata
       • The request protocol specifies operations on one or more objects
     • OrangeFS object implementation
       • DB for indexing key/value data
       • Local block file system for the data stream of bytes
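The object model on this slide can be sketched in a few lines. This is an illustrative toy, not the OrangeFS implementation: the dict stands in for the key/value indexing DB, and the bytearray stands in for the byte stream kept on the local block file system.

```python
# Toy sketch of an OrangeFS-style storage object (names are invented):
# each object pairs key/value metadata with a stream of bytes.

class StorageObject:
    def __init__(self, handle):
        self.handle = handle
        self.keyvals = {}          # stands in for the key/value DB
        self.stream = bytearray()  # stands in for the on-disk byte stream

    def set_attr(self, key, value):
        self.keyvals[key] = value

    def write(self, offset, data):
        """Write at an arbitrary offset, zero-filling any gap —
        mirrors support for non-contiguous I/O patterns."""
        end = offset + len(data)
        if end > len(self.stream):
            self.stream.extend(b"\x00" * (end - len(self.stream)))
        self.stream[offset:end] = data

obj = StorageObject(handle=42)
obj.set_attr("owner", "boyd")
obj.write(0, b"hello")
obj.write(8, b"world")  # non-contiguous write leaves a zero-filled gap
```

The two writes leave bytes 5–7 zero-filled, the same shape a striped, non-contiguous request would produce on a data server.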
  10. Current Architecture (diagram: client and server components)
  11. Project Timeline: PVFS -> PVFS2 -> OrangeFS
      • 1994–2004: PVFS — design and development at CU (Dr. Ligon) + ANL (CU graduates)
      • 2004–2010: PVFS2 — primary maintenance & development by ANL (CU graduates) + community
      • 2007–2010: new PVFS branch — new development focused on a broader set of problems
      • SC10 (fall 2010): OrangeFS announced with the community; now the mainline of future development as of 2.8.4; support and targeted development services initially offered by Omnibond
      • SC11 (fall 2011): 2.8.5 + Windows client — Windows client, stability, replicate on immutable; improved MD, server-side operations, newer kernels, testing
      • Spring 2012: 2.8.6 + webpack — performance improvements, Direct Lib + cache, stability, WebDAV, S3
      • Winter 2013: 2.8.7 + webpack — performance improvements, stability
      • Spring 2014: 2.8.8 + webpack — performance improvements, stability, shared mmap, multi TCP/IP server homing, Hadoop MapReduce, user lib fixes, new spec file for RPMs + DKMS; available in the AWS Marketplace
      • Summer 2014: 2.9.0 — distributed directory MD, capability-based security
      • 2015: OrangeFS 3.0 — replicated MD and file data, 128-bit UUIDs for file handles, parallel background processes, web-based management UI, self-healing processes, data balancing
  12. In the Code Now
  13. Server-to-Server Communications (2.8.5)
      • Traditional metadata operation: a create request causes the client to communicate with all servers — O(p)
      • Scalable metadata operation: a create request communicates with a single server, which in turn communicates with the other servers using a tree-based protocol — O(log p)
      (diagram: app/middleware/client stacks talking to servers over the network, before and after)
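The O(p) vs. O(log p) contrast on this slide can be checked with a toy round count. This sketch (hypothetical, not OrangeFS code) models the tree-based protocol as a binomial fan-out in which every server that has the request forwards it to one new server each round, so the informed set doubles:

```python
# Why a tree-based create scales: rounds needed to reach p servers.

def naive_rounds(p):
    """Client contacts every server itself: p sequential requests."""
    return p

def tree_rounds(p):
    """Binomial-tree fan-out: the set of informed servers doubles
    each round, so p servers are reached in ceil(log2(p)) rounds."""
    rounds, informed = 0, 1
    while informed < p:
        informed *= 2  # each informed server forwards to one new server
        rounds += 1
    return rounds

# With 1024 metadata servers:
# naive_rounds(1024) -> 1024 client-driven requests
# tree_rounds(1024)  -> 10 forwarding rounds
```

The gap is the whole argument for pushing the fan-out into the servers: the client sends one request either way, but the work finishes in logarithmically many steps.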
  14. Recent Additions (2.8.5)
      • SSD metadata storage
      • Replicate on immutable (file based)
      • Windows client: supports Windows 32/64-bit — Server 2008, R2, Vista, 7
  15. Direct Access Interface (2.8.6)
      • Implements:
        • POSIX system calls
        • Stdio library calls
        • Parallel extensions (noncontiguous I/O, non-blocking I/O)
        • MPI-IO library
      • Found more boundary conditions; fixed in the upcoming 2.8.7
      (diagram: app over Direct lib + PVFS lib vs. app over kernel + client core, over IB/TCP)
  16. Direct Interface Client Caching (2.8.6)
      • The Direct Interface enables multi-process coherent client caching for a single client
      (diagram: client application over the Direct Interface client cache, over the file system)
  17. WebDAV (2.8.6 webpack)
      • Apache serves DAV over the PVFS protocol to OrangeFS
      • Supports the DAV protocol; tested with the Litmus DAV test suite
      • Supports DAV cooperative locking in metadata
  18. S3 (2.8.6 webpack)
      • Apache serves S3 over the PVFS protocol to OrangeFS
      • Tested using the s3cmd client
      • Files accessible via other access methods
      • Containers are directories
      • Accounting pieces not implemented
  19. Summary — Recently Added to OrangeFS
      • In 2.8.3
        • Server-to-server communication
        • SSD metadata storage
        • Replicate on immutable
      • 2.8.4, 2.8.5 (fixes, support for newer kernels)
        • Windows client
      • 2.8.6 — performance, fixes, IB updates
        • Direct access libraries (initial release): preload library for applications, including an optional client cache
        • Webpack: WebDAV (with file locking), S3
  20. OrangeFS on the AWS Marketplace
      • Available on the Amazon AWS Marketplace, brought to you by Omnibond
      (diagram: OrangeFS instance — a unified high-performance file system — backed by EBS volumes and DynamoDB)
  21. In 2.8.8 (Just Released)
  22. Hadoop JNI Interface (2.8.8)
      • OrangeFS Java Native Interface
      • Extension of the Hadoop FileSystem class -> JNI
      • Buffering
      • Distribution
      • Fast PVFS protocol for remote configuration
  23. Additional Items (2.8.8)
      • Updated user lib
      • Shared mmap support in the kernel module
      • Support for kernels up to 3.11
      • Multi-homing servers over IP: clients can access a server over multiple interfaces (say, clients on IPoIB + clients on IPoEthernet + clients on IPoMX)
      • Enterprise installers (coming shortly): client (with DKMS for the kernel module), server, devel
  24. Performance
  25. Scaling Tests
      • 16 storage servers, each with 2 LVM'd 5+1 RAID sets, were tested with up to 32 clients; read performance reached nearly 12 GB/s and write performance nearly 8 GB/s
  26. MapReduce over OrangeFS
      • 8 Dell R720 servers connected with 10 Gb/s Ethernet
      • The remote case adds an additional 8 identical servers and does all OrangeFS work remotely; only local work is done on the compute node (traditional HPC model)
      • ~25% improvement with OrangeFS running remotely
  27. MapReduce over OrangeFS
      • 8 Dell R720 servers connected with 10 Gb/s Ethernet
      • Remote clients are R720s with single SAS disks for local data (vs. 12-disk arrays in the previous test)
  28. SC13 Demo Overview
      • 16 Dell R720 OrangeFS servers; OrangeFS clients on the SC13 floor connected over 100 Gb/s to the I2 Innovation Platform
      • Participants: Clemson, USC, I2, Omnibond
  29. SC13 WAN Performance
      • Multiple concurrent client file creates over the PVFS protocol (nullio)
  30. For 2.9 (Summer 2014)
  31. Distributed Directory Metadata (2.9.0)
      • Directory entries are spread across servers via extensible hashing (diagram: DirEnt1–DirEnt6 distributed across Server0–Server3)
      • State management based on GIGA+ (Garth Gibson, CMU)
      • Improves access times for directories with a very large number of entries
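A minimal sketch of the placement idea above, with the caveat that the real GIGA+-based scheme also handles incremental splitting and growing server sets: hash each directory entry's name and map it onto one of the directory's metadata servers, so no single server holds a huge directory alone. The function and variable names here are invented for illustration.

```python
import hashlib

def dirent_server(name, num_servers):
    """Toy hash-based placement of a directory entry: the same name
    always lands on the same server, and entries spread out evenly."""
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return h % num_servers

# Distribute the six entries from the slide across four servers.
entries = ["DirEnt%d" % i for i in range(1, 7)]
placement = {e: dirent_server(e, 4) for e in entries}
```

Because the mapping is derived from the name alone, any client or server can locate an entry without consulting a central directory server, which is what removes the single-server bottleneck for very large directories.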
  32. Capability-Based Security (2.9.0)
      • A client presents a cert or credential and receives a signed capability; I/O requests then carry the signed capability (OpenSSL PKI)
      • 3 security modes:
        • Basic — OrangeFS/PVFS classic mode
        • Key-based — keys are used to authorize clients for use with the FS
        • User-certificate-based with LDAP — user certs are used for access to the file system and are generated based on LDAP uid/gid info
  33. For v3
  34. Replication / Redundancy (OrangeFS 3.0)
      • Redundant metadata
        • Seamless recovery after a failure
        • Redundant objects from the root directory down
        • Configurable
      • Redundant data
        • Update mode (real time, on close, on immutable, none)
        • Configurable number of replicas
        • Real-time "forked flow" work shows little overhead
        • Replicate on close
        • Replicate to external (like LTFS); looking at supporting an HSM option to external (no local replica)
      • Emphasis on continuous operation
  35. Handles -> UUIDs (OrangeFS 3.0)
      • An OID (object identifier) is a 128-bit UUID that is unique to the data-space
      • An SID (server identifier) is a 128-bit UUID that is unique to each server
      • No more than one copy of a given data-space can exist on any server
      • The (OID, SID) tuple is unique within the file system
      • (OID, SID1), (OID, SID2), (OID, SID3) are copies of the object on different servers
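The handle scheme above maps directly onto Python's standard `uuid` type; the variable names `oid`, `sids`, and `replicas` below are mine, not OrangeFS API:

```python
import uuid

# Sketch of the 3.0 handle scheme: an object's identity is the pair
# (OID, SID); replicas share the OID but differ in SID.

oid = uuid.uuid4()                       # 128-bit object identifier
sids = [uuid.uuid4() for _ in range(3)]  # one 128-bit ID per server

replicas = [(oid, sid) for sid in sids]  # three copies of one object

# Every (OID, SID) tuple is distinct, so each copy can be addressed
# unambiguously, while the shared OID says they are the same object.
```

The point of the tuple is that replication needs no extra naming machinery: moving or copying an object only changes the SID half of its address.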
  36. Server Location / SID Management (OrangeFS 3.0)
      • In an exascale environment with the potential for thousands of I/O servers, it will no longer be feasible for each server to know about all other servers
      • Server discovery:
        • Servers will know a subset of their neighbors at startup (or these may be cached from previous startups) — similar to DNS domains
        • Servers will learn about unknown servers on an as-needed basis and cache them — similar to DNS query mechanisms (root servers, authoritative domain servers)
      • SID cache: an in-memory DB that stores server attributes
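The DNS-like lookup described above can be sketched as cache-then-query. This is a hypothetical illustration (class and field names invented); the real SID cache is an in-memory DB of server attributes, not a plain dict:

```python
# Toy DNS-style resolver: answer from the local cache when possible,
# otherwise ask an upstream directory and remember the answer.

class SIDCache:
    def __init__(self, seed):
        self.known = dict(seed)    # SID -> address, learned at startup
        self.lookups_remote = 0    # how often we had to ask upstream

    def resolve(self, sid, directory):
        if sid not in self.known:  # cache miss: query, then cache
            self.lookups_remote += 1
            self.known[sid] = directory[sid]
        return self.known[sid]

# 'directory' stands in for whatever upstream servers answer queries.
directory = {"sid-%d" % i: "10.0.0.%d" % i for i in range(8)}
cache = SIDCache({"sid-0": "10.0.0.0"})  # knows one neighbor at startup
cache.resolve("sid-5", directory)        # miss: one remote lookup
cache.resolve("sid-5", directory)        # hit: served from the cache
```

As with DNS, each server only pays for the lookups it actually needs, so nobody has to hold a complete map of thousands of peers.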
  37. Policy-Based Location (OrangeFS 3.0)
      • User-defined attributes for servers and clients, stored in the SID cache
      • Policy is used for data location, replication location, and multi-tenant support
      • Completely flexible: rack, row, app, region
  38. Background Parallel Processing Infrastructure (3.0)
      • Modular infrastructure to easily build background parallel processes for the file system
      • Used for:
        • Gathering stats for monitoring
        • Usage calculation (can be leveraged for directory space restrictions, chargebacks)
        • Background safe FSCK processing (can mark bad items in MD)
        • Background checksum comparisons
        • Etc.
  39. Admin REST Interface / Admin UI (3.0)
      • Apache exposes a REST interface over the PVFS protocol to OrangeFS
  40. Data Migration / Management (OrangeFS 3.x)
      • Built on the redundancy and background-process infrastructure
      • Migrate objects between servers
        • De-populate a server going out of service
        • Populate a newly activated server (HW lifecycle)
        • Moving computation to data
        • Hierarchical storage
      • Uses existing metadata services
      • Possible: directory hierarchy cloning — copy on write (dev, QA, prod environments with a high percentage of data overlap)
  41. Hierarchical Data Management (OrangeFS 3.x)
      (diagram: OrangeFS users and HPC systems over OrangeFS metadata, tiered across intermediate storage, archive, NFS, and remote systems — exceed, OSG, Lustre, GPFS, Ceph, Gluster)
  42. Attribute-Based Metadata Search (OrangeFS 3.x)
      • Clients tag files with keys/values
      • Keys/values are indexed on the metadata servers
      • Clients query for files based on keys/values
      • Queries return file handles, with options for filename and path
      (diagram: key/value parallel query alongside file access to data)
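The tag-and-query flow above can be sketched with an in-memory inverted index. This is a toy stand-in for the index kept on the metadata servers, and `tag`/`query` are hypothetical names, not OrangeFS API:

```python
# Inverted index from (key, value) tags to file handles.
index = {}  # (key, value) -> set of file handles

def tag(handle, key, value):
    """Attach a key/value tag to a file handle."""
    index.setdefault((key, value), set()).add(handle)

def query(key, value):
    """Return the handles of all files tagged key=value."""
    return index.get((key, value), set())

tag("handle-001", "project", "climate")
tag("handle-002", "project", "climate")
tag("handle-002", "stage", "raw")

# query("project", "climate") -> {"handle-001", "handle-002"}
# query("stage", "raw")       -> {"handle-002"}
```

Returning handles rather than paths matches the slide: the handle is the stable identity, and the filename/path is an optional extra the query can resolve afterwards.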
  43. Beyond — OrangeFS NEXT
  44. Extended Capability-Based Security
      • Enable certificate-level access (in process)
      • Federated-access capable
      • Can be integrated with rules-based access control
        • Department x in company y can share with department q in company z
        • Rules and roles establish the relationship
        • Each company manages its own control of who is in the company and in the department
  45. SDN — OpenFlow
      • Working with the OpenFlow research team at CU
      • OpenFlow separates the control plane from delivery, giving the ability to control the network with software
      • Looking at bandwidth optimization leveraging OpenFlow and OrangeFS
  46. ParalleX
      • ParalleX is a new parallel execution model
      • Key components:
        • Asynchronous Global Address Space (AGAS)
        • Threads
        • Parcels (message driven instead of message passing)
        • Locality
        • Percolation
        • Synchronization primitives
      • High Performance ParalleX (HPX): a library implementation written in C++
  47. PXFS
      • Parallel I/O for ParalleX, based on PVFS; common themes with OrangeFS Next
      • Primary objective: unification of the ParalleX and storage name spaces
        • Integration of the AGAS and storage metadata subsystems
        • Persistent object model
      • Extends ParalleX with a number of I/O concepts: replication, metadata
      • Extends I/O with ParalleX concepts: moving work to data, local synchronization
      • Effort with LSU, Clemson, and Indiana U. (Walt Ligon, Thomas Sterling)
  48. Community
  49. Johns Hopkins OrangeFS Selection
      • JHU HLTCOE selected OrangeFS after evaluating Ceph, GlusterFS, Lustre, and OrangeFS
      • "Leveraging OrangeFS for the parallel filesystem, the system as a whole is capable of delivering 30GB/s write, 46GB/s read, and between 37,260–237,180 IOPS of performance. The variation in IOPS performance is dependent on the file size and number of bytes written per commit as documented in the Test Results section."*
      • "The final system design represents a 2,775% increase in read performance and a 1,763–11,759% increase in IOPS"*
  50. Learning More
      • Web site
        • Releases
        • Documentation
        • Wiki
      • pvfs2-users@beowulf- — support for users
      • pvfs2-developers@beowulf- — support for developers
  51. Support & Development Services
      • Professional support & development team
      • Buy into the project
  52. Omnibond Info — Solution Areas
      • Intelligent transportation solutions (computer vision)
      • Identity Manager drivers & Sentinel connectors
      • Parallel scale-out storage software
      • Social media interaction system
      • Enterprise and personal
  53. Insert Discussion Here