Learn how UNC’s DICE cyber-infrastructure supports data grids for sharing data, digital libraries for publishing data, archives for preserving data, and data processing pipelines for analyzing data using the integrated Rule-Oriented Data.
1. Data Management at UNC-CH
(DICE Center)
Reagan W. Moore
Arcot Rajasekar
Mike Conway
Antoine de Torcy
Chien-Yi Hou
Leesa Brieger
Jewel Ward
Terrell Russell
Hao Xu
{moore,sekar,mwan}@diceresearch.org
http://irods.diceresearch.org
Wayne Schroeder
Mike Wan
Bing Zhu
Sheau-Yen Chen
Paul Tooby
2. Data Management Challenges
• TUCASI data grid
– Shared collections between UNC / NCSU / Duke
• RENCI data grid
– Regional data grid for distributing data (NC orthophotos)
• NCSU / RENCI cloud
– Couple cloud storage to institutional repository
• NCDC data grid
– Data processing pipeline between NCDC and RENCI
• SILS Lifetime Library
– Personal digital collection
• Carolina Digital Repository
– Institutional archive for data collections
• ITS Archive
– Replication across tape-based storage
• NARA Persistent Archive
– Preservation prototype
3. Common Infrastructure
• The same policy-based data management system is
used to support all of the data management
applications.
– Observe that the primary difference between the
applications is the set of policies and procedures that are
enforced.
– Requires the ability to:
• Turn policies into computer executable rules
• Use rules to control execution of procedures
• Track execution of procedures through stae information
4. 4
4
Policy-based Data Management
integrated Rule Oriented Data System
• Purpose - reason a collection is assembled
• Properties - attributes needed to ensure the purpose
• Policies - controls for enforcing desired properties
• Procedures - functions that implement the policies
• State information - results of applying the procedures
• Assessment criteria - validation that state information conforms to
the desired purpose
• Federation - controlled sharing of logical name spaces
These are the essential elements for policy-based data management
6. Policies
• Data distribution
• Data replication
• Extraction of required metadata
• Integrity verification
• Time-dependent access restrictions
• Versioning
• Deletion control
• Retention/disposition
• Creation of derived data products
7. 7
User w/Client
Can Search, Access, Add and
Manage Data
& Metadata
Access distributed data with Web-based Browser or iRODS GUI or Command Line clients.
Overview of iRODS Architecture
iRODS Data
Server
Disk, Tape, etc.
iRODS
Metadata
Catalog
Track information
iRODS Data System
iRODS Rule
Engine
Tracks Policies
8. Policy-Based Data Management
• Observe that each application uses a different user
interface
– Requires that the access mechanisms be decoupled from
the data management framework
• Observe that each application uses different types of
storage systems
– Requires that the system interoperate with multiple
storage protocols
• Implemented as infrastructure independence
9. Infrastructure IndependenceInfrastructure Independence
Storage SystemStorage System
Storage ProtocolStorage Protocol
Access InterfaceAccess Interface
Policy Enforcement PointsPolicy Enforcement Points
Standard Micro-servicesStandard Micro-services
Map from the actions
requested by the client to
multiple policy enforcement
points.
Map from policy to standard
micro-services.
Map from micro-services to
standard Posix I/O operations.
Map from standard Posix I/O
operations to the protocol
supported by the storage
system
Standard I/O OperationsStandard I/O Operations
DataGrid
11. Lessons Learned (NARA Prototype)
• Scalability / Reliability / Robustness
– Maintain interactivity through appropriate catalog indexing
– Manage network outages at the client level
– Minimize memory used
• Assessment / Audit trails
– Query metadata to validate properties
– Map audit trails to standard events
– Track actions from policy enforcement to procedure execution
• Federation / Extensibility / Ease of Use
– Replicate metadata and records to a deep archive
– Extensible architecture that enables software evolution
– Provide simplest possible mapping from policies to record series
12. Level of Sophistication
• How many different access mechanisms are
needed?
• How many different policy enforcement
points should be checked?
• How many standard functions are needed to
implement preservation procedures?
• How many preservation metadata attributes
are needed to track preservation properties?
• How many preservation policies are needed?
13. Data Grid Clients (35)
API Client Developer
Browser
DCAPE UNC
iExplore DICE-Bing Zhu
JUX IN2P3
Peta Web browser PetaShare
Digital Library
Akubra/iRODS DICE
Dspace MIT
Fedora on Fuse IN2P3
Fedora/iRODS module DICE
Islandora DICE
File System
Davis - Webdav ARCS
Dropbox / iDrop DICE-Mike Conway
FUSE IN2P3, DICE,
FUSE optimization PetaShare
OpenDAP ARCS
PetaFS (Fuse) Petashare - LSU
Petashell (Parrot) PetaShare
Grid
GridFTP - Griffin ARCS
Jsaga IN2P3
Parrot Notre Dame-Doug Thain
Saga KEK
API Client Developer
I/O Libraries
PHP - DICE DICE-Bing Zhu
C API DICE-Mike Wan
C I/O library DICE-Wayne Schroeder
Jargon DICE-Mike Conway
Pyrods - Python SHAMAN-Jerome Fusillier
Portal EnginFrame NICE / RENCI
Preservation
Arch DICE-Mike Conway
Tools
Archive tools-NOAO NOAO
Big Board visualization RENCI
File-format-identifier GA Tech
icommands DICE
Pcommands PetaShare
Resource Monitoring IN2P3
Sync-package Academica Sinica
URSpace Academica Sinica
Web Service
VOSpace NVOA
Shibboleth King's College
Workflows Kepler DICE
Stork LSU
Taverna RENCI
20. Create and Apply Policies
• Maintain a repository of rules in iRODS
– Rules characterized as XML files with input parameters explicitly
characterized as rule parameters
• Build policy template and save in policy repository
– Select which rule sets will be applied
• Map policy template to a record series (collection)
– Add policy parameters as metadata on the record series
• On ingestion of a file into the record series:
– Retrieve policies and parameters from the record series metadata
– Invoke associated rules and apply procedures
• This decouples the choice of access interface from the
execution of your policies
21. iDrop Client - Ingesting a Record
QuickTime™ and a
decompressor
are needed to see this picture.
Data Grid Record Series
22. Arch - Defining a Policy
QuickTime™ and a
decompressor
are needed to see this picture.
23. Policies on a Record Series
QuickTime™ and a
decompressor
are needed to see this picture.
28. Supported Storage Systems
• File systems - Windows, Linux, Mac
• Tape archives - HPSS, Sam-QFS
• Repositories - Flickr, Web sites
• Cloud storage- Amazon S3, EC2
• Relational database - PostgreSQL, Oracle, mySQL
• Under development
– Table driven resources
29. Modular Architecture
• Authentication
– GSI (PKI), Kerberos, Shibboleth, Challenge-response
• Authorization
– Roles, user groups, resource groups, policy constraints, ACLs
• Transport
– TCP/IP (parallel I/O streams), Reliable Blast UDP
• Metadata catalog
– PostgreSQL, mySQL, Oracle
• Distributed rule engine
– Scheduler, messaging system, execution engine, rule base
• Scales to 200 million files, multiple petabytes of data
30. Enforce NSF Data Management Plan
• Policy for selecting data for organization into a collection
• Policy for choice of descriptive metadata
• Policy for data sharing (how long the data are held proprietary)
• Choice of access services provided for the data
• Policy for creation of a reference collection (with data formats
and analysis procedures)
• Policy for retention and disposition
• Policy for commitment for storage support by project /
department / institution
31. 31
iRODS is a "coordinated NSF/OCI-Nat'l Archives research activity" under the
auspices of the President's NITRD Program and is identified as among the priorities
underlying the President's 2011 Budget Supplement in the area of Human and
Computer Interaction Information Management technology research.
Reagan W. Moore
rwmoore@renci.org
http://irods.diceresearch.org
NSF OCI-0848296 “NARA Transcontinental Persistent Archives Prototype”
NSF SDCI-0721400 “Data Grids for Community Driven Applications”