Active Data : Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures
The Big Data challenge consists in managing, storing, analyzing and visualizing these huge and ever growing data sets to extract sense and knowledge. As the volume of data grows exponentially, the management of these data becomes more complex in proportion.
A key point is to handle the complexity of the 'Data Life Cycle', i.e. the various operations performed on data: transfer, archiving, replication, deletion, etc. Indeed, data-intensive applications span over a large variety of devices and e-infrastructures which implies that many systems are involved in data management and processing.
''Active Data'' is new approach to automate and improve the expressiveness of data management applications. It consists of
* a 'formal model' for Data Life Cycle, based on Petri Net, that allows to describe and expose data life cycle across heterogeneous systems and infrastructures.
* a 'programming model' allows code execution at each stage of the data life cycle: routines provided by programmers are executed when a set of events (creation, replication, transfer, deletion) happen to any data.
Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures
1. Active Data: Data Life Cycle Management
Across Heterogeneous Systems and
Infrastructures
Anthony Simonet, Gilles Fedak (INRIA)
Matei Ripeanu, Samer Al-Kiswany (UCB)
Kyle Chard, Ian Foster (ANL/UC)
Hot Topics in High-Performance Distributed Computing Workshop
IBM Almaden Research Center
San Jose, California
March 12, 2015
1/13
G. Fedak() Active Data March 12, 2015
2. Big Data ...
Huge and growing volume of information originating from multiple
sources.
!"#$%&'("$#) *+'&,-./"#)0+1)*2+("2() 34(")5-$-)!"$(%"($)
. . . or Big Bottlenecks ?
how to scale the infrastructure ?
end-to-end performance improvement, inter-system optimization.
how to improve productivity of data-intensive scientist ?
data-oriented programming language, data quality, improve
automation and errors recovery
2/13
G. Fedak() Active Data March 12, 2015
3. Data Life Cycle
Definition
Data Life Cycle (DLC) is the course of operational stages through which
data pass from the time when they enter a set of systems to the time
when they leave it.
!"#$%&%'()* +,-.,("-&&%)/* 01(,2/-*
!)234&%&*
!)234&%&*
Challenges :
Expose high level view DLC across distributed systems and
infrastructures
Expose interactions between the infrastructure and the DLC (e.g
failures)
3/13
G. Fedak() Active Data March 12, 2015
4. Active Data
Active Data:
Allow to reason about data sets handled by heterogeneous software
and infrastructures.
A formal model that captures the essential life cycle stages and
properties: creation, deletion, faults, replication, error checking . . .
programming model to develop easily data life cycle management
applications.
Allows legacy systems to expose their intrinsic data life cycle.
4/13
G. Fedak() Active Data March 12, 2015
5. Active Data: Principles & Features
System programmers expose their system’s internal data life cycle with a
model based on Petri Nets.
A Life Cycle Model is made of
Places: data states
Transitions : data operations
•
Created
t1
Written
t2
Read
t3
t4
Terminated
Each token has a unique identifier, corresponding to the actual data
item’s.
5/13
G. Fedak() Active Data March 12, 2015
6. Active Data: Principles & Features
System programmers expose their system’s internal data life cycle with a
model based on Petri Nets.
A Life Cycle Model is made of
Places: data states
Transitions : data operations
Created
t1
•
Written
t2
Read
t3
t4
Terminated
A transition is fired whenever a data state changes.
5/13
G. Fedak() Active Data March 12, 2015
7. Active Data: Principles & Features
System programmers expose their system’s internal data life cycle with a
model based on Petri Nets.
A Life Cycle Model is made of
Places: data states
Transitions : data operations
Created
t1
•
Written
t2
Read
t3
t4
Terminated
public void handler () {
computeMD5 ();
}
Code may be plugged by clients to transitions.
It is executed whenever the transition is fired.
5/13
G. Fedak() Active Data March 12, 2015
8. Active Data Framework
6/13
G. Fedak() Active Data March 12, 2015
Life Cycle View
File transferFile Dataset Metadata
Guard
Code Execution
}Tagged Tokens
Notification
Framework features:
Captures data events in legacy systems
High-level life cycle-centered view of data
Single namespace for all the files,
datasets and metadata
Powerful filters based on Data Tags
Install Taggers on Transitions
Guarded Transitions : only executes on
token which have specific tags.
Publish/subscribe transitions
Custom user reaction to data progress
Custom code execution
Custom notifications (twitter, email,
gdoc, ifttt . . . )
9. Use Case: Advanced Photon Source
Globus Catalog Globus
Detector Local Storage Compute Cluster
1. Local
Transfer
2. Extract
Metadata
3. Globus
Transfer
4. Swift Parallel Analysis
3 to 5 TB of data per week on this detector
Raw data are pre-processed and registered in the Globus Catalog :
Data are curated by several applications
Data are shared amongst scientific user
7/13
G. Fedak() Active Data March 12, 2015
10. Data Surveillance Framework
4 goals (that would otherwise require a lot of scripting and hacking):
Monitoring Data Set Progress
Better Automation
Sharing & Notification
Error Discovery & Recovery
8/13
G. Fedak() Active Data March 12, 2015
11. APS Data Life Cycle Model
Created Start transfer
Terminated
End
Detector
Created
SuccessFailure
SucceededFailed
EndEnd
Terminated
End transfer
Globus transfer
Created
End
Terminated
Start transfer
Shared storage
Created
SuccessFailure
SucceededFailed
EndEnd
Terminated
Start Swift
Globus transfer
CreatedExtract
Update
TerminatedRemove
Globus Catalog
Created
Initialize
Set
End
Failure
Terminated
Derive
Swift
Data life cycle model composed of 6 systems.
9/13
G. Fedak() Active Data March 12, 2015
12. Error Detection & Recovery
10/13
G. Fedak() Active Data March 12, 2015
Example scenario
Recover from system-wide errors: faulty acquired files are detected only
after Swift fails to process them.
In this situation, the user manually:
Drops the whole dataset
Removes any associated file and metadata
Re-acquire the dataset using the same parameters
13. E.D.&R. implementationAvalon Daniel Arnaud Anthony Vincent
Use-case: APS data life cycle model
Created Start transfer
Terminated
End
Detector
Created
SuccessFailure
SucceededFailed
EndEnd
Terminated
End transfer
Globus transfer
Created
End
Terminated
Start transfer
Shared storage
Created
SuccessFailure
SucceededFailed
EndEnd
Terminated
Start Swift
Globus transfer
CreatedExtract
Update
TerminatedRemove
Globus Catalog
Created
Initialize
Set
End
Failure
Terminated
Derive
Swift
Data life cycle model composed of 6 systems.
Avalon June 3rd, 2014 22/30
Active Data Client
Failure
and
Recovery
Handler
remove metadata from the catalog
Filters
data
likely to
fail
Tagger
Active Data Client
Guard
Handler
run the Globus
catalog UI scripts
11/13
G. Fedak() Active Data March 12, 2015
14. Handler Code
TransitionHandler handler = new TransitionHandler () {
public void handler(Transition t, boolean isLocal , Token [] inTokens , Token [] outTokens) {
// Get the dataset identifier
LifeCycle lc = ad. getLifeCycle (inTokens [0]);
datasetId = lc.getTokens("Shared storage.Created")[0]. getUid ();
// Remove the dataset annotations from the catalog
String url = "https :// catalog.globus.org/dataset/" + datasetId;
Runtime r = Runtime.getRuntime ();
Process p = r.exec(" catalog_client .py remove " + url);
p.waitFor ();
// Locally , remove the datasets
String path = "~/aps/" + datasetId;
FileUtils. deleteDirectory (new File(path));
// Publish the " Detector .End"
Token root = lc.getTokens("Detector.Created")[0];
ad. publishTransition ("Detector.End", lc);
// Notify the user
sendEmail("user@server.com", "APS - Corrupted dataset " + datasetId);
}
};
HandlerGuard guard = new HandlerGuard () {
public boolean accept ( Transition t , Token [] inTokens , Token [] outTokens ) {
return input [0]. hasTag(" f a i l u r e c o r r u p t e d ");
}}
ad.subscribeTo("Swift.Failure", handler , guard);
12/13
G. Fedak() Active Data March 12, 2015
15. Conclusion
Active Data
allows to expose Data Life Cycle across heterogeneous systems and
infrastructures
transition-based programming model for DLC management
application
Monitoring, automation, error detection & recovery
X-systems optimizations: incremental computing, data staging,
caching, throttling etc. . .
Perspectives :
Use AD to deploy data management software stack on IaaS (Asma
Ben Cheick, Heithem Abbes, Univ. Tunis)
Big Data Apache stack X-optimization (H. He, CAS, Beijing)
Volunteer & crowd computing (M. Moca, BBU, Romania)
13/13
G. Fedak() Active Data March 12, 2015