Meandre: Semantic-Driven Data-Intensive Flows in the Clouds
1. Meandre: Semantic-Driven Data-Intensive Flows in the Clouds
Xavier Llorà
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
xllora@illinois.edu
The SEASR project and its Meandre infrastructure
are sponsored by The Andrew W. Mellon Foundation
3. SEASR: Design Goals
• Transparency
– From a single laptop to an HPC cluster
– Not bound to a particular computation fabric
– Allow heterogeneous development
• Intuitive programming paradigm
– Modular Components assembled into Flows
– Foster Collaboration and Sharing
• Open Source
• Service-Oriented Architecture (SOA)
4. Meandre: Infrastructure
• SEASR/Meandre Infrastructure:
– Dataflow execution paradigm
– Semantic-web driven
– Web oriented
– Supports publishing services
– Promotes reuse, sharing, and collaboration
5. Meandre: Data Driven Execution
• Execution Paradigms
– Conventional programs perform computational tasks by
executing a sequence of instructions.
– Data-driven execution applies transformation
operations to a flow or stream of data as soon as
the data is available.
6. Meandre: Dataflow Example
[Figure: two inputs, Value1 and Value2, and one output, Sum]
7. Meandre: Dataflow Example
• Dataflow Addition Example
– Logical operation ‘+’
– Requires two inputs (Value1, Value2)
– Produces one output (Sum)
• When two inputs are available
– Logical operation can be performed
– Sum is output
• When output is produced
– Reset internal values
– Wait for two new input values to become available
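The addition example can be sketched in a few lines of Python. This is an illustrative toy, not Meandre's component API: the class name `AddComponent`, the port names `value1` and `value2`, and the `push` method are all assumptions made for the sketch.

```python
class AddComponent:
    """Toy dataflow component: fires once both inputs are present."""

    PORTS = ("value1", "value2")  # hypothetical port names

    def __init__(self):
        self._pending = {}  # values received since the last firing
        self.outputs = []   # every Sum produced so far

    def push(self, port, value):
        """Deliver a value to an input port; fire if both ports are filled."""
        if port not in self.PORTS:
            raise ValueError(f"unknown port: {port}")
        # A second value on the same port overwrites the first one.
        self._pending[port] = value
        if len(self._pending) == len(self.PORTS):
            self._fire()

    def _fire(self):
        # Perform the logical operation '+', emit Sum, then reset.
        self.outputs.append(self._pending["value1"] + self._pending["value2"])
        self._pending.clear()


adder = AddComponent()
adder.push("value1", 2)   # only one input available: nothing fires
adder.push("value2", 3)   # both inputs available: fires
adder.push("value2", 10)  # arrival order does not matter
adder.push("value1", 5)
print(adder.outputs)
```

Nothing in the caller sequences the addition; the arrival of the second value is what triggers it.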
8. Meandre: The Dataflow Component
• Data dictates component execution semantics
[Figure: a component (P) with inputs and outputs, annotated "Descriptor in RDF of its behavior" and "The component implementation"]
9. Meandre: Data Driven Execution
• Dataflow Approach
– May have zero to many inputs
– May have zero to many outputs
– Performs a logical operation when data is available
• The component defines its firing policy
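One way to picture a firing policy is as a predicate over the input ports that have received data. The sketch below is an assumption-laden illustration, not the Meandre engine: the `Component` class and the policy functions `fire_on_all` and `fire_on_any` are hypothetical names chosen for this example.

```python
def fire_on_all(ports, pending):
    """Policy: fire only when every input port holds a value."""
    return all(port in pending for port in ports)


def fire_on_any(ports, pending):
    """Policy: fire as soon as any input port holds a value."""
    return any(port in pending for port in ports)


class Component:
    """Toy component whose execution is dictated by arriving data."""

    def __init__(self, ports, operation, policy):
        self.ports = tuple(ports)
        self.operation = operation  # callable taking a dict of port -> value
        self.policy = policy        # decides when the operation runs
        self.pending = {}
        self.outputs = []

    def push(self, port, value):
        self.pending[port] = value
        if self.policy(self.ports, self.pending):
            self.outputs.append(self.operation(dict(self.pending)))
            self.pending.clear()


joiner = Component(["left", "right"], lambda d: d["left"] + d["right"], fire_on_all)
logger = Component(["left", "right"], lambda d: sorted(d.items()), fire_on_any)
for component in (joiner, logger):
    component.push("left", "a")
    component.push("right", "b")

print(joiner.outputs)
print(logger.outputs)
```

The same two pushes make `joiner` fire once and `logger` fire twice, which is the sense in which the component, not the caller, decides when it runs.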
10. Meandre: Component Metadata
• Describes a component
• Separates:
– Component semantics (black box)
– Component implementation (Java, Python, Lisp)
• Provides a unified framework:
– Basic building blocks or units (components)
– Complex tasks (flows)
– Standardized metadata
11. Meandre: Semantic Web Concepts
• Relies on the Resource Description Framework (RDF)
• Provides a common framework to share and reuse data
across application, enterprise, and community boundaries
• Focuses on common formats for integration and combination
of data drawn from diverse sources
• Pays special attention to the language used for recording how
the data relates to real world objects
• Allows navigation to sets of data resources that are
semantically connected.
12. Meandre: Metadata Ontologies
• Meandre's metadata relies on three ontologies:
– The RDF ontology serves as a base for defining
Meandre descriptors
– The Dublin Core Elements ontology provides basic
publishing and descriptive capabilities in the description
of Meandre descriptors
– The Meandre ontology describes a set of relationships
that model valid components, as understood by the
Meandre execution engine architecture
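To make the layering of the three ontologies concrete, here is a rough sketch that builds a component descriptor as plain subject-predicate-object triples and prints them as N-Triples. The RDF and Dublin Core Elements namespaces are the standard ones. The `MEANDRE` namespace and its terms (`ExecutableComponent`, `input`, `output`) are placeholders invented for this sketch and are not the real Meandre ontology; the component URI and port names are borrowed from the ZigZag example in this deck.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
# Placeholder namespace and term names: NOT the real Meandre ontology.
MEANDRE = "http://example.org/meandre-ontology/"


class Literal(str):
    """Marks a triple object as a plain string literal rather than a URI."""


component = "http://test.org/component/concatenate-strings"

triples = [
    (component, RDF + "type", MEANDRE + "ExecutableComponent"),
    (component, DC + "title", Literal("Concatenate strings")),
    (component, DC + "description", Literal("Joins two input strings")),
    (component, MEANDRE + "input", Literal("string_one")),
    (component, MEANDRE + "input", Literal("string_two")),
    (component, MEANDRE + "output", Literal("concatenated_string")),
]


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples as N-Triples lines."""
    lines = []
    for subject, predicate, obj in triples:
        # No escaping is done, so literals must not contain quotes or backslashes.
        rendered = f'"{obj}"' if isinstance(obj, Literal) else f"<{obj}>"
        lines.append(f"<{subject}> <{predicate}> {rendered} .")
    return "\n".join(lines)


print(to_ntriples(triples))
```

RDF supplies the typing statement, Dublin Core the descriptive fields, and the domain ontology the component-specific relationships.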
13. Meandre: The Dataflow Component
• Data dictates component execution semantics
[Figure: a component (P) with inputs and outputs, annotated "Descriptor in RDF of its behavior" and "The component implementation"]
14. Meandre: Component Types
• Components are the basic building block of any
computational task.
• There are two kinds of Meandre components:
– Executable components
• Perform computational tasks that require no human
interactions during runtime
• Processes are initialized during flow startup and are fired
in accordance with the policies defined for the component.
– Control components
• Used to pause dataflow during user interaction cycles
• WebUI may be an HTML form, applet, or other user interface
15. Wrapping With Components
• Component provides inputs, outputs, properties
• Your code
– Inside
– Call from
– A WS front end
– Interactive application
– Request/response cycles
16. Meandre: Flow (Complex Tasks)
• A flow is a collection of connected components
[Figure: dataflow execution of a flow whose components are labeled Read, Merge, Show, Get, and Do, each marked P]
17. Meandre: Programming Paradigm
• The programming paradigm creates complex
tasks by linking together specialized
components. Meandre's publishing mechanism
allows components developed by third parties to be
assembled into a new flow.
• There are two ways to develop flows:
– Meandre’s Workbench visual programming tool
– Meandre’s ZigZag scripting language
18. Meandre: Workbench Existing Flow
[Screenshot: the Workbench with an existing flow, showing panels labeled Components, Flows, and Locations]
19. Meandre: ZigZag Script Language
• ZigZag is a simple language for describing
data-intensive flows
– Modeled on Python for simplicity.
– ZigZag is a declarative language for expressing the
directed graphs that describe flows.
• Command-line tools compile and execute ZigZag
files.
– A compiler transforms a ZigZag program
(.zz) into a Meandre archive unit (.mau).
– MAUs can then be executed by a Meandre engine.
20. Meandre: ZigZag Script Language
• ZigZag code that represents the example flow:
#
# Imports the three required components and creates the component aliases
#
import <http://localhost:1714/public/services/demo_repository.rdf>
alias <http://test.org/component/push_string> as PUSH
alias <http://test.org/component/concatenate-strings> as CONCAT
alias <http://test.org/component/print-object> as PRINT
#
# Creates four instances for the flow
#
push_hello, push_world, concat, print = PUSH(), PUSH(), CONCAT(), PRINT()
#
# Sets up the properties of the instances
#
push_hello.message, push_world.message = "Hello ", "world!"
#
# Describes the data-intensive flow
#
@phres, @pwres = push_hello(), push_world()
@cres = concat( string_one: phres.string; string_two: pwres.string )
print( object: cres.concatenated_string )
#
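For readers unfamiliar with ZigZag, the same wiring can be approximated in plain Python as a directed graph that is executed in dependency order. This is only an analogue under stated assumptions: the `components` table, the `connections` list, and `run_flow` are hypothetical names, the print step returns its text instead of writing to a console, and a real Meandre engine schedules component instances as data arrives rather than walking a topological order. It requires Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each instance maps its named inputs to a dict of named outputs.
components = {
    "push_hello": lambda inputs: {"string": "Hello "},
    "push_world": lambda inputs: {"string": "world!"},
    "concat": lambda inputs: {
        "concatenated_string": inputs["string_one"] + inputs["string_two"]
    },
    "print": lambda inputs: {"printed": str(inputs["object"])},
}

# ((source instance, output port), (target instance, input port))
connections = [
    (("push_hello", "string"), ("concat", "string_one")),
    (("push_world", "string"), ("concat", "string_two")),
    (("concat", "concatenated_string"), ("print", "object")),
]


def run_flow(components, connections):
    """Run every instance once, after all of its upstream instances."""
    predecessors = {name: set() for name in components}
    for (source, _), (target, _) in connections:
        predecessors[target].add(source)

    results = {}
    for name in TopologicalSorter(predecessors).static_order():
        inputs = {
            target_port: results[source][source_port]
            for (source, source_port), (target, target_port) in connections
            if target == name
        }
        results[name] = components[name](inputs)
    return results


results = run_flow(components, connections)
print(results["print"]["printed"])
```

As in the ZigZag script, the flow is described purely by which output port feeds which input port; no instance calls another directly.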
21. Meandre: ZigZag Script Language
• Automatic Parallelization
– Multiple instances of a component can be run in parallel to boost
throughput.
– A specialized ZigZag operator causes multiple
instances of a given component to be used.
• Consider the simple flow example shown in the diagram
• The dataflow declaration would look like
#
# Describes the data-intensive flow
#
@pu = push()
@pt = pass( string:pu.string )
print( object:pt.string )
22. Meandre: ZigZag Script Language
• Automatic Parallelization
– Adding the operator [+AUTO] to the middle component
# Describes the data-intensive flow
#
@pu = push()
@pt = pass( string:pu.string ) [+AUTO]
print( object:pt.string )
– [+AUTO] tells the ZigZag compiler to parallelize the “pass”
component instance by the number of cores available on the
system.
– [+AUTO] may also be written [+N], where N is a numeric
value, for example [+10].
23. Meandre: ZigZag Script Language
• Automatic Parallelization
– Adding the operator [+4] would result in a directed graph
# Describes the data-intensive flow
#
@pu = push()
@pt = pass( string:pu.string ) [+4]
print( object:pt.string )
– The same flow written with the [+4!] form of the operator
# Describes the data-intensive flow
#
@pu = push()
@pt = pass( string:pu.string ) [+4!]
print( object:pt.string )
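The effect of fanning the middle component out over several instances can be approximated in Python with a thread pool. This is a loose analogy, not what the ZigZag compiler generates: `push`, `pass_through`, and `run_flow` are hypothetical stand-ins, `parallel=4` plays the role of [+4], and `parallel=None` falls back to the core count in the spirit of [+AUTO].

```python
import os
from concurrent.futures import ThreadPoolExecutor


def push():
    """Stand-in for the push component: emits a stream of strings."""
    return [f"item-{i}" for i in range(8)]


def pass_through(string):
    """Stand-in for the pass component: forwards its input unchanged."""
    return string


def run_flow(parallel=None):
    """Run push -> pass -> collect with several pass workers.

    parallel=N uses N workers; parallel=None uses one per available core.
    """
    workers = parallel or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order, even though the
        # individual calls may complete out of order.
        return list(pool.map(pass_through, push()))


out = run_flow(parallel=4)
print(out)
```

The flow description itself is unchanged; only the number of workers behind the middle step varies.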
24. Meandre: Flows to MAU
• Flows can be executed using their RDF
descriptors
• Flows can be compiled into a MAU (Meandre archive unit)
• A MAU is:
– Self-contained representation
– Ready for execution
– Portable
– The base of flow execution in grid environments
25. And Behind The Scenes?
• Architecture designed to scale
• Infrastructure
– Laptop
– Server
– Cluster
• Tools
– Talk to the infrastructure
– Workbench, ZigZag
26. Meandre: The Architecture
• The design of the Meandre architecture follows
three directives:
– provide a robust, transparent, and scalable solution from
a laptop to large-scale clusters
– create a unified solution for batch and interactive tasks
– encourage reusing and sharing components
• To meet these goals, the architecture
relies on four stacked layers and builds on top of
service-oriented architectures (SOA)
27. Meandre: Basic Single Server
28. Meandre MDX: Cloud Computing
• Servers can be
– instantiated on demand
– disposed of when done or on demand
• A cluster is formed by at least one server
• The Meandre Distributed Exchange (MDX)
– Orchestrates operational integrity by managing cluster
configuration and membership using a shared database
resource.
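The idea of tracking cluster membership through a shared database can be illustrated with a small registry table. Everything here is an assumption made for the sketch: the slides do not describe MDX's actual schema or protocol, an in-memory SQLite database stands in for the shared database resource, and the `servers` table and the `join`, `leave`, and `members` helpers are hypothetical.

```python
import sqlite3


def open_registry(conn):
    """Create the (hypothetical) membership table if it does not exist."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS servers ("
        "server_id TEXT PRIMARY KEY, host TEXT NOT NULL, port INTEGER NOT NULL)"
    )


def join(conn, server_id, host, port):
    """A server instantiated on demand registers itself with the cluster."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO servers VALUES (?, ?, ?)",
            (server_id, host, port),
        )


def leave(conn, server_id):
    """A server that is disposed of removes itself from the cluster."""
    with conn:
        conn.execute("DELETE FROM servers WHERE server_id = ?", (server_id,))


def members(conn):
    """Current cluster membership as seen through the shared database."""
    rows = conn.execute("SELECT server_id FROM servers ORDER BY server_id")
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")  # stand-in for the shared database
open_registry(conn)
join(conn, "server-a", "localhost", 1714)
join(conn, "server-b", "localhost", 1715)
leave(conn, "server-b")
print(members(conn))
```

Because every server reads and writes the same table, the cluster needs no other channel to agree on who is currently a member, and a cluster of one server is simply a table with one row.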
29. Meandre MDX: The Picture
[Figure: MDX Backbone]
30. Meandre MDX: The Architecture
• Virtualization infrastructure
– Provides uniform access to the underlying execution
environment. It relies on virtualization of machines and
the usage of Java for hardware abstraction.
• IO standardization
– A unified layer provides access to shared data stores,
distributed file systems, specialized metadata stores,
and other service-oriented architecture
gateways.
31. Meandre MDX: The Architecture
• Data-intensive flow infrastructure
– Provides the basic Meandre execution engine for
data-intensive flows, component repositories and discovery
mechanisms, extensible plugins, and web user
interfaces (webUIs).
• Interaction layer
– Can provide self-contained applications via webUIs,
create plugins for third-party services, interact with the
embedding application that relies on the Meandre
engine, or provide services to the cloud.