Sven Schlarb of the Austrian National Library gave this introduction to large scale preservation workflows with Taverna at the first SCAPE Training event, ‘Keeping Control: Scalable Preservation Environments for Identification and Characterisation’, in Guimarães, Portugal on 6-7 December 2012.
Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012
1. SCAPE
Large scale preservation workflows with Taverna
Sven Schlarb
Austrian National Library
Keeping Control: Scalable Preservation Environments for Identification and Characterisation
Guimarães, Portugal, 07/12/2012
2. SCAPE
What do you mean by "workflow"?
• Data flow rather than control flow
• (Semi-)Automated data processing pipeline
• Defined inputs and outputs
• Modular and reusable processing units
• Easy to deploy, execute, and share
3. SCAPE
Modularise complex preservation tasks
• Assuming that complex preservation tasks can be
separated into processing steps
• Together the steps represent the automated
processing pipeline
Example pipeline steps: Migrate, Quality Assurance, Characterise, Ingest
4. SCAPE
Experimental workflow development
• Easy to execute a workflow on standard platforms
from anywhere
• Experimental data available online or downloadable
• Reproducible experiment results
• Workflow development as a community activity
5. SCAPE
Taverna
• Workflow language and computational model for
creating composite data-intensive processing chains
• Developed since 2004 as a tool for life scientists and
bio-informaticians by myGrid, University of
Manchester, UK
• Available for Windows/Linux/OSX and as open
source (LGPL)
6. SCAPE
SCUFL/T2FLOW/SCUFL2
• Alternative to other workflow description languages,
such as the Business Process Execution Language
(BPEL)
• SCUFL2 is Taverna's new workflow specification
language (Taverna 3), workflow bundle format, and
Java API
• SCUFL2 will replace the t2flow format (which
replaced the SCUFL format)
• Adopts Linked Data technology
7. SCAPE
Creating workflows using Taverna
• Users interactively build data processing pipelines
• Set of nodes represents data processing elements
• Nodes are connected by directed edges and the
workflow itself is a directed graph
• Nodes can have multiple inputs and outputs
• Workflows can contain other (embedded) workflows
8. SCAPE
Processors
• Web service clients (SOAP/REST)
• Local scripts (R and Beanshell languages)
• Remote shell script invocations via ssh (Tool)
• XML splitters - XSLT (interoperability!)
9. SCAPE
List handling: Implicit iteration over multiple
inputs
• A "single value" input port (list depth 0) processes
values iteratively (foreach)
• A flat value list has list depth 1
• List depth > 1 for tree structures
• Multiple input ports with lists are combined as cross
product or dot product
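The two iteration strategies can be illustrated outside Taverna: a cross product combines every item of one list with every item of the other, while a dot product pairs items position by position. A minimal Python sketch (the file and tool lists are hypothetical, not Taverna code):

```python
from itertools import product

files = ["a.jp2", "b.jp2"]
tools = ["exiftool", "tika"]

# Cross product: every file is combined with every tool (2 x 2 = 4 pairs).
cross = list(product(files, tools))

# Dot product: items are paired by position (2 pairs).
dot = list(zip(files, tools))

print(cross)  # [('a.jp2', 'exiftool'), ('a.jp2', 'tika'), ('b.jp2', 'exiftool'), ('b.jp2', 'tika')]
print(dot)    # [('a.jp2', 'exiftool'), ('b.jp2', 'tika')]
```

In Taverna the same choice is made per processor via its list handling configuration; the port depths determine how deep the implicit iteration descends.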
10. SCAPE
Example: Tika Preservation Component
• Input: "file"
• Processor: Tika web service (SOAP)
• Output: MIME type
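Tika itself is a Java library doing content-based detection behind the SOAP service. As a rough local stand-in for what the component returns, Python's standard `mimetypes` module guesses a MIME type from the file name alone (extension-based only, far weaker than Tika's content sniffing):

```python
import mimetypes

def guess_mime(path):
    """Guess a MIME type from the file extension (a crude stand-in for
    the content-based detection the Tika web service performs)."""
    mime, _encoding = mimetypes.guess_type(path)
    return mime or "application/octet-stream"

print(guess_mime("report.pdf"))   # application/pdf
print(guess_mime("index.html"))   # text/html
```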
11. SCAPE
Workflow development and execution
• Local development: Taverna Workbench
12. SCAPE
Workflow registry
• Web 2.0 style registry: myExperiment
13. SCAPE
Remote Workflow Execution
• Web client using REST API of Taverna Server
14. SCAPE
Hadoop
• Open source implementation of MapReduce
(Dean & Ghemawat, Google, 2004)
• Hadoop = MapReduce + HDFS
• HDFS: Distributed file system, data stored in 64MB
(default) blocks
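The fixed block size means a file's block count is simple arithmetic (plain Python, not Hadoop code):

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # HDFS default block size (64 MB)

def num_blocks(file_size_bytes):
    # A file is split into fixed-size blocks; the last block may be partial.
    return math.ceil(file_size_bytes / BLOCK_SIZE)

print(num_blocks(1 * 1024**3))  # a 1 GB file occupies 16 blocks
```

Each block is stored (and replicated) independently across the data nodes, which is what lets map tasks run on the node that already holds the data.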
15. SCAPE
Hadoop
• Job tracker (master) manages job execution on task
trackers (workers)
• Each machine is configured to dedicate processing
cores to MapReduce tasks (each core is a worker)
• Name node manages HDFS, i.e. distribution of data
blocks on data nodes
16. SCAPE
Hadoop job building blocks
A Hadoop job is packaged as a map/reduce application (JAR) and consists of:
• Job configuration: set or overwrite configuration parameters
• Map method: create intermediate key/value pair output
• Reduce method: aggregate the intermediate key/value pair output from the map phase
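The model on this slide can be sketched in plain Python (no Hadoop involved): a map function emits intermediate key/value pairs, the pairs are grouped by key (the shuffle phase, which Hadoop performs itself), and a reduce function aggregates each group. A word-count sketch:

```python
from collections import defaultdict

def map_fn(line):
    # Map method: emit one intermediate (key, value) pair per word.
    for word in line.split():
        yield word, 1

def reduce_fn(key, values):
    # Reduce method: aggregate all intermediate values for one key.
    return key, sum(values)

def run_job(lines):
    # Shuffle: group intermediate pairs by key (done by Hadoop in reality).
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))

print(run_job(["to be or not to be"]))
# {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

In a real job the map and reduce methods run in parallel on the task trackers, each map task typically working on one HDFS block.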
18. SCAPE
Large scale execution environment
[Architecture diagram: web application and Taverna Server (REST API) running on Apache Tomcat, a file server, and the Hadoop job tracker in front of the cluster]
19. SCAPE
Example: Characterisation on a large document collection
• Using the "Tool" service (remote ssh execution)
• Orchestration of Hadoop jobs (Hadoop Streaming API, Hadoop Map/Reduce, and Hive)
• Available on myExperiment:
http://www.myexperiment.org/workflows/3105
• See blog post:
http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna
20. SCAPE
Create a text file containing the JPEG2000 input file paths and read image metadata using ExifTool via the Hadoop Streaming API.
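Hadoop Streaming runs any executable as the map task: it feeds input lines on stdin and expects tab-separated key/value lines on stdout. A minimal sketch of such a mapper, assuming `exiftool` is installed on every task tracker and parsing its default human-readable output (the function and variable names are illustrative, not from the workflow itself):

```python
#!/usr/bin/env python
# Hadoop Streaming mapper (sketch): reads one JP2 file path per input
# line, runs ExifTool on it, and emits "path<TAB>width" pairs on stdout.
import subprocess
import sys

def parse_image_width(exiftool_output):
    # Extract the "Image Width" value from ExifTool's default output
    # (lines such as "Image Width                     : 2250").
    for line in exiftool_output.splitlines():
        tag, _, value = line.partition(":")
        if tag.strip() == "Image Width":
            return int(value.strip())
    return None

def run_mapper(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        path = line.strip()
        if not path:
            continue
        out = subprocess.run(["exiftool", path],
                             capture_output=True, text=True).stdout
        # Streaming contract: key and value separated by a tab.
        stdout.write("%s\t%s\n" % (path, parse_image_width(out)))
```

The emitted key/value lines are what later gets loaded into the Hive tables shown on the following slides.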
27. SCAPE
Analytic Queries: HiveLoadExifData & HiveLoadHocrData
CREATE TABLE htmlwidth (hid STRING, hwidth INT)
Sample rows in htmlwidth:
Z119585409/00000001  1870
Z119585409/00000002  2100
Z119585409/00000003  2015
Z119585409/00000004  1350
Z119585409/00000005  1700
CREATE TABLE jp2width (jid STRING, jwidth INT)
Sample rows in jp2width:
Z119585409/00000001  2250
Z119585409/00000002  2150
Z119585409/00000003  2125
Z119585409/00000004  2125
Z119585409/00000005  2250
60,000 books, 24 million pages processed in ~6h
28. SCAPE
Analytic Queries: HiveSelect
select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid
60,000 books, 24 million pages processed in ~6h
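The effect of that inner join can be checked against the sample rows from the previous slide (a plain-Python simulation, not Hive):

```python
# Sample rows from the two Hive tables, keyed by page id.
jp2width = {
    "Z119585409/00000001": 2250,
    "Z119585409/00000002": 2150,
    "Z119585409/00000003": 2125,
}
htmlwidth = {
    "Z119585409/00000001": 1870,
    "Z119585409/00000002": 2100,
    "Z119585409/00000003": 2015,
}

# SELECT jid, jwidth, hwidth FROM jp2width INNER JOIN htmlwidth ON jid = hid
joined = [(jid, jwidth, htmlwidth[jid])
          for jid, jwidth in sorted(jp2width.items())
          if jid in htmlwidth]
for row in joined:
    print(row)
```

Each output row pairs the JPEG2000 image width with the width recorded in the corresponding hOCR/HTML file, which is the comparison the quality check is based on.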
29. SCAPE
Run a simple Hive query to check that the database has been created successfully.
31. SCAPE
Hands on – Virtual machine
• Pseudo-distributed Hadoop configuration (0.20.2+923.421)
• Chromium web browser with Hadoop admin links
• Taverna Workbench 2.3.0
• NetBeans IDE 7.1.2
• SampleHadoopCommand.txt (executable Hadoop command for Demo1)
• Latest patches
32. SCAPE
Hands on – VM setup
• Unpack scape4youTraining.tar.gz
• VirtualBox: Machine => Add => browse to the folder => select the VBOX file
• VM instance login:
• user: scape
• pw: scape123
33. SCAPE
Hands on – Demo1
• Using Hadoop for analysing ARC files
• Located at:
/example/sampleIN/ (HDFS)
• Execution via command in:
SampleHadoopCommand.txt
(on Desktop)
• Result can then be found at:
/example/sample_OUT/
34. SCAPE
Hands on – Demo2
• Using Taverna for analysing ARC files
• Workflow:
/home/scape/scanARC/scanARC_TIKA.t2flow
• Use "Add file location" (not "Add value"!)
• Input:
/home/scape/scanARC/input/ONBSample.txt
• Result:
~/scanARC/outputCSV/fullTIKAReport.csv
• See ~/scanARC/outputGraphics/