Sven Schlarb of the Austrian National Library gave this introduction to large scale preservation workflows with Taverna at the first SCAPE Training event, ‘Keeping Control: Scalable Preservation Environments for Identification and Characterisation’, in Guimarães, Portugal on 6-7 December 2012.
Large scale preservation workflows with Taverna – SCAPE Training event, Guimarães 2012
1. SCAPE
Large scale preservation workflows with Taverna
Sven Schlarb
Austrian National Library
Keeping Control: Scalable Preservation Environments for Identification and Characterisation
Guimarães, Portugal, 07/12/2012
2. SCAPE
What do you mean by "workflow"?
• Data flow rather than control flow
• (Semi-)Automated data processing pipeline
• Defined inputs and outputs
• Modular and reusable processing units
• Easy to deploy, execute, and share
3. SCAPE
Modularise complex preservation tasks
• Assuming that complex preservation tasks can be
separated into processing steps
• Together the steps represent the automated
processing pipeline
Example pipeline steps: Migrate, Quality Assurance, Characterise, Ingest
4. SCAPE
Experimental workflow development
• Easy to execute a workflow on standard platforms
from anywhere
• Experimental data available online or downloadable
• Reproducible experiment results
• Workflow development as a community activity
5. SCAPE
Taverna
• Workflow language and computational model for
creating composite data-intensive processing chains
• Developed since 2004 as a tool for life scientists and
bio-informaticians by myGrid, University of
Manchester, UK
• Available for Windows/Linux/OSX and as open
source (LGPL)
6. SCAPE
SCUFL/T2FLOW/SCUFL2
• Alternative to other workflow description languages,
such as the Business Process Execution Language
(BPEL)
• SCUFL2 is Taverna's new workflow specification
language (Taverna 3), workflow bundle format, and
Java API
• SCUFL2 will replace the t2flow format (which
replaced the SCUFL format)
• Adopts Linked Data technology
7. SCAPE
Creating workflows using Taverna
• Users interactively build data processing pipelines
• Set of nodes represents data processing elements
• Nodes are connected by directed edges and the
workflow itself is a directed graph
• Nodes can have multiple inputs and outputs
• Workflows can contain other (embedded) workflows
8. SCAPE
Processors
• Web service clients (SOAP/REST)
• Local scripts (R and Beanshell languages)
• Remote shell script invocations via ssh (Tool)
• XML splitters - XSLT (interoperability!)
9. SCAPE
List handling: Implicit iteration over multiple
inputs
• A "single value" input port (list depth 0) processes
values iteratively (foreach)
• A flat value list has list depth 1
• List depth > 1 for tree structures
• Multiple input ports with lists are combined as cross
product or dot product
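The two iteration strategies can be illustrated outside Taverna: a cross product combines every item of one list with every item of the other, while a dot product pairs items position by position. A minimal Python sketch (the file and tool lists are hypothetical, not Taverna code):

```python
from itertools import product

files = ["a.jp2", "b.jp2"]
tools = ["exiftool", "tika"]

# Cross product: every file is combined with every tool (2 x 2 = 4 pairs).
cross = list(product(files, tools))

# Dot product: items are paired by position (2 pairs).
dot = list(zip(files, tools))

print(cross)  # [('a.jp2', 'exiftool'), ('a.jp2', 'tika'), ('b.jp2', 'exiftool'), ('b.jp2', 'tika')]
print(dot)    # [('a.jp2', 'exiftool'), ('b.jp2', 'tika')]
```

In Taverna the same choice is made per processor via its list handling configuration; the port depths determine how deep the implicit iteration descends.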
10. SCAPE
Example: Tika Preservation Component
• Input: "file"
• Processor: Tika web service (SOAP)
• Output: MIME type
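Tika itself is a Java library doing content-based detection behind the SOAP service. As a rough local stand-in for what the component returns, Python's standard `mimetypes` module guesses a MIME type from the file name alone (extension-based only, far weaker than Tika's content sniffing):

```python
import mimetypes

def guess_mime(path):
    """Guess a MIME type from the file extension (a crude stand-in for
    the content-based detection the Tika web service performs)."""
    mime, _encoding = mimetypes.guess_type(path)
    return mime or "application/octet-stream"

print(guess_mime("report.pdf"))   # application/pdf
print(guess_mime("index.html"))   # text/html
```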
11. SCAPE
Workflow development and execution
• Local development: Taverna Workbench
12. SCAPE
Workflow registry
• Web 2.0 style registry: myExperiment
13. SCAPE
Remote Workflow Execution
• Web client using REST API of Taverna Server
14. SCAPE
Hadoop
• Open source implementation of MapReduce
(Dean & Ghemawat, Google, 2004)
• Hadoop = MapReduce + HDFS
• HDFS: Distributed file system, data stored in 64MB
(default) blocks
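The fixed block size means a file's block count is simple arithmetic (plain Python, not Hadoop code):

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # HDFS default block size (64 MB)

def num_blocks(file_size_bytes):
    # A file is split into fixed-size blocks; the last block may be partial.
    return math.ceil(file_size_bytes / BLOCK_SIZE)

print(num_blocks(1 * 1024**3))  # a 1 GB file occupies 16 blocks
```

Each block is stored (and replicated) independently across the data nodes, which is what lets map tasks run on the node that already holds the data.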
15. SCAPE
Hadoop
• Job tracker (master) manages job execution on task
trackers (workers)
• Each machine is configured to dedicate processing
cores to MapReduce tasks (each core is a worker)
• Name node manages HDFS, i.e. distribution of data
blocks on data nodes
16. SCAPE
Hadoop job building blocks
A Hadoop job is packaged as a map/reduce application (JAR) and consists of:
• Job configuration: set or overwrite configuration parameters
• Map method: create intermediate key/value pair output
• Reduce method: aggregate the intermediate key/value pair output from the map phase
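The model on this slide can be sketched in plain Python (no Hadoop involved): a map function emits intermediate key/value pairs, the pairs are grouped by key (the shuffle phase, which Hadoop performs itself), and a reduce function aggregates each group. A word-count sketch:

```python
from collections import defaultdict

def map_fn(line):
    # Map method: emit one intermediate (key, value) pair per word.
    for word in line.split():
        yield word, 1

def reduce_fn(key, values):
    # Reduce method: aggregate all intermediate values for one key.
    return key, sum(values)

def run_job(lines):
    # Shuffle: group intermediate pairs by key (done by Hadoop in reality).
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))

print(run_job(["to be or not to be"]))
# {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

In a real job the map and reduce methods run in parallel on the task trackers, each map task typically working on one HDFS block.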
18. SCAPE
Large scale execution environment
[Architecture diagram: web application and Taverna Server (REST API) running on Apache Tomcat, a file server, and the Hadoop job tracker in front of the cluster]
19. SCAPE
Example: Characterisation on a large document collection
• Using the "Tool" service (remote ssh execution)
• Orchestration of Hadoop jobs (Hadoop Streaming API, Hadoop Map/Reduce, and Hive)
• Available on myExperiment:
http://www.myexperiment.org/workflows/3105
• See blog post:
http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna
20. SCAPE
Create a text file containing the JPEG2000 input file paths and read image metadata using ExifTool via the Hadoop Streaming API.
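Hadoop Streaming runs any executable as the map task: it feeds input lines on stdin and expects tab-separated key/value lines on stdout. A minimal sketch of such a mapper, assuming `exiftool` is installed on every task tracker and parsing its default human-readable output (the function and variable names are illustrative, not from the workflow itself):

```python
#!/usr/bin/env python
# Hadoop Streaming mapper (sketch): reads one JP2 file path per input
# line, runs ExifTool on it, and emits "path<TAB>width" pairs on stdout.
import subprocess
import sys

def parse_image_width(exiftool_output):
    # Extract the "Image Width" value from ExifTool's default output
    # (lines such as "Image Width                     : 2250").
    for line in exiftool_output.splitlines():
        tag, _, value = line.partition(":")
        if tag.strip() == "Image Width":
            return int(value.strip())
    return None

def run_mapper(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        path = line.strip()
        if not path:
            continue
        out = subprocess.run(["exiftool", path],
                             capture_output=True, text=True).stdout
        # Streaming contract: key and value separated by a tab.
        stdout.write("%s\t%s\n" % (path, parse_image_width(out)))
```

The emitted key/value lines are what later gets loaded into the Hive tables shown on the following slides.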
27. SCAPE
Analytic Queries: HiveLoadExifData & HiveLoadHocrData
CREATE TABLE htmlwidth (hid STRING, hwidth INT)
Sample rows in htmlwidth:
Z119585409/00000001  1870
Z119585409/00000002  2100
Z119585409/00000003  2015
Z119585409/00000004  1350
Z119585409/00000005  1700
CREATE TABLE jp2width (jid STRING, jwidth INT)
Sample rows in jp2width:
Z119585409/00000001  2250
Z119585409/00000002  2150
Z119585409/00000003  2125
Z119585409/00000004  2125
Z119585409/00000005  2250
60,000 books, 24 million pages processed in ~6h
28. SCAPE
Analytic Queries: HiveSelect
select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid
60,000 books, 24 million pages processed in ~6h
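The effect of that inner join can be checked against the sample rows from the previous slide (a plain-Python simulation, not Hive):

```python
# Sample rows from the two Hive tables, keyed by page id.
jp2width = {
    "Z119585409/00000001": 2250,
    "Z119585409/00000002": 2150,
    "Z119585409/00000003": 2125,
}
htmlwidth = {
    "Z119585409/00000001": 1870,
    "Z119585409/00000002": 2100,
    "Z119585409/00000003": 2015,
}

# SELECT jid, jwidth, hwidth FROM jp2width INNER JOIN htmlwidth ON jid = hid
joined = [(jid, jwidth, htmlwidth[jid])
          for jid, jwidth in sorted(jp2width.items())
          if jid in htmlwidth]
for row in joined:
    print(row)
```

Each output row pairs the JPEG2000 image width with the width recorded in the corresponding hOCR/HTML file, which is the comparison the quality check is based on.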
29. SCAPE
Run a simple Hive query to check that the database has been created successfully.
31. SCAPE
Hands on – Virtual machine
• Pseudo-distributed Hadoop configuration (0.20.2+923.421)
• Chromium web browser with Hadoop admin links
• Taverna Workbench 2.3.0
• NetBeans IDE 7.1.2
• SampleHadoopCommand.txt (executable Hadoop command for Demo1)
• Latest patches
32. SCAPE
Hands on – VM setup
• Unpack scape4youTraining.tar.gz
• VirtualBox: Machine => Add => browse to the folder => select the VBOX file
• VM instance login:
• user: scape
• pw: scape123
33. SCAPE
Hands on – Demo1
• Using Hadoop for analysing ARC files
• Located at:
/example/sampleIN/ (HDFS)
• Execution via command in:
SampleHadoopCommand.txt
(on Desktop)
• Result can then be found at:
/example/sample_OUT/
34. SCAPE
Hands on – Demo2
• Using Taverna for analysing ARC files
• Workflow:
/home/scape/scanARC/scanARC_TIKA.t2flow
• Use "Add file location" (not "Add value"!)
• Input:
/home/scape/scanARC/input/ONBSample.txt
• Result:
~/scanARC/outputCSV/fullTIKAReport.csv
• See ~/scanARC/outputGraphics/