SCAPE
Large scale preservation workflows with
Taverna


Sven Schlarb
Austrian National Library

Keeping Control: Scalable Preservation Environments for Identification and
Characterisation
Guimarães, Portugal, 07/12/2012
SCAPE
What do you mean by "Workflow"?

• Data flow rather than control flow
• (Semi-)Automated data processing pipeline
• Defined inputs and outputs
• Modular and reusable processing units
• Easy to deploy, execute, and share
SCAPE
Modularise complex preservation tasks

• Assuming that complex preservation tasks can be
  separated into processing steps
• Together the steps represent the automated
  processing pipeline

[Diagram: Migrate → Characterise → Quality Assurance → Ingest]
SCAPE
     Experimental workflow development

• Easy to execute a workflow on standard platforms
  from anywhere
• Experimental data available online or downloadable
• Reproducible experiment results
• Workflow development as a community activity
SCAPE
                     Taverna

• Workflow language and computational model for
  creating composite data-intensive processing chains
• Developed since 2004 as a tool for life scientists and
  bio-informaticians by myGrid, University of
  Manchester, UK
•   Available for Windows/Linux/OSX and as open source (LGPL)
SCAPE
            SCUFL/T2FLOW/SCUFL2

• Alternative to other workflow description languages,
  such as the Business Process Execution Language (BPEL)
• SCUFL2 is Taverna's new workflow specification
  language (Taverna 3), workflow bundle format, and
  Java API
• SCUFL2 will replace the t2flow format (which
  replaced the SCUFL format)
• Adopts Linked Data technology
SCAPE
       Creating workflows using Taverna

• Users interactively build data processing pipelines
• Set of nodes represents data processing elements
• Nodes are connected by directed edges and the
  workflow itself is a directed graph
• Nodes can have multiple inputs and outputs
• Workflows can contain other (embedded) workflows
SCAPE
                     Processors

•   Web service clients (SOAP/REST)
•   Local scripts (R and Beanshell languages)
•   Remote shell script invocations via ssh (Tool)
•   XML splitters - XSLT (interoperability!)
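As an illustration of the "local script" processor type: a Beanshell script in Taverna is plain Java-syntax code whose input ports appear as variables and whose output ports are simply the variables the script assigns. The port names below (filePath, extension) are invented for this sketch and are not part of any SCAPE component:

    // Hypothetical Beanshell processor: derive a file extension from the
    // value arriving on an input port named "filePath" and expose it on an
    // output port named "extension".
    String path = (String) filePath;
    int dot = path.lastIndexOf('.');
    String extension = (dot >= 0) ? path.substring(dot + 1) : "";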
SCAPE
List handling: Implicit iteration over multiple inputs

• A "single value" input port (list depth 0) processes values iteratively (foreach)
• A flat value list has list depth 1
• List depth > 1 for tree structures
• Multiple input ports with lists are combined as cross product or dot product
  (see the sketch below)
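For intuition only, here is a plain Java sketch (not Taverna code) of how two list inputs are combined under the two strategies; the example values are made up:

    import java.util.Arrays;
    import java.util.List;

    public class ListCombination {
        public static void main(String[] args) {
            List<String> files = Arrays.asList("a.jp2", "b.jp2");
            List<String> tools = Arrays.asList("exiftool", "jpylyzer");

            // Cross product: every value of one list with every value of the
            // other, so 2 x 2 = 4 processor invocations.
            for (String f : files)
                for (String t : tools)
                    System.out.println("cross: " + t + " " + f);

            // Dot product: values are paired index by index, so 2 invocations.
            for (int i = 0; i < Math.min(files.size(), tools.size()); i++)
                System.out.println("dot: " + tools.get(i) + " " + files.get(i));
        }
    }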
SCAPE
Example: Tika Preservation Component

• Input: "file"
• Processor: Tika web service (SOAP)
• Output: MIME type
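The workflow calls Tika as a SOAP web service, so no local code is needed; purely to show what such a service does behind the scenes, here is a sketch using the Apache Tika Java API directly (a local stand-in, not the SCAPE Tika component):

    import java.io.File;
    import java.io.IOException;
    import org.apache.tika.Tika;

    public class DetectMimeType {
        public static void main(String[] args) throws IOException {
            // Identify the MIME type of one file, as the Tika preservation
            // component does for each value arriving on its "file" input port.
            Tika tika = new Tika();
            String mimeType = tika.detect(new File(args[0]));
            System.out.println(args[0] + "\t" + mimeType);
        }
    }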
SCAPE
    Workflow development and execution
• Local development: Taverna Workbench
SCAPE
               Workflow registry
• Web 2.0 style registry: myExperiment
SCAPE
          Remote Workflow Execution
• Web client using REST API of Taverna Server
SCAPE
Hadoop

• Open source implementation of MapReduce
  (Dean & Ghemawat, Google, 2004)
• Hadoop = MapReduce + HDFS
• HDFS: distributed file system, data stored in 64 MB
  (default) blocks
SCAPE
                    Hadoop

• Job tracker (master) manages job execution on task
  trackers (workers)
• Each machine is configured to dedicate processing
  cores to MapReduce tasks (each core is a worker)
• Name node manages HDFS, i.e. distribution of data
  blocks on data nodes
SCAPE
Hadoop job building blocks

A MapReduce application (JAR) consists of:

• Job configuration: set or overwrite configuration parameters
• Map method: create intermediate key/value pair output
• Reduce method: aggregate the intermediate key/value pair output from the map phase
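To make the three building blocks concrete, here is a minimal, generic Hadoop job (new MapReduce API) that counts records per key in tab-separated text input; the class names, the configuration override, and the input format are illustrative and are not the SCAPE jobs shown later in this deck:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class RecordCountJob {

        // Map method: create intermediate key/value pair output.
        // Reads "pageId<TAB>value" lines and emits (pageId, 1).
        public static class CountMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String[] fields = line.toString().split("\t");
                context.write(new Text(fields[0]), ONE);
            }
        }

        // Reduce method: aggregate the intermediate output from the map phase.
        public static class CountReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        // Job configuration: set or overwrite configuration parameters and
        // wire the mapper and reducer into the job.
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setInt("mapred.reduce.tasks", 4);       // example override
            Job job = new Job(conf, "record-count");
            job.setJarByClass(RecordCountJob.class);     // the map/reduce application JAR
            job.setMapperClass(CountMapper.class);
            job.setReducerClass(CountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }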
SCAPE
Cluster
SCAPE
Large scale execution environment

[Figure: execution environment overview showing a file server (NAS) holding the
content, the Hadoop Jobtracker on the cluster, and a Taverna Server (REST API)
running as an Apache Tomcat web application.]
SCAPE
Example: Characterisation of a large document collection

• Using the "Tool" service, remote ssh execution
• Orchestration of Hadoop jobs (Hadoop Streaming API,
  Hadoop MapReduce, and Hive)
• Available on myExperiment:
  http://www.myexperiment.org/workflows/3105
• See blog post:
  http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna
SCAPE




Create a text file containing the JPEG2000 input file paths and read
image metadata using ExifTool via the Hadoop Streaming API.
Reading image metadata

[Figure: Jp2PathCreator runs find over the NAS and writes the JPEG2000 file
paths (e.g. /NAS/Z119585409/00000001.jp2, …) to a text file of roughly 1.4 GB;
HadoopStreamingExiftoolRead then reads each image with ExifTool and emits one
key/value record per page, e.g. "Z119585409/00000001  2345", roughly 1.2 GB of
output in total.]

60,000 books / 24 million pages: ~5 h (path creation) + ~38 h (ExifTool) = ~43 h
SCAPE




Create a text file containing the HTML input file paths and create
one SequenceFile with the complete file content in HDFS.
SequenceFile creation

[Figure: HtmlPathCreator runs find over the NAS and writes the HTML file paths
(e.g. /NAS/Z119585409/00000707.html, …) to a text file of roughly 1.4 GB;
SequenceFileCreator then packs the file contents into one SequenceFile in HDFS,
keyed by page identifier (Z119585409/00000707, Z119585409/00000708, …), roughly
997 GB uncompressed.]

60,000 books / 24 million pages: ~5 h (path creation) + ~24 h (SequenceFile) = ~29 h
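As a rough sketch of what a SequenceFile creation step can look like (the class name, HDFS path, and key scheme below are invented for the example), the Hadoop SequenceFile.Writer API packs many small HTML files into one large key/value file in HDFS:

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class HtmlToSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // One SequenceFile in HDFS, keyed by page identifier and holding
            // the raw HTML bytes as values.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, new Path("/user/scape/html.seq"),
                    Text.class, BytesWritable.class);
            try {
                for (String htmlPath : args) {           // one local HTML path per argument
                    byte[] content = Files.readAllBytes(Paths.get(htmlPath));
                    String pageId = htmlPath.replace("/NAS/", "").replace(".html", "");
                    writer.append(new Text(pageId), new BytesWritable(content));
                }
            } finally {
                writer.close();
            }
        }
    }

Bundling millions of small files into a few SequenceFiles is the usual way to sidestep Hadoop's small-files problem: HDFS and MapReduce handle a few very large files far better than millions of tiny ones.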
SCAPE




Execute a Hadoop MapReduce job using the SequenceFile created
before in order to calculate the average paragraph block width.
HTML Parsing

[Figure: HadoopAvBlockWidthMapReduce reads the SequenceFile; the map phase emits
one (page, width) pair per paragraph block, e.g. Z119585409/00000001 with 2100,
2200, 2300 and 2400, and the reduce phase averages them per page, e.g.
Z119585409/00000001  2250, writing the result to a text file.]

60,000 books / 24 million pages: ~6 h
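Only as a guess at how the reduce step of such a job could be written (following the driver and mapper pattern sketched earlier), here is a reducer that averages the per-block widths emitted by the map phase for each page, e.g. 2100, 2200, 2300 and 2400 becoming 2250:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class AverageBlockWidthReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text pageId, Iterable<IntWritable> widths, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            int count = 0;
            for (IntWritable w : widths) {   // one value per paragraph block on this page
                sum += w.get();
                count++;
            }
            if (count > 0) {
                context.write(pageId, new IntWritable((int) (sum / count)));
            }
        }
    }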
SCAPE




Create the Hive tables and load the generated data into the Hive database.
Analytic Queries: HiveLoadExifData & HiveLoadHocrData

[Figure: the two result files are loaded into Hive tables. The HTML block widths
(e.g. Z119585409/00000001  1870) go into table htmlwidth, the JPEG2000 image
widths (e.g. Z119585409/00000001  2250) into table jp2width.]

CREATE TABLE htmlwidth (hid STRING, hwidth INT)

CREATE TABLE jp2width (jid STRING, jwidth INT)

60,000 books / 24 million pages: ~6 h
Analytic Queries: HiveSelect

[Figure: the jp2width and htmlwidth tables are joined per page identifier.]

select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid

60,000 books / 24 million pages: ~6 h
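The same join can also be issued programmatically. The sketch below uses plain JDBC; the driver class, URL scheme, and port are the defaults of the HiveServer1-era Hive JDBC driver and are assumptions that may differ in a given installation:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveWidthJoin {
        public static void main(String[] args) throws Exception {
            // Assumes a Hive server listening on localhost:10000.
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                    "jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();
            ResultSet rs = stmt.executeQuery(
                    "select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getInt(2) + "\t" + rs.getInt(3));
            }
            con.close();
        }
    }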
SCAPE




Run a simple Hive query to check whether the database has been created successfully.
SCAPE
Example: Web Archiving
SCAPE
Hands on – Virtual machine

• Pseudo-distributed Hadoop configuration (0.20.2+923.421)
• Chromium web browser with Hadoop admin links
• Taverna Workbench 2.3.0
• NetBeans IDE 7.1.2
• SampleHadoopCommand.txt (executable Hadoop command for DEMO1)
• Latest patches
SCAPE
Hands on – VM setup

• Unpack scape4youTraining.tar.gz
• VirtualBox: Machine => Add => Browse to folder => select VBOX file
• VM instance login:
  • user: scape
  • pw: scape123
SCAPE
               Hands on – Demo1

• Using Hadoop for analysing ARC files
• Located at:
   /example/sampleIN/ (HDFS)
• Execution via command in:
   SampleHadoopCommand.txt
  (on Desktop)
• Result can then be found at:
   /example/sample_OUT/
SCAPE
Hands on – Demo2

• Using Taverna for analysing ARC files
• Workflow:
  /home/scape/scanARC/scanARC_TIKA.t2flow
  • ADD FILE LOCATION (not add value!!)
  • Input:
    /home/scape/scanARC/input/ONBSample.txt
• Result:
  ~/scanARC/outputCSV/fullTIKAReport.csv
• See ~/scanARC/outputGraphics/
