SlideShare uma empresa Scribd logo
1 de 39
Kurator: Towards Data Curation
Workflows for Mere Mortals
An extensible, open-source workflow platform
for users & makers of data curation tools
B. Ludäscher J. Hanken D. Lowery J.A. Macklin
T. McPhillips P.J. Morris R.A. Morris T. Song
Problem: Data & Metadata Quality
• Collections & occurrence data can
be all over the map
• Examples:
– Lat/Long transposition, other geo-
ref issues (projections, …)
– Scientific Names (spelling errors,
other)
– Data entry/creation, “fuzzy” data,
naming issues, bit rot, data
conversions and transformations,
schema mappings, …
• Related:
– Filtered-Push
SPNHC'15 Kurator/P 2
What problems are we trying to solve?
• Detect and flag data quality issues
• Repair if possible
– … ask human curators as needed
• Keep track of provenance
– (semi-)automatic repairs
– human curators’ edits
• Employ workflow (semi-)automation
– Scientific workflow systems:
• Kepler/COMAD, Restflow, Galaxy, Biovel/Taverna, Argo, VisTrails, …
– Related technologies
• Akka parallel execution platform
• Script-based automation (e.g. Python, R), digital notebooks (iPython)
3SPNHC'15 Kurator/P
Customers of Curation Workflows
• Collection Managers
– … who are managing the collections databases
– Can run curation workflows periodically
• … in the presence of new data and/or new curation services
• (Biodiversity) Researchers
– To perform an analysis in the presence of (partially)
dirty data, researchers need to
• Clean or fix dirty data
• Throw out unfixable data
– Reporting back to the collection managers (cf. FPush)
4SPNHC'15 Kurator/P
How we do it (Part 1 …): Kepler curation workflows
• Why workflows? ASAP!
SPNHC'15 Kurator/P 5
Dou, Lei., G. Cao, P.J. Morris, R.A. Morris, B. Ludäscher, J.A. Macklin, J. Hanken. 2012. Kurator: A Kepler Package for
Data Curation Workflows, Procedia Computer Science, 9:1614-1619, doi:10.1016/j.procs.2012.04.177
Scientific Workflows: ASAP!
• Automation
– wfs to automate computational aspects of science
• Scaling (exploit and optimize machine cycles)
– wfs should make use of parallel compute resources
– wfs should be able handle large data
• Abstraction, Evolution, Reuse (human cycles)
– wfs should be easy to (re-)use, evolve, share
• Provenance
– wfs should capture processing history, data lineage
 traceable data- and wf-evolution
 Reproducible Science Trident
Workbench
VisTrails
SPNHC'15 Kurator/P 6
Scientific workflows: a(nother) silver bullet?
Beware of the Turing tar-pit in which everything is
possible but nothing of interest is easy.
—Alan Perlis
SPNHC'15 Kurator/P 7
I beg your pardon, I never promised
..
“Thanks to our Graphical UI your scientific workflows will be much
easier to develop, understand and maintain!”
Hmm… this was supposed to be easier than programming!
SPNHC'15 Kurator/P 8
Scientific Workflows …
Cabellos et al. Computer Physics Communications 182, 2011
SPNHC'15 Kurator/P 9
… are a wonderful thing …
SPNHC'15 Kurator/P 10
Norbert Podhorszki
(then: UC Davis)
… after simplifying a bit (here: Kepler/COMAD)
SPNHC'15 Kurator/P 11
Sven Köhler
(then: UC Davis)
So many systems (models of computation ~ languages, … )
… so little time …
SPNHC'15 Kurator/P 12
Sven Köhler
(then: UC Davis)
SPNHC'15 Kurator/P 13
Workflow Systems: Learning to program, all over again …
Scientific Workflow: it’s called R&D for a reason …
14
Workflow Modeling & Design
(prospective provenance)
Runtime Provenance
(traces,
retrospective
provenance)
Fault-tolerance
crash recovery
Scalability
parallel processing
SPNHC'15 Kurator/P
Meanwhile, on a nearby planet …
Highly dynamic visualization
(so dynamic, it’s hard to capture)
SPNHC'15 Kurator/P 15
It’s time to Shift Control …
• … back from being consumers of tools
“Just click here!”
• ... to tool makers!
• Kurator/P:
– Yes, develop for end users …
– … but don’t forget the tool makers!
• Can we do this together?
SPNHC'15 Kurator/P 16
How we do it (Part 2 of … )
• Kurator
– Apply workflow technologies and workflow thinking
– … in a technology agnostic way (if possible)
 Build a library of curation services such that curation
workflows can be run from various platforms
– Scientific workflow systems
• e.g. Restflow, Kepler, Taverna, Galaxy
– Other platforms
• e.g. Akka, Python-based, …
• … leveraging existing technologies
17SPNHC'15 Kurator/P
How we do it (Part 3 of … )
• YesWorkflow
– Grass-roots effort (goes well with stone-soup…)
– Scripts + User annotations
 Can give us much of ASAP!
• Key Ideas:
– Meet the tool makers and researchers (R, Python, …)
– Make them workflow/dataflow thinkers …
– … but giving them workflow benefits (ASAP!)
– … via simple annotations!
18SPNHC'15 Kurator/P
SKOPE: Synthesized Knowledge Of Past Environments
SPNHC'15 Kurator/P 19
Bocinsky, Kohler et al. study rain-fed maize of Anasazi
– Four Corners; AD 600–1500. Climate change influenced Mesa Verde
Migrations; late 13th century AD. Uses network of tree-ring chronologies to
reconstruct a spatio-temporal climate field at a fairly high resolution (~800 m)
from AD 1–2000. Algorithm estimates joint information in tree-rings and a
climate signal to identify “best” tree-ring chronologies for climate
reconstructing.
K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed
maize agricultural niche in the US Southwest. Nature
Communications. doi:10.1038/ncomms6618
… implemented as an R Script
User Comments: YW Annotations
SPNHC'15 Kurator/P 20
@begin GO_Analysis
@in hgCutoff
@in …
@out BP_Summl_file
@out …
@end GO_Analysis
...
main
fetch_mask
input_mask_file
load_data
input_data_file standardize_with_mask
land_water_mask
NEE_data simple_diagnose
standardized_NEE_data result_NEE_pdf
Get 3 views for the price of 1!
SPNHC'15 Kurator/P 21
result_NEE_pdf
input_mask_file land_water_mask
fetch_mask
input_data_file NEE_data
load_data
standardized_NEE_data
standardize_with_mask
standardize_with_mask
simple_diagnose
fetch_mask land_water_mask
load_data NEE_data
standardize_with_mask standardized_NEE_data simple_diagnose result_NEE_pdf
input_mask_file
input_data_file
Process view
Data view
Combined view
Paleoclimate Reconstruction (EnviRecon.org)
SPNHC'15 Kurator/P 22
• … explained using YesWorkflow!
Kyle B., (computational) archaeologist:
"It took me about 20 minutes to comment. Less
than an hour to learn and YW-annotate, all-told."
The Road ahead …
• YesWorkflow:
– … finishing support for retrospective provenance
without using a runtime provenance recorder!
– Key insight: scientists already leave provenance “bread
crumbs” behind! (it’s not an accident!)
– Exploit that via annotations: URI-templates
• Kurator[/P]:
– How far can we go towards ASAP via YW?
SPNHC'15 Kurator/P 23
YesWorkflow.org
SPNHC'15 Kurator/P 24
YW-RECON: Prospective & Retrospective
Provenance … (almost) for free!
• YW annotations in
the script (R,
Python, Matlab)
are used to
recreate the
workflow view
from the script …
SPNHC'15 Kurator/P 25
YW-RECON: Prospective & Retrospective
Provenance … (almost) for free!
SPNHC'15 Kurator/P 26
• URI-templates link conceptual entities
to runtime provenance “left behind” by
the script author …
• … facilitating provenance reconstruction
Summary: Data Curation with
Scientific Workflow Systems
Scientific Workflows
• [+] Automation
• [+] Scalability
• [+] Abstraction
• [+] Provenance
• …
• [+/0] Easy to use
– [0] learning a new paradigm
• [-] Teaching resources: learning a new language!
• [-] Special expertise needed for deep changes
e.g. new Java actors, shims, …
SPNHC'15 Kurator/P 27
Kurator/P: Scripts + YesWorkflow ++
Scripts: [+] Automation, [0] Scalability, [-] Abstraction, [0/-] Provenance
Now: Scripts + YesWorkflow Annotations
• [+] Abstraction
– explain your methods to mere mortals
=> encourage (re-)use
• [+] Provenance:
– YesWorkflow (prospective and retrospective provenance)
• [+] Language independent (R, Matlab, Python, …)
• [+] Empower tool makers (script programmers): give them …
– … some immediate benefits (workflow views, retrospective provenance)
– … some long term benefits: think about your methods differently
=> dataflow programming => [+] Scalability
SPNHC'15 Kurator/P 28
Acknowledgments
• NSF-DBI #1356751
– Collaborative Research: ABI Development:
Kurator: A Provenance-enabled Workflow Platform
and Toolkit to Curate Biodiversity Data
SPNHC'15 Kurator/P 29
Additional Material
Date Validation
• Check:
– Collector’s life span
– .. vs. Date-Collected
• Possible outcomes:
– Valid
– Corrected
– Unable to validate
• Internal inconsistency
– Contradicting dates
• External inconsistency
– Lack of date data
31SPNHC'15 Kurator/P
… Logic Behind Each Step (cont’d)
• Scientific Name Validation
– Customer-dependent:
• Collection Managers:
– Nomenclature
• Researchers:
– Taxonomy (current names)
– Several Remote services
• IPNI, GNI, …
• …. <your logic here> …
32SPNHC'15 Kurator/P
Simplified Example Workflow
• Related Research (Tianhong Song, UC Davis)
– Analyze linear workflow “story”
– Use patterns to discover wf design issues
(e.g. use before update); then fix them
– Parallelize when possible
SPNHC'15 Kurator/P 33
• Allow easy
assembly of such
workflows
• For tool makers
• … and tool users
• … scalability …
Example Output …
SPNHC'15 Kurator/P 34
… close up …
SPNHC'15 Kurator/P 35
FilteredPush Curation Provenance
(Spreadsheet View)
SPNHC'15 Kurator/P 36
Agile Kurator Development
37SPNHC'15 Kurator/P
Related Research (Tianhong Song, UC Davis)
• Analyze linear workflow
“story”
• Use patterns to discover wf
design issues (e.g. use before
update); then fix them
• Parallelize when possible
38SPNHC'15 Kurator/P
Contact me!
• If you’re interested in a project, research theme
(or similar ones): Send me email!
– Email: ludaesch@illinois.edu
SPNHC'15 Kurator/P 39

Mais conteúdo relacionado

Semelhante a Kurator: Towards Data Curation Workflows for Mere Mortals

Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...Anubhav Jain
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Databricks
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with SparkKrishna Sankar
 
Software tools to facilitate materials science research
Software tools to facilitate materials science researchSoftware tools to facilitate materials science research
Software tools to facilitate materials science researchAnubhav Jain
 
Scilab: Computing Tool For Engineers
Scilab: Computing Tool For EngineersScilab: Computing Tool For Engineers
Scilab: Computing Tool For EngineersNaren P.R.
 
GSLIS Research Showcase Presentation (Expanded)
GSLIS Research Showcase Presentation (Expanded)GSLIS Research Showcase Presentation (Expanded)
GSLIS Research Showcase Presentation (Expanded)Bertram Ludäscher
 
Cape2013 scilab-workshop-19Oct13
Cape2013 scilab-workshop-19Oct13Cape2013 scilab-workshop-19Oct13
Cape2013 scilab-workshop-19Oct13Naren P.R.
 
Automating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAutomating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAnubhav Jain
 
Personal Research Overview presented at the KU-NAIST Research Meeting
Personal Research Overview presented at the KU-NAIST Research MeetingPersonal Research Overview presented at the KU-NAIST Research Meeting
Personal Research Overview presented at the KU-NAIST Research MeetingChawanat Nakasan
 
HPC Resource Accounting
HPC Resource AccountingHPC Resource Accounting
HPC Resource AccountingKen Schumacher
 
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)Spark Summit
 
Intel realtime analytics_spark
Intel realtime analytics_sparkIntel realtime analytics_spark
Intel realtime analytics_sparkGeetanjali G
 
Bridging Big Data and Data Science Using Scalable Workflows
Bridging Big Data and Data Science Using Scalable WorkflowsBridging Big Data and Data Science Using Scalable Workflows
Bridging Big Data and Data Science Using Scalable WorkflowsIlkay Altintas, Ph.D.
 
On the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream ProcessingOn the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream ProcessingPlanetData Network of Excellence
 
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...Oscar Corcho
 
Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...Anubhav Jain
 
Open-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsOpen-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsAnubhav Jain
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...Databricks
 

Semelhante a Kurator: Towards Data Curation Workflows for Mere Mortals (20)

Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with Spark
 
week15a.pdf
week15a.pdfweek15a.pdf
week15a.pdf
 
Software tools to facilitate materials science research
Software tools to facilitate materials science researchSoftware tools to facilitate materials science research
Software tools to facilitate materials science research
 
Scilab: Computing Tool For Engineers
Scilab: Computing Tool For EngineersScilab: Computing Tool For Engineers
Scilab: Computing Tool For Engineers
 
GSLIS Research Showcase Presentation (Expanded)
GSLIS Research Showcase Presentation (Expanded)GSLIS Research Showcase Presentation (Expanded)
GSLIS Research Showcase Presentation (Expanded)
 
Cape2013 scilab-workshop-19Oct13
Cape2013 scilab-workshop-19Oct13Cape2013 scilab-workshop-19Oct13
Cape2013 scilab-workshop-19Oct13
 
Automating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAutomating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomate
 
Personal Research Overview presented at the KU-NAIST Research Meeting
Personal Research Overview presented at the KU-NAIST Research MeetingPersonal Research Overview presented at the KU-NAIST Research Meeting
Personal Research Overview presented at the KU-NAIST Research Meeting
 
Dancing with Stream Processing
Dancing with Stream ProcessingDancing with Stream Processing
Dancing with Stream Processing
 
HPC Resource Accounting
HPC Resource AccountingHPC Resource Accounting
HPC Resource Accounting
 
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
 
Intel realtime analytics_spark
Intel realtime analytics_sparkIntel realtime analytics_spark
Intel realtime analytics_spark
 
Bridging Big Data and Data Science Using Scalable Workflows
Bridging Big Data and Data Science Using Scalable WorkflowsBridging Big Data and Data Science Using Scalable Workflows
Bridging Big Data and Data Science Using Scalable Workflows
 
On the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream ProcessingOn the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream Processing
 
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
 
Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...
 
Open-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsOpen-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data sets
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 

Mais de Bertram Ludäscher

Games, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family ReunionGames, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family ReunionBertram Ludäscher
 
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!Bertram Ludäscher
 
[Flashback] Integration of Active and Deductive Database Rules
[Flashback] Integration of Active and Deductive Database Rules[Flashback] Integration of Active and Deductive Database Rules
[Flashback] Integration of Active and Deductive Database RulesBertram Ludäscher
 
[Flashback] Statelog: Integration of Active & Deductive Database Rules
[Flashback] Statelog: Integration of Active & Deductive Database Rules[Flashback] Statelog: Integration of Active & Deductive Database Rules
[Flashback] Statelog: Integration of Active & Deductive Database RulesBertram Ludäscher
 
Answering More Questions with Provenance and Query Patterns
Answering More Questions with Provenance and Query PatternsAnswering More Questions with Provenance and Query Patterns
Answering More Questions with Provenance and Query PatternsBertram Ludäscher
 
Computational Reproducibility vs. Transparency: Is It FAIR Enough?
Computational Reproducibility vs. Transparency: Is It FAIR Enough?Computational Reproducibility vs. Transparency: Is It FAIR Enough?
Computational Reproducibility vs. Transparency: Is It FAIR Enough?Bertram Ludäscher
 
Which Model Does Not Belong: A Dialogue
Which Model Does Not Belong: A DialogueWhich Model Does Not Belong: A Dialogue
Which Model Does Not Belong: A DialogueBertram Ludäscher
 
From Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesFrom Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesBertram Ludäscher
 
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsPossible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsBertram Ludäscher
 
Deduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
Deduktive Datenbanken & Logische Programme: Eine kleine ZeitreiseDeduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
Deduktive Datenbanken & Logische Programme: Eine kleine ZeitreiseBertram Ludäscher
 
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...Bertram Ludäscher
 
Dissecting Reproducibility: A case study with ecological niche models in th...
Dissecting Reproducibility:  A case study with ecological niche models  in th...Dissecting Reproducibility:  A case study with ecological niche models  in th...
Dissecting Reproducibility: A case study with ecological niche models in th...Bertram Ludäscher
 
Incremental Recomputation: Those who cannot remember the past are condemned ...
Incremental Recomputation:  Those who cannot remember the past are condemned ...Incremental Recomputation:  Those who cannot remember the past are condemned ...
Incremental Recomputation: Those who cannot remember the past are condemned ...Bertram Ludäscher
 
Validation and Inference of Schema-Level Workflow Data-Dependency Annotations
Validation and Inference of Schema-Level Workflow Data-Dependency AnnotationsValidation and Inference of Schema-Level Workflow Data-Dependency Annotations
Validation and Inference of Schema-Level Workflow Data-Dependency AnnotationsBertram Ludäscher
 
An ontology-driven framework for data transformation in scientific workflows
An ontology-driven framework for data transformation in scientific workflowsAn ontology-driven framework for data transformation in scientific workflows
An ontology-driven framework for data transformation in scientific workflowsBertram Ludäscher
 
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses ApproachKnowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses ApproachBertram Ludäscher
 
Whole-Tale: The Experience of Research
Whole-Tale: The Experience of ResearchWhole-Tale: The Experience of Research
Whole-Tale: The Experience of ResearchBertram Ludäscher
 
ETC & Authors in the Driver's Seat
ETC & Authors in the Driver's SeatETC & Authors in the Driver's Seat
ETC & Authors in the Driver's SeatBertram Ludäscher
 
From Provenance Standards and Tools to Queries and Actionable Provenance
From Provenance Standards and Tools to Queries and Actionable ProvenanceFrom Provenance Standards and Tools to Queries and Actionable Provenance
From Provenance Standards and Tools to Queries and Actionable ProvenanceBertram Ludäscher
 
Wild Ideas at TDWG'17: Embrace multiple possible worlds; abandon techno-ligion
Wild Ideas at TDWG'17: Embrace multiple possible worlds; abandon techno-ligionWild Ideas at TDWG'17: Embrace multiple possible worlds; abandon techno-ligion
Wild Ideas at TDWG'17: Embrace multiple possible worlds; abandon techno-ligionBertram Ludäscher
 

Mais de Bertram Ludäscher (20)

Games, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family ReunionGames, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion
 
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
 
[Flashback] Integration of Active and Deductive Database Rules
[Flashback] Integration of Active and Deductive Database Rules[Flashback] Integration of Active and Deductive Database Rules
[Flashback] Integration of Active and Deductive Database Rules
 
[Flashback] Statelog: Integration of Active & Deductive Database Rules
[Flashback] Statelog: Integration of Active & Deductive Database Rules[Flashback] Statelog: Integration of Active & Deductive Database Rules
[Flashback] Statelog: Integration of Active & Deductive Database Rules
 
Answering More Questions with Provenance and Query Patterns
Answering More Questions with Provenance and Query PatternsAnswering More Questions with Provenance and Query Patterns
Answering More Questions with Provenance and Query Patterns
 
Computational Reproducibility vs. Transparency: Is It FAIR Enough?
Computational Reproducibility vs. Transparency: Is It FAIR Enough?Computational Reproducibility vs. Transparency: Is It FAIR Enough?
Computational Reproducibility vs. Transparency: Is It FAIR Enough?
 
Which Model Does Not Belong: A Dialogue
Which Model Does Not Belong: A DialogueWhich Model Does Not Belong: A Dialogue
Which Model Does Not Belong: A Dialogue
 
From Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesFrom Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science Tales
 
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsPossible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
 
Deduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
Deduktive Datenbanken & Logische Programme: Eine kleine ZeitreiseDeduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
Deduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
 
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
 
Dissecting Reproducibility: A case study with ecological niche models in th...
Dissecting Reproducibility:  A case study with ecological niche models  in th...Dissecting Reproducibility:  A case study with ecological niche models  in th...
Dissecting Reproducibility: A case study with ecological niche models in th...
 
Incremental Recomputation: Those who cannot remember the past are condemned ...
Incremental Recomputation:  Those who cannot remember the past are condemned ...Incremental Recomputation:  Those who cannot remember the past are condemned ...
Incremental Recomputation: Those who cannot remember the past are condemned ...
 
Validation and Inference of Schema-Level Workflow Data-Dependency Annotations
Validation and Inference of Schema-Level Workflow Data-Dependency AnnotationsValidation and Inference of Schema-Level Workflow Data-Dependency Annotations
Validation and Inference of Schema-Level Workflow Data-Dependency Annotations
 
An ontology-driven framework for data transformation in scientific workflows
An ontology-driven framework for data transformation in scientific workflowsAn ontology-driven framework for data transformation in scientific workflows
An ontology-driven framework for data transformation in scientific workflows
 
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses ApproachKnowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
 
Whole-Tale: The Experience of Research
Whole-Tale: The Experience of ResearchWhole-Tale: The Experience of Research
Whole-Tale: The Experience of Research
 
ETC & Authors in the Driver's Seat
ETC & Authors in the Driver's SeatETC & Authors in the Driver's Seat
ETC & Authors in the Driver's Seat
 
From Provenance Standards and Tools to Queries and Actionable Provenance
From Provenance Standards and Tools to Queries and Actionable ProvenanceFrom Provenance Standards and Tools to Queries and Actionable Provenance
From Provenance Standards and Tools to Queries and Actionable Provenance
 
Wild Ideas at TDWG'17: Embrace multiple possible worlds; abandon techno-ligion
Wild Ideas at TDWG'17: Embrace multiple possible worlds; abandon techno-ligionWild Ideas at TDWG'17: Embrace multiple possible worlds; abandon techno-ligion
Wild Ideas at TDWG'17: Embrace multiple possible worlds; abandon techno-ligion
 

Último

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 

Último (20)

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 

Kurator: Towards Data Curation Workflows for Mere Mortals

  • 1. Kurator: Towards Data Curation Workflows for Mere Mortals An extensible, open-source workflow platform for users & makers of data curation tools B. Ludäscher J. Hanken D. Lowery J.A. Macklin T. McPhillips P.J. Morris R.A. Morris T. Song
  • 2. Problem: Data & Metadata Quality • Collections & occurrence data can be all over the map • Examples: – Lat/Long transposition, other geo- ref issues (projections, …) – Scientific Names (spelling errors, other) – Data entry/creation, “fuzzy” data, naming issues, bit rot, data conversions and transformations, schema mappings, … • Related: – Filtered-Push SPNHC'15 Kurator/P 2
  • 3. What problems are we trying to solve? • Detect and flag data quality issues • Repair if possible – … ask human curators as needed • Keep track of provenance – (semi-)automatic repairs – human curators’ edits • Employ workflow (semi-)automation – Scientific workflow systems: • Kepler/COMAD, Restflow, Galaxy, Biovel/Taverna, Argo, VisTrails, … – Related technologies • Akka parallel execution platform • Script-based automation (e.g. Python, R), digital notebooks (iPython) 3SPNHC'15 Kurator/P
  • 4. Customers of Curation Workflows • Collection Managers – … who are managing the collections databases – Can run curation workflows periodically • … in the presence of new data and/or new curation services • (Biodiversity) Researchers – To perform an analysis in the presence of (partially) dirty data, researchers need to • Clean or fix dirty data • Throw out unfixable data – Reporting back to the collection managers (cf. FPush) 4SPNHC'15 Kurator/P
  • 5. How we do it (Part 1 …): Kepler curation workflows • Why workflows? ASAP! SPNHC'15 Kurator/P 5 Dou, Lei., G. Cao, P.J. Morris, R.A. Morris, B. Ludäscher, J.A. Macklin, J. Hanken. 2012. Kurator: A Kepler Package for Data Curation Workflows, Procedia Computer Science, 9:1614-1619, doi:10.1016/j.procs.2012.04.177
  • 6. Scientific Workflows: ASAP! • Automation – wfs to automate computational aspects of science • Scaling (exploit and optimize machine cycles) – wfs should make use of parallel compute resources – wfs should be able handle large data • Abstraction, Evolution, Reuse (human cycles) – wfs should be easy to (re-)use, evolve, share • Provenance – wfs should capture processing history, data lineage  traceable data- and wf-evolution  Reproducible Science Trident Workbench VisTrails SPNHC'15 Kurator/P 6
  • 7. Scientific workflows: a(nother) silver bullet? Beware of the Turing tar-pit in which everything is possible but nothing of interest is easy. —Alan Perlis SPNHC'15 Kurator/P 7
  • 8. I beg your pardon, I never promised .. “Thanks to our Graphical UI your scientific workflows will be much easier to develop, understand and maintain!” Hmm… this was supposed to be easier than programming! SPNHC'15 Kurator/P 8
  • 9. Scientific Workflows … Cabellos et al. Computer Physics Communications 182, 2011 SPNHC'15 Kurator/P 9
  • 10. … are a wonderful thing … SPNHC'15 Kurator/P 10 Norbert Podhorszki (then: UC Davis)
  • 11. … after simplifying a bit (here: Kepler/COMAD) SPNHC'15 Kurator/P 11 Sven Köhler (then: UC Davis)
  • 12. So many systems (models of computation ~ languages, … ) … so little time … SPNHC'15 Kurator/P 12 Sven Köhler (then: UC Davis)
  • 13. SPNHC'15 Kurator/P 13 Workflow Systems: Learning to program, all over again …
  • 14. Scientific Workflow: it’s called R&D for a reason … 14 Workflow Modeling & Design (prospective provenance) Runtime Provenance (traces, retrospective provenance) Fault-tolerance crash recovery Scalability parallel processing SPNHC'15 Kurator/P
  • 15. Meanwhile, on a nearby planet … Highly dynamic visualization (so dynamic, it’s hard to capture) SPNHC'15 Kurator/P 15
  • 16. It’s time to Shift Control … • … back from being consumers of tools “Just click here!” • ... to tool makers! • Kurator/P: – Yes, develop for end users … – … but don’t forget the tool makers! • Can we do this together? SPNHC'15 Kurator/P 16
  • 17. How we do it (Part 2 of … ) • Kurator – Apply workflow technologies and workflow thinking – … in a technology agnostic way (if possible)  Build a library of curation services such that curation workflows can be run from various platforms – Scientific workflow systems • e.g. Restflow, Kepler, Taverna, Galaxy – Other platforms • e.g. Akka, Python-based, … • … leveraging existing technologies 17SPNHC'15 Kurator/P
  • 18. How we do it (Part 3 of … ) • YesWorkflow – Grass-roots effort (goes well with stone-soup…) – Scripts + User annotations  Can give us much of ASAP! • Key Ideas: – Meet the tool makers and researchers (R, Python, …) – Make them workflow/dataflow thinkers … – … but giving them workflow benefits (ASAP!) – … via simple annotations! 18SPNHC'15 Kurator/P
  • 19. SKOPE: Synthesized Knowledge Of Past Environments SPNHC'15 Kurator/P 19 Bocinsky, Kohler et al. study rain-fed maize of Anasazi – Four Corners; AD 600–1500. Climate change influenced Mesa Verde Migrations; late 13th century AD. Uses network of tree-ring chronologies to reconstruct a spatio-temporal climate field at a fairly high resolution (~800 m) from AD 1–2000. Algorithm estimates joint information in tree-rings and a climate signal to identify “best” tree-ring chronologies for climate reconstructing. K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618 … implemented as an R Script
  • 20. User Comments: YW Annotations SPNHC'15 Kurator/P 20 @begin GO_Analysis @in hgCutoff @in … @out BP_Summl_file @out … @end GO_Analysis ...
  • 21. main fetch_mask input_mask_file load_data input_data_file standardize_with_mask land_water_mask NEE_data simple_diagnose standardized_NEE_data result_NEE_pdf Get 3 views for the price of 1! SPNHC'15 Kurator/P 21 result_NEE_pdf input_mask_file land_water_mask fetch_mask input_data_file NEE_data load_data standardized_NEE_data standardize_with_mask standardize_with_mask simple_diagnose fetch_mask land_water_mask load_data NEE_data standardize_with_mask standardized_NEE_data simple_diagnose result_NEE_pdf input_mask_file input_data_file Process view Data view Combined view
  • 22. Paleoclimate Reconstruction (EnviRecon.org) SPNHC'15 Kurator/P 22 • … explained using YesWorkflow! Kyle B., (computational) archaeologist: "It took me about 20 minutes to comment. Less than an hour to learn and YW-annotate, all-told."
  • 23. The Road ahead … • YesWorkflow: – … finishing support for retrospective provenance without using a runtime provenance recorder! – Key insight: scientists already leave provenance “bread crumbs” behind! (it’s not an accident!) – Exploit that via annotations: URI-templates • Kurator[/P]: – How far can we go towards ASAP via YW? SPNHC'15 Kurator/P 23
  • 25. YW-RECON: Prospective & Retrospective Provenance … (almost) for free! • YW annotations in the script (R, Python, Matlab) are used to recreate the workflow view from the script … SPNHC'15 Kurator/P 25
  • 26. YW-RECON: Prospective & Retrospective Provenance … (almost) for free! SPNHC'15 Kurator/P 26 • URI-templates link conceptual entities to runtime provenance “left behind” by the script author … • … facilitating provenance reconstruction
  • 27. Summary: Data Curation with Scientific Workflow Systems Scientific Workflows • [+] Automation • [+] Scalability • [+] Abstraction • [+] Provenance • … • [+/0] Easy to use – [0] learning a new paradigm • [-] Teaching resources: learning a new language! • [-] Special expertise needed for deep changes e.g. new Java actors, shims, … SPNHC'15 Kurator/P 27
  • 28. Kurator/P: Scripts + YesWorkflow ++ Scripts: [+] Automation, [0] Scalability, [-] Abstraction, [0/-] Provenance Now: Scripts + YesWorkflow Annotations • [+] Abstraction – explain your methods to mere mortals => encourage (re-)use • [+] Provenance: – YesWorkflow (prospective and retrospective provenance) • [+] Language independent (R, Matlab, Python, …) • [+] Empower tool makers (script programmers): give them … – … some immediate benefits (workflow views, retrospective provenance) – … some long term benefits: think about your methods differently => dataflow programming => [+] Scalability SPNHC'15 Kurator/P 28
  • 29. Acknowledgments • NSF-DBI #1356751 – Collaborative Research: ABI Development: Kurator: A Provenance-enabled Workflow Platform and Toolkit to Curate Biodiversity Data SPNHC'15 Kurator/P 29
  • 31. Date Validation • Check: – Collector’s life span – .. vs. Date-Collected • Possible outcomes: – Valid – Corrected – Unable to validate • Internal inconsistency – Contradicting dates • External inconsistency – Lack of date data 31SPNHC'15 Kurator/P
  • 32. … Logic Behind Each Step (cont’d) • Scientific Name Validation – Customer-dependent: • Collection Managers: – Nomenclature • Researchers: – Taxonomy (current names) – Several Remote services • IPNI, GNI, … • …. <your logic here> … 32SPNHC'15 Kurator/P
  • 33. Simplified Example Workflow • Related Research (Tianhong Song, UC Davis) – Analyze linear workflow “story” – Use patterns to discover wf design issues (e.g. use before update); then fix them – Parallelize when possible SPNHC'15 Kurator/P 33 • Allow easy assembly of such workflows • For tool makers • … and tool users • … scalability …
  • 35. … close up … SPNHC'15 Kurator/P 35
  • 36. FilteredPush Curation Provenance (Spreadsheet View) SPNHC'15 Kurator/P 36
  • 38. Related Research (Tianhong Song, UC Davis) • Analyze linear workflow “story” • Use patterns to discover wf design issues (e.g. use before update); then fix them • Parallelize when possible 38SPNHC'15 Kurator/P
  • 39. Contact me! • If you’re interested in a project, research theme (or similar ones): Send me email! – Email: ludaesch@illinois.edu SPNHC'15 Kurator/P 39