SlideShare uma empresa Scribd logo
1 de 39
Kurator: Towards Data Curation
Workflows for Mere Mortals
An extensible, open-source workflow platform
for users & makers of data curation tools
B. Ludäscher J. Hanken D. Lowery J.A. Macklin
T. McPhillips P.J. Morris R.A. Morris T. Song
Problem: Data & Metadata Quality
• Collections & occurrence data can
be all over the map
• Examples:
– Lat/Long transposition, other geo-
ref issues (projections, …)
– Scientific Names (spelling errors,
other)
– Data entry/creation, “fuzzy” data,
naming issues, bit rot, data
conversions and transformations,
schema mappings, …
• Related:
– Filtered-Push
SPNHC'15 Kurator/P 2
What problems are we trying to solve?
• Detect and flag data quality issues
• Repair if possible
– … ask human curators as needed
• Keep track of provenance
– (semi-)automatic repairs
– human curators’ edits
• Employ workflow (semi-)automation
– Scientific workflow systems:
• Kepler/COMAD, Restflow, Galaxy, Biovel/Taverna, Argo, VisTrails, …
– Related technologies
• Akka parallel execution platform
• Script-based automation (e.g. Python, R), digital notebooks (iPython)
3SPNHC'15 Kurator/P
Customers of Curation Workflows
• Collection Managers
– … who are managing the collections databases
– Can run curation workflows periodically
• … in the presence of new data and/or new curation services
• (Biodiversity) Researchers
– To perform an analysis in the presence of (partially)
dirty data, researchers need to
• Clean or fix dirty data
• Throw out unfixable data
– Reporting back to the collection managers (cf. FPush)
4SPNHC'15 Kurator/P
How we do it (Part 1 …): Kepler curation workflows
• Why workflows? ASAP!
SPNHC'15 Kurator/P 5
Dou, Lei., G. Cao, P.J. Morris, R.A. Morris, B. Ludäscher, J.A. Macklin, J. Hanken. 2012. Kurator: A Kepler Package for
Data Curation Workflows, Procedia Computer Science, 9:1614-1619, doi:10.1016/j.procs.2012.04.177
Scientific Workflows: ASAP!
• Automation
– wfs to automate computational aspects of science
• Scaling (exploit and optimize machine cycles)
– wfs should make use of parallel compute resources
– wfs should be able handle large data
• Abstraction, Evolution, Reuse (human cycles)
– wfs should be easy to (re-)use, evolve, share
• Provenance
– wfs should capture processing history, data lineage
 traceable data- and wf-evolution
 Reproducible Science Trident
Workbench
VisTrails
SPNHC'15 Kurator/P 6
Scientific workflows: a(nother) silver bullet?
Beware of the Turing tar-pit in which everything is
possible but nothing of interest is easy.
—Alan Perlis
SPNHC'15 Kurator/P 7
I beg your pardon, I never promised
..
“Thanks to our Graphical UI your scientific workflows will be much
easier to develop, understand and maintain!”
Hmm… this was supposed to be easier than programming!
SPNHC'15 Kurator/P 8
Scientific Workflows …
Cabellos et al. Computer Physics Communications 182, 2011
SPNHC'15 Kurator/P 9
… are a wonderful thing …
SPNHC'15 Kurator/P 10
Norbert Podhorszki
(then: UC Davis)
… after simplifying a bit (here: Kepler/COMAD)
SPNHC'15 Kurator/P 11
Sven Köhler
(then: UC Davis)
So many systems (models of computation ~ languages, … )
… so little time …
SPNHC'15 Kurator/P 12
Sven Köhler
(then: UC Davis)
SPNHC'15 Kurator/P 13
Workflow Systems: Learning to program, all over again …
Scientific Workflow: it’s called R&D for a reason …
14
Workflow Modeling & Design
(prospective provenance)
Runtime Provenance
(traces,
retrospective
provenance)
Fault-tolerance
crash recovery
Scalability
parallel processing
SPNHC'15 Kurator/P
Meanwhile, on a nearby planet …
Highly dynamic visualization
(so dynamic, it’s hard to capture)
SPNHC'15 Kurator/P 15
It’s time to Shift Control …
• … back from being consumers of tools
“Just click here!”
• ... to tool makers!
• Kurator/P:
– Yes, develop for end users …
– … but don’t forget the tool makers!
• Can we do this together?
SPNHC'15 Kurator/P 16
How we do it (Part 2 of … )
• Kurator
– Apply workflow technologies and workflow thinking
– … in a technology agnostic way (if possible)
 Build a library of curation services such that curation
workflows can be run from various platforms
– Scientific workflow systems
• e.g. Restflow, Kepler, Taverna, Galaxy
– Other platforms
• e.g. Akka, Python-based, …
• … leveraging existing technologies
17SPNHC'15 Kurator/P
How we do it (Part 3 of … )
• YesWorkflow
– Grass-roots effort (goes well with stone-soup…)
– Scripts + User annotations
 Can give us much of ASAP!
• Key Ideas:
– Meet the tool makers and researchers (R, Python, …)
– Make them workflow/dataflow thinkers …
– … but giving them workflow benefits (ASAP!)
– … via simple annotations!
18SPNHC'15 Kurator/P
SKOPE: Synthesized Knowledge Of Past Environments
SPNHC'15 Kurator/P 19
Bocinsky, Kohler et al. study rain-fed maize of Anasazi
– Four Corners; AD 600–1500. Climate change influenced Mesa Verde
Migrations; late 13th century AD. Uses network of tree-ring chronologies to
reconstruct a spatio-temporal climate field at a fairly high resolution (~800 m)
from AD 1–2000. Algorithm estimates joint information in tree-rings and a
climate signal to identify “best” tree-ring chronologies for climate
reconstructing.
K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed
maize agricultural niche in the US Southwest. Nature
Communications. doi:10.1038/ncomms6618
… implemented as an R Script
User Comments: YW Annotations
SPNHC'15 Kurator/P 20
@begin GO_Analysis
@in hgCutoff
@in …
@out BP_Summl_file
@out …
@end GO_Analysis
...
main
fetch_mask
input_mask_file
load_data
input_data_file standardize_with_mask
land_water_mask
NEE_data simple_diagnose
standardized_NEE_data result_NEE_pdf
Get 3 views for the price of 1!
SPNHC'15 Kurator/P 21
result_NEE_pdf
input_mask_file land_water_mask
fetch_mask
input_data_file NEE_data
load_data
standardized_NEE_data
standardize_with_mask
standardize_with_mask
simple_diagnose
fetch_mask land_water_mask
load_data NEE_data
standardize_with_mask standardized_NEE_data simple_diagnose result_NEE_pdf
input_mask_file
input_data_file
Process view
Data view
Combined view
Paleoclimate Reconstruction (EnviRecon.org)
SPNHC'15 Kurator/P 22
• … explained using YesWorkflow!
Kyle B., (computational) archaeologist:
"It took me about 20 minutes to comment. Less
than an hour to learn and YW-annotate, all-told."
The Road ahead …
• YesWorkflow:
– … finishing support for retrospective provenance
without using a runtime provenance recorder!
– Key insight: scientists already leave provenance “bread
crumbs” behind! (it’s not an accident!)
– Exploit that via annotations: URI-templates
• Kurator[/P]:
– How far can we go towards ASAP via YW?
SPNHC'15 Kurator/P 23
YesWorkflow.org
SPNHC'15 Kurator/P 24
YW-RECON: Prospective & Retrospective
Provenance … (almost) for free!
• YW annotations in
the script (R,
Python, Matlab)
are used to
recreate the
workflow view
from the script …
SPNHC'15 Kurator/P 25
YW-RECON: Prospective & Retrospective
Provenance … (almost) for free!
SPNHC'15 Kurator/P 26
• URI-templates link conceptual entities
to runtime provenance “left behind” by
the script author …
• … facilitating provenance reconstruction
Summary: Data Curation with
Scientific Workflow Systems
Scientific Workflows
• [+] Automation
• [+] Scalability
• [+] Abstraction
• [+] Provenance
• …
• [+/0] Easy to use
– [0] learning a new paradigm
• [-] Teaching resources: learning a new language!
• [-] Special expertise needed for deep changes
e.g. new Java actors, shims, …
SPNHC'15 Kurator/P 27
Kurator/P: Scripts + YesWorkflow ++
Scripts: [+] Automation, [0] Scalability, [-] Abstraction, [0/-] Provenance
Now: Scripts + YesWorkflow Annotations
• [+] Abstraction
– explain your methods to mere mortals
=> encourage (re-)use
• [+] Provenance:
– YesWorkflow (prospective and retrospective provenance)
• [+] Language independent (R, Matlab, Python, …)
• [+] Empower tool makers (script programmers): give them …
– … some immediate benefits (workflow views, retrospective provenance)
– … some long term benefits: think about your methods differently
=> dataflow programming => [+] Scalability
SPNHC'15 Kurator/P 28
Acknowledgments
• NSF-DBI #1356751
– Collaborative Research: ABI Development:
Kurator: A Provenance-enabled Workflow Platform
and Toolkit to Curate Biodiversity Data
SPNHC'15 Kurator/P 29
Additional Material
Date Validation
• Check:
– Collector’s life span
– .. vs. Date-Collected
• Possible outcomes:
– Valid
– Corrected
– Unable to validate
• Internal inconsistency
– Contradicting dates
• External inconsistency
– Lack of date data
31SPNHC'15 Kurator/P
… Logic Behind Each Step (cont’d)
• Scientific Name Validation
– Customer-dependent:
• Collection Managers:
– Nomenclature
• Researchers:
– Taxonomy (current names)
– Several Remote services
• IPNI, GNI, …
• …. <your logic here> …
32SPNHC'15 Kurator/P
Simplified Example Workflow
• Related Research (Tianhong Song, UC Davis)
– Analyze linear workflow “story”
– Use patterns to discover wf design issues
(e.g. use before update); then fix them
– Parallelize when possible
SPNHC'15 Kurator/P 33
• Allow easy
assembly of such
workflows
• For tool makers
• … and tool users
• … scalability …
Example Output …
SPNHC'15 Kurator/P 34
… close up …
SPNHC'15 Kurator/P 35
FilteredPush Curation Provenance
(Spreadsheet View)
SPNHC'15 Kurator/P 36
Agile Kurator Development
37SPNHC'15 Kurator/P
Related Research (Tianhong Song, UC Davis)
• Analyze linear workflow
“story”
• Use patterns to discover wf
design issues (e.g. use before
update); then fix them
• Parallelize when possible
38SPNHC'15 Kurator/P
Contact me!
• If you’re interested in a project, research theme
(or similar ones): Send me email!
– Email: ludaesch@illinois.edu
SPNHC'15 Kurator/P 39

Mais conteúdo relacionado

Semelhante a Kurator: Towards Data Curation for Mere Mortals

Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...Anubhav Jain
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Databricks
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with SparkKrishna Sankar
 
Software tools to facilitate materials science research
Software tools to facilitate materials science researchSoftware tools to facilitate materials science research
Software tools to facilitate materials science researchAnubhav Jain
 
Scilab: Computing Tool For Engineers
Scilab: Computing Tool For EngineersScilab: Computing Tool For Engineers
Scilab: Computing Tool For EngineersNaren P.R.
 
GSLIS Research Showcase Presentation (Expanded)
GSLIS Research Showcase Presentation (Expanded)GSLIS Research Showcase Presentation (Expanded)
GSLIS Research Showcase Presentation (Expanded)Bertram Ludäscher
 
Cape2013 scilab-workshop-19Oct13
Cape2013 scilab-workshop-19Oct13Cape2013 scilab-workshop-19Oct13
Cape2013 scilab-workshop-19Oct13Naren P.R.
 
Automating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAutomating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAnubhav Jain
 
Personal Research Overview presented at the KU-NAIST Research Meeting
Personal Research Overview presented at the KU-NAIST Research MeetingPersonal Research Overview presented at the KU-NAIST Research Meeting
Personal Research Overview presented at the KU-NAIST Research MeetingChawanat Nakasan
 
HPC Resource Accounting
HPC Resource AccountingHPC Resource Accounting
HPC Resource AccountingKen Schumacher
 
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)Spark Summit
 
Intel realtime analytics_spark
Intel realtime analytics_sparkIntel realtime analytics_spark
Intel realtime analytics_sparkGeetanjali G
 
Bridging Big Data and Data Science Using Scalable Workflows
Bridging Big Data and Data Science Using Scalable WorkflowsBridging Big Data and Data Science Using Scalable Workflows
Bridging Big Data and Data Science Using Scalable WorkflowsIlkay Altintas, Ph.D.
 
On the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream ProcessingOn the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream ProcessingPlanetData Network of Excellence
 
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...Oscar Corcho
 
Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...Anubhav Jain
 
Open-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsOpen-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsAnubhav Jain
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...Databricks
 

Semelhante a Kurator: Towards Data Curation for Mere Mortals (20)

Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with Spark
 
week15a.pdf
week15a.pdfweek15a.pdf
week15a.pdf
 
Software tools to facilitate materials science research
Software tools to facilitate materials science researchSoftware tools to facilitate materials science research
Software tools to facilitate materials science research
 
Scilab: Computing Tool For Engineers
Scilab: Computing Tool For EngineersScilab: Computing Tool For Engineers
Scilab: Computing Tool For Engineers
 
GSLIS Research Showcase Presentation (Expanded)
GSLIS Research Showcase Presentation (Expanded)GSLIS Research Showcase Presentation (Expanded)
GSLIS Research Showcase Presentation (Expanded)
 
Cape2013 scilab-workshop-19Oct13
Cape2013 scilab-workshop-19Oct13Cape2013 scilab-workshop-19Oct13
Cape2013 scilab-workshop-19Oct13
 
Automating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAutomating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomate
 
Personal Research Overview presented at the KU-NAIST Research Meeting
Personal Research Overview presented at the KU-NAIST Research MeetingPersonal Research Overview presented at the KU-NAIST Research Meeting
Personal Research Overview presented at the KU-NAIST Research Meeting
 
Dancing with Stream Processing
Dancing with Stream ProcessingDancing with Stream Processing
Dancing with Stream Processing
 
HPC Resource Accounting
HPC Resource AccountingHPC Resource Accounting
HPC Resource Accounting
 
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
 
Intel realtime analytics_spark
Intel realtime analytics_sparkIntel realtime analytics_spark
Intel realtime analytics_spark
 
Bridging Big Data and Data Science Using Scalable Workflows
Bridging Big Data and Data Science Using Scalable WorkflowsBridging Big Data and Data Science Using Scalable Workflows
Bridging Big Data and Data Science Using Scalable Workflows
 
On the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream ProcessingOn the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream Processing
 
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
OrdRing 2013 keynote - On the need for a W3C community group on RDF Stream Pr...
 
Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...
 
Open-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsOpen-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data sets
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 

Mais de Bertram Ludäscher

Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...Bertram Ludäscher
 
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family ReunionGames, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family ReunionBertram Ludäscher
 
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!Bertram Ludäscher
 
[Flashback] Integration of Active and Deductive Database Rules
[Flashback] Integration of Active and Deductive Database Rules[Flashback] Integration of Active and Deductive Database Rules
[Flashback] Integration of Active and Deductive Database RulesBertram Ludäscher
 
[Flashback] Statelog: Integration of Active & Deductive Database Rules
[Flashback] Statelog: Integration of Active & Deductive Database Rules[Flashback] Statelog: Integration of Active & Deductive Database Rules
[Flashback] Statelog: Integration of Active & Deductive Database RulesBertram Ludäscher
 
Answering More Questions with Provenance and Query Patterns
Answering More Questions with Provenance and Query PatternsAnswering More Questions with Provenance and Query Patterns
Answering More Questions with Provenance and Query PatternsBertram Ludäscher
 
Computational Reproducibility vs. Transparency: Is It FAIR Enough?
Computational Reproducibility vs. Transparency: Is It FAIR Enough?Computational Reproducibility vs. Transparency: Is It FAIR Enough?
Computational Reproducibility vs. Transparency: Is It FAIR Enough?Bertram Ludäscher
 
Which Model Does Not Belong: A Dialogue
Which Model Does Not Belong: A DialogueWhich Model Does Not Belong: A Dialogue
Which Model Does Not Belong: A DialogueBertram Ludäscher
 
From Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesFrom Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesBertram Ludäscher
 
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsPossible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsBertram Ludäscher
 
Deduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
Deduktive Datenbanken & Logische Programme: Eine kleine ZeitreiseDeduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
Deduktive Datenbanken & Logische Programme: Eine kleine ZeitreiseBertram Ludäscher
 
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...Bertram Ludäscher
 
Dissecting Reproducibility: A case study with ecological niche models in th...
Dissecting Reproducibility:  A case study with ecological niche models  in th...Dissecting Reproducibility:  A case study with ecological niche models  in th...
Dissecting Reproducibility: A case study with ecological niche models in th...Bertram Ludäscher
 
Incremental Recomputation: Those who cannot remember the past are condemned ...
Incremental Recomputation:  Those who cannot remember the past are condemned ...Incremental Recomputation:  Those who cannot remember the past are condemned ...
Incremental Recomputation: Those who cannot remember the past are condemned ...Bertram Ludäscher
 
Validation and Inference of Schema-Level Workflow Data-Dependency Annotations
Validation and Inference of Schema-Level Workflow Data-Dependency AnnotationsValidation and Inference of Schema-Level Workflow Data-Dependency Annotations
Validation and Inference of Schema-Level Workflow Data-Dependency AnnotationsBertram Ludäscher
 
An ontology-driven framework for data transformation in scientific workflows
An ontology-driven framework for data transformation in scientific workflowsAn ontology-driven framework for data transformation in scientific workflows
An ontology-driven framework for data transformation in scientific workflowsBertram Ludäscher
 
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses ApproachKnowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses ApproachBertram Ludäscher
 
Whole-Tale: The Experience of Research
Whole-Tale: The Experience of ResearchWhole-Tale: The Experience of Research
Whole-Tale: The Experience of ResearchBertram Ludäscher
 
ETC & Authors in the Driver's Seat
ETC & Authors in the Driver's SeatETC & Authors in the Driver's Seat
ETC & Authors in the Driver's SeatBertram Ludäscher
 
From Provenance Standards and Tools to Queries and Actionable Provenance
From Provenance Standards and Tools to Queries and Actionable ProvenanceFrom Provenance Standards and Tools to Queries and Actionable Provenance
From Provenance Standards and Tools to Queries and Actionable ProvenanceBertram Ludäscher
 

Mais de Bertram Ludäscher (20)

Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family ReunionGames, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion
 
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
 
[Flashback] Integration of Active and Deductive Database Rules
[Flashback] Integration of Active and Deductive Database Rules[Flashback] Integration of Active and Deductive Database Rules
[Flashback] Integration of Active and Deductive Database Rules
 
[Flashback] Statelog: Integration of Active & Deductive Database Rules
[Flashback] Statelog: Integration of Active & Deductive Database Rules[Flashback] Statelog: Integration of Active & Deductive Database Rules
[Flashback] Statelog: Integration of Active & Deductive Database Rules
 
Answering More Questions with Provenance and Query Patterns
Answering More Questions with Provenance and Query PatternsAnswering More Questions with Provenance and Query Patterns
Answering More Questions with Provenance and Query Patterns
 
Computational Reproducibility vs. Transparency: Is It FAIR Enough?
Computational Reproducibility vs. Transparency: Is It FAIR Enough?Computational Reproducibility vs. Transparency: Is It FAIR Enough?
Computational Reproducibility vs. Transparency: Is It FAIR Enough?
 
Which Model Does Not Belong: A Dialogue
Which Model Does Not Belong: A DialogueWhich Model Does Not Belong: A Dialogue
Which Model Does Not Belong: A Dialogue
 
From Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesFrom Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science Tales
 
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsPossible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
 
Deduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
Deduktive Datenbanken & Logische Programme: Eine kleine ZeitreiseDeduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
Deduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
 
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
 
Dissecting Reproducibility: A case study with ecological niche models in th...
Dissecting Reproducibility:  A case study with ecological niche models  in th...Dissecting Reproducibility:  A case study with ecological niche models  in th...
Dissecting Reproducibility: A case study with ecological niche models in th...
 
Incremental Recomputation: Those who cannot remember the past are condemned ...
Incremental Recomputation:  Those who cannot remember the past are condemned ...Incremental Recomputation:  Those who cannot remember the past are condemned ...
Incremental Recomputation: Those who cannot remember the past are condemned ...
 
Validation and Inference of Schema-Level Workflow Data-Dependency Annotations
Validation and Inference of Schema-Level Workflow Data-Dependency AnnotationsValidation and Inference of Schema-Level Workflow Data-Dependency Annotations
Validation and Inference of Schema-Level Workflow Data-Dependency Annotations
 
An ontology-driven framework for data transformation in scientific workflows
An ontology-driven framework for data transformation in scientific workflowsAn ontology-driven framework for data transformation in scientific workflows
An ontology-driven framework for data transformation in scientific workflows
 
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses ApproachKnowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
 
Whole-Tale: The Experience of Research
Whole-Tale: The Experience of ResearchWhole-Tale: The Experience of Research
Whole-Tale: The Experience of Research
 
ETC & Authors in the Driver's Seat
ETC & Authors in the Driver's SeatETC & Authors in the Driver's Seat
ETC & Authors in the Driver's Seat
 
From Provenance Standards and Tools to Queries and Actionable Provenance
From Provenance Standards and Tools to Queries and Actionable ProvenanceFrom Provenance Standards and Tools to Queries and Actionable Provenance
From Provenance Standards and Tools to Queries and Actionable Provenance
 

Último

FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 

Último (20)

FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 

Kurator: Towards Data Curation for Mere Mortals

  • 1. Kurator: Towards Data Curation Workflows for Mere Mortals An extensible, open-source workflow platform for users & makers of data curation tools B. Ludäscher J. Hanken D. Lowery J.A. Macklin T. McPhillips P.J. Morris R.A. Morris T. Song
  • 2. Problem: Data & Metadata Quality • Collections & occurrence data can be all over the map • Examples: – Lat/Long transposition, other geo- ref issues (projections, …) – Scientific Names (spelling errors, other) – Data entry/creation, “fuzzy” data, naming issues, bit rot, data conversions and transformations, schema mappings, … • Related: – Filtered-Push SPNHC'15 Kurator/P 2
  • 3. What problems are we trying to solve? • Detect and flag data quality issues • Repair if possible – … ask human curators as needed • Keep track of provenance – (semi-)automatic repairs – human curators’ edits • Employ workflow (semi-)automation – Scientific workflow systems: • Kepler/COMAD, Restflow, Galaxy, Biovel/Taverna, Argo, VisTrails, … – Related technologies • Akka parallel execution platform • Script-based automation (e.g. Python, R), digital notebooks (iPython) 3SPNHC'15 Kurator/P
  • 4. Customers of Curation Workflows • Collection Managers – … who are managing the collections databases – Can run curation workflows periodically • … in the presence of new data and/or new curation services • (Biodiversity) Researchers – To perform an analysis in the presence of (partially) dirty data, researchers need to • Clean or fix dirty data • Throw out unfixable data – Reporting back to the collection managers (cf. FPush) 4SPNHC'15 Kurator/P
  • 5. How we do it (Part 1 …): Kepler curation workflows • Why workflows? ASAP! SPNHC'15 Kurator/P 5 Dou, Lei., G. Cao, P.J. Morris, R.A. Morris, B. Ludäscher, J.A. Macklin, J. Hanken. 2012. Kurator: A Kepler Package for Data Curation Workflows, Procedia Computer Science, 9:1614-1619, doi:10.1016/j.procs.2012.04.177
  • 6. Scientific Workflows: ASAP! • Automation – wfs to automate computational aspects of science • Scaling (exploit and optimize machine cycles) – wfs should make use of parallel compute resources – wfs should be able handle large data • Abstraction, Evolution, Reuse (human cycles) – wfs should be easy to (re-)use, evolve, share • Provenance – wfs should capture processing history, data lineage  traceable data- and wf-evolution  Reproducible Science Trident Workbench VisTrails SPNHC'15 Kurator/P 6
  • 7. Scientific workflows: a(nother) silver bullet? Beware of the Turing tar-pit in which everything is possible but nothing of interest is easy. —Alan Perlis SPNHC'15 Kurator/P 7
  • 8. I beg your pardon, I never promised .. “Thanks to our Graphical UI your scientific workflows will be much easier to develop, understand and maintain!” Hmm… this was supposed to be easier than programming! SPNHC'15 Kurator/P 8
  • 9. Scientific Workflows … Cabellos et al. Computer Physics Communications 182, 2011 SPNHC'15 Kurator/P 9
  • 10. … are a wonderful thing … SPNHC'15 Kurator/P 10 Norbert Podhorszki (then: UC Davis)
  • 11. … after simplifying a bit (here: Kepler/COMAD) SPNHC'15 Kurator/P 11 Sven Köhler (then: UC Davis)
  • 12. So many systems (models of computation ~ languages, … ) … so little time … SPNHC'15 Kurator/P 12 Sven Köhler (then: UC Davis)
  • 13. SPNHC'15 Kurator/P 13 Workflow Systems: Learning to program, all over again …
  • 14. Scientific Workflow: it’s called R&D for a reason … 14 Workflow Modeling & Design (prospective provenance) Runtime Provenance (traces, retrospective provenance) Fault-tolerance crash recovery Scalability parallel processing SPNHC'15 Kurator/P
  • 15. Meanwhile, on a nearby planet … Highly dynamic visualization (so dynamic, it’s hard to capture) SPNHC'15 Kurator/P 15
  • 16. It’s time to Shift Control … • … back from being consumers of tools “Just click here!” • ... to tool makers! • Kurator/P: – Yes, develop for end users … – … but don’t forget the tool makers! • Can we do this together? SPNHC'15 Kurator/P 16
  • 17. How we do it (Part 2 of … ) • Kurator – Apply workflow technologies and workflow thinking – … in a technology agnostic way (if possible)  Build a library of curation services such that curation workflows can be run from various platforms – Scientific workflow systems • e.g. Restflow, Kepler, Taverna, Galaxy – Other platforms • e.g. Akka, Python-based, … • … leveraging existing technologies 17SPNHC'15 Kurator/P
  • 18. How we do it (Part 3 of … ) • YesWorkflow – Grass-roots effort (goes well with stone-soup…) – Scripts + User annotations  Can give us much of ASAP! • Key Ideas: – Meet the tool makers and researchers (R, Python, …) – Make them workflow/dataflow thinkers … – … but giving them workflow benefits (ASAP!) – … via simple annotations! 18SPNHC'15 Kurator/P
  • 19. SKOPE: Synthesized Knowledge Of Past Environments SPNHC'15 Kurator/P 19 Bocinsky, Kohler et al. study rain-fed maize of Anasazi – Four Corners; AD 600–1500. Climate change influenced Mesa Verde Migrations; late 13th century AD. Uses network of tree-ring chronologies to reconstruct a spatio-temporal climate field at a fairly high resolution (~800 m) from AD 1–2000. Algorithm estimates joint information in tree-rings and a climate signal to identify “best” tree-ring chronologies for climate reconstructing. K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618 … implemented as an R Script
  • 20. User Comments: YW Annotations SPNHC'15 Kurator/P 20 @begin GO_Analysis @in hgCutoff @in … @out BP_Summl_file @out … @end GO_Analysis ...
  • 21. main fetch_mask input_mask_file load_data input_data_file standardize_with_mask land_water_mask NEE_data simple_diagnose standardized_NEE_data result_NEE_pdf Get 3 views for the price of 1! SPNHC'15 Kurator/P 21 result_NEE_pdf input_mask_file land_water_mask fetch_mask input_data_file NEE_data load_data standardized_NEE_data standardize_with_mask standardize_with_mask simple_diagnose fetch_mask land_water_mask load_data NEE_data standardize_with_mask standardized_NEE_data simple_diagnose result_NEE_pdf input_mask_file input_data_file Process view Data view Combined view
  • 22. Paleoclimate Reconstruction (EnviRecon.org) SPNHC'15 Kurator/P 22 • … explained using YesWorkflow! Kyle B., (computational) archaeologist: "It took me about 20 minutes to comment. Less than an hour to learn and YW-annotate, all-told."
  • 23. The Road ahead … • YesWorkflow: – … finishing support for retrospective provenance without using a runtime provenance recorder! – Key insight: scientists already leave provenance “bread crumbs” behind! (it’s not an accident!) – Exploit that via annotations: URI-templates • Kurator[/P]: – How far can we go towards ASAP via YW? SPNHC'15 Kurator/P 23
  • 25. YW-RECON: Prospective & Retrospective Provenance … (almost) for free! • YW annotations in the script (R, Python, Matlab) are used to recreate the workflow view from the script … SPNHC'15 Kurator/P 25
  • 26. YW-RECON: Prospective & Retrospective Provenance … (almost) for free! SPNHC'15 Kurator/P 26 • URI-templates link conceptual entities to runtime provenance “left behind” by the script author … • … facilitating provenance reconstruction
  • 27. Summary: Data Curation with Scientific Workflow Systems Scientific Workflows • [+] Automation • [+] Scalability • [+] Abstraction • [+] Provenance • … • [+/0] Easy to use – [0] learning a new paradigm • [-] Teaching resources: learning a new language! • [-] Special expertise needed for deep changes e.g. new Java actors, shims, … SPNHC'15 Kurator/P 27
  • 28. Kurator/P: Scripts + YesWorkflow ++ Scripts: [+] Automation, [0] Scalability, [-] Abstraction, [0/-] Provenance Now: Scripts + YesWorkflow Annotations • [+] Abstraction – explain your methods to mere mortals => encourage (re-)use • [+] Provenance: – YesWorkflow (prospective and retrospective provenance) • [+] Language independent (R, Matlab, Python, …) • [+] Empower tool makers (script programmers): give them … – … some immediate benefits (workflow views, retrospective provenance) – … some long term benefits: think about your methods differently => dataflow programming => [+] Scalability SPNHC'15 Kurator/P 28
  • 29. Acknowledgments • NSF-DBI #1356751 – Collaborative Research: ABI Development: Kurator: A Provenance-enabled Workflow Platform and Toolkit to Curate Biodiversity Data SPNHC'15 Kurator/P 29
  • 31. Date Validation • Check: – Collector’s life span – .. vs. Date-Collected • Possible outcomes: – Valid – Corrected – Unable to validate • Internal inconsistency – Contradicting dates • External inconsistency – Lack of date data 31SPNHC'15 Kurator/P
  • 32. … Logic Behind Each Step (cont’d) • Scientific Name Validation – Customer-dependent: • Collection Managers: – Nomenclature • Researchers: – Taxonomy (current names) – Several Remote services • IPNI, GNI, … • …. <your logic here> … 32SPNHC'15 Kurator/P
  • 33. Simplified Example Workflow • Related Research (Tianhong Song, UC Davis) – Analyze linear workflow “story” – Use patterns to discover wf design issues (e.g. use before update); then fix them – Parallelize when possible SPNHC'15 Kurator/P 33 • Allow easy assembly of such workflows • For tool makers • … and tool users • … scalability …
  • 35. … close up … SPNHC'15 Kurator/P 35
  • 36. FilteredPush Curation Provenance (Spreadsheet View) SPNHC'15 Kurator/P 36
  • 38. Related Research (Tianhong Song, UC Davis) • Analyze linear workflow “story” • Use patterns to discover wf design issues (e.g. use before update); then fix them • Parallelize when possible 38SPNHC'15 Kurator/P
  • 39. Contact me! • If you’re interested in a project, research theme (or similar ones): Send me email! – Email: ludaesch@illinois.edu SPNHC'15 Kurator/P 39