7. Computational Provenance …
• Origin, processing history of artifacts
– data products, figures, ...
– also: underlying workflow
=> understand methods, dataflow, and dependencies
Ludäscher: Workflows & Provenance => Understanding 7
Climate Change Impacts
in the United States
U.S. National Climate Assessment
U.S. Global Change Research Program
12. • … NSF SKOPE: system and tools to discover,
access, analyze, visualize paleoenvironmental
data
– unprecedented ability to explore provenance
(detailed, comprehensible record of computational
derivation of results)
– for researchers, tinkerers, and modelers
• … NSF Whole Tale:
– leverage & contribute to existing CI to support the
whole tale (“living paper”), from workflow run to
scholarly publication
– integrate tools & CI (DataONE, Globus, iRODS,
NDS, ...) to simplify use and promote best
practices.
– driven by science WGs (Archaeology/SKOPE,
materials science, astro, bio ..)
Related Projects: NSF DataONE (ProvONE ..) + …
14. SKOPE: Synthesized Knowledge Of Past Environments
Bocinsky, Kohler et al. study the rain-fed maize agricultural niche of the Anasazi
– Four Corners; AD 600–1500. Climate change influenced the Mesa Verde migrations of the late
13th century AD. The study uses a network of tree-ring chronologies to reconstruct a
spatio-temporal climate field at fairly high resolution (~800 m) from AD 1–2000. The algorithm
estimates the joint information in tree rings and a climate signal to identify the “best”
tree-ring chronologies for climate reconstruction.
K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed
maize agricultural niche in the US Southwest. Nature
Communications. doi:10.1038/ncomms6618
… implemented as an R Script …
17. YW Demo Use Cases (IDCC’17)
Domain | Use case | Programming language | Provenance methods
Climate science | C3C4 | MATLAB | YW + MATLAB RunManager
Astrophysics | LIGO | Python | YW + NW (code-level)
Protein crystal samples | Simulate data collection | Python | YW + NW (code-level)
Biodiversity data curation | kurator-SPNHC | Python | YW-recon + YW-logging
Social network analysis | Twitter | Python | YW + NW (file-level)
Oceanography | OHIBC Howe Sound (multi-run multi-script) | R | YW + R RunManager
23. Hybrid Provenance:
YW Model + Runtime
Observables (file level)
• The YW model can be connected
with runtime observables
• => YW recon (provenance reconstruction)
• Here:
• What specific files were read,
written and where do they occur
in the workflow?
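The idea of connecting a workflow model to file-level observables can be sketched as follows. This is a minimal illustration, assuming the model declares file locations as path templates with `{variable}` placeholders (in the style of YW's `@uri` annotations); the port names, templates, and file paths are hypothetical, and this is not YesWorkflow's own implementation.

```python
import re

def template_to_regex(template):
    """Compile a path template such as 'run/raw/{sample}/image_{frame}.raw'.

    Each {name} placeholder becomes a named group matching one path segment
    fragment. Assumes every placeholder name occurs only once per template.
    """
    pattern = ""
    pos = 0
    for m in re.finditer(r"\{(\w+)\}", template):
        pattern += re.escape(template[pos:m.start()])
        pattern += "(?P<%s>[^/]+)" % m.group(1)
        pos = m.end()
    pattern += re.escape(template[pos:])
    return re.compile("^" + pattern + "$")

def reconstruct(model, observed_paths):
    """Return (port, path, bindings) for every observed file matching a port template."""
    matches = []
    for port, template in model.items():
        regex = template_to_regex(template)
        for path in observed_paths:
            m = regex.match(path)
            if m:
                matches.append((port, path, m.groupdict()))
    return matches

# Hypothetical model: workflow ports and the file templates declared for them.
MODEL = {
    "raw_image": "run/raw/{sample}/image_{frame}.raw",
    "corrected_image": "run/data/{sample}/image_{frame}.img",
}

# Hypothetical runtime observables: files found after a run.
OBSERVED = [
    "run/raw/DRT240/image_001.raw",
    "run/data/DRT240/image_001.img",
    "run/notes.txt",
]

RECON = reconstruct(MODEL, OBSERVED)
for port, path, bindings in RECON:
    print(port, path, bindings)
```

Each match answers the question on the slide for one file: which port of the workflow it belongs to, and which variable values its path encodes. Files that match no template (here `run/notes.txt`) are simply not part of the modeled dataflow.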
43. Adding YesWorkflow to DataONE
Yaxing’s script with
inputs & output
products
Christopher’s
YesWorkflow
model
Christopher using
Yaxing’s outputs as
inputs for his script
Christopher’s results
can be traced back all
the way to Yaxing’s
input
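Tracing a result back across two scripts amounts to following "derived from" links transitively. The sketch below assumes provenance has already been captured as a mapping from each output to its direct inputs; the file names are hypothetical stand-ins for Yaxing's and Christopher's artifacts, not names from the DataONE demo.

```python
from collections import deque

# Hypothetical provenance edges: each artifact maps to the inputs it was derived from.
DERIVED_FROM = {
    "yaxing_output.csv": ["yaxing_input.csv"],
    "christopher_result.png": ["yaxing_output.csv", "christopher_params.txt"],
}

def upstream(artifact, derived_from):
    """Return every artifact that `artifact` transitively depends on."""
    seen = set()
    queue = deque([artifact])
    while queue:
        current = queue.popleft()
        for parent in derived_from.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

LINEAGE = upstream("christopher_result.png", DERIVED_FROM)
print(sorted(LINEAGE))
```

Because Christopher's script consumes Yaxing's output, the lineage of his result includes Yaxing's original input even though his script never reads it directly.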
46. Whole Tale: What’s in a name?
(1) Whole Tale ⇔ Whole Story:
◦ Support (computational / data) scientists
◦ … along the complete research lifecycle
◦ ... from experiment to (new kind of) publication
◦ ... and back!
(2) Whole Tale ⇔ for the Long Tail of Science
–Easy sharing of your computational narratives, data, and
exec-env since 2017!
–Power applications for everyone!
47. Whole Tale Vision
• Can't reproduce result because:
• Don't know how to run analysis
• Can't get the software running
• Can't pay for the computer or compute
power the result was computed on
Source: Bryce Mecum, NCEAS (WT team)
54. Last but not least:
Non-unitary syntheses
of systematic knowledge
Please
@taxonbytes
Nico Franz
School of Life Sciences, Arizona State University
CIRSS Seminar – Center for Informatics Research in Science and Scholarship
February 17, 2017 – iSchool, University of Illinois Urbana-Champaign
@ http://www.slideshare.net/taxonbytes/franz-2017-uiuc-cirss-non-unitary-syntheses-of-systematic-knowledge
57. Use case 1.a. Aligning Microcebus + Mirza sec. MSW3 (2005)
"Taxonomic concept labels"
identify input concept regions
RCC–5 articulations provided
for each species-level concept
• Input visualization: MSW3 (2005) versus MSW2 (1993)
Source: Franz et al. 2016. Two influential primate classifications logically aligned. doi:10.1093/sysbio/syw023
58. • Alignment visualization: "grey means taxonomically congruent"
Use case 1.a. Aligning Microcebus + Mirza sec. MSW3 (2005)
59. One name &
congruent region
Many names &
congruent region
One name &
non-congruent regions
Many names &
non-congruent regions
New names &
exclusive regions
• Application of coverage constraint: parent-to-parent articulations (><) are
fully defined by alignment signal propagated from their respective children.
=> Sensible when complete sampling of children is intended.
Use case 1.a. Aligning Microcebus + Mirza sec. MSW3 (2005)
• Alignment visualization: "grey means taxonomically congruent"
60. 1 in 3 names is unreliable across MSW2/MSW3 classifications
Source: Franz et al. 2016. Two influential primate classifications logically aligned. doi:10.1093/sysbio/syw023
61. The 'consensus' The
'bible'
The (formerly)
federal
'standard'
The 'best', latest
regional flora
"Controlling the taxonomic variable"
Expert views
are in
conflict
"Just bad"
Source: Franz et al. 2016. Controlling the taxonomic variable: […]. RIO Journal. doi:10.3897/rio.2.e10610
62. The 'consensus' The
'bible'
The (formerly)
federal
'standard'
The 'best', latest
regional flora
Impact:
Name-based aggregation has created
a novel synthesis that nobody believes in
"Controlling the taxonomic variable"
"Just bad"
Source: Franz et al. 2016. Controlling the taxonomic variable: […]. RIO Journal. doi:10.3897/rio.2.e10610
63. The 'consensus' The
'bible'
The (formerly)
federal
'standard'
The 'best', latest
regional flora
"Controlling the taxonomic variable"
"Just
bad"
Expert views
are
reconciled
Solution:
Instead of aggregating
an artificial 'consensus',
build translation services
Source: Franz et al. 2016. Controlling the taxonomic variable: […]. RIO Journal. doi:10.3897/rio.2.e10610
65. Yi-Yun Cheng (1), Nico Franz (2), Jodi Schneider (1), Shizhuo Yu (3), Thomas Rodenhausen (4), Bertram Ludäscher (1)
(1) School of Information Sciences, University of Illinois at Urbana-Champaign
(2) School of Life Sciences, Arizona State University
(3) Department of Computer Science, University of California at Davis
(4) School of Information, University of Arizona
Agreeing to Disagree: Reconciling Conflicting Taxonomic Views
using a Logic-based Approach
Acknowledgments
Support of the authors’ research through the National Science
Foundation is kindly acknowledged (DEB-1155984, DBI-1342595, and
DBI-1643002). The authors thank Professor Kathryn La Barre for her
comments and suggestions. We would also like to thank Dr. Laetitia
Navarro and Jeff Terstriep for help with creating map overlays in QGIS.
CONCLUSION
• Our logic-based taxonomy alignment approach can be used to solve
crosswalking issues
We will be able to mitigate the membership condition problems that
occur in equivalent crosswalking.
• RCC-5 approach preserves the original taxonomies while providing an
alignment view
We can solve data integration problems that arise in the more
coarse-grained relative crosswalking, which is otherwise subject to
information loss.
• Our study also underscores the benefits of designing different
alignment workflows (Bottom up vs. Top-down) to match the needs
of specific taxonomy alignment problems
Bottom-up approach: seems to work well whenever we have non-
overlapping relationships at the leaf-level (lowest-level) articulations,
and we are not sure how the higher-level concepts should be aligned.
Top-down approach: seems favorable when there is an expectation of
certain higher-level articulations in conjunction with under-specified,
complex, and often overlapping leaf-level relations.
RELATED WORK
• Taxonomy Alignment Problems (TAP)
Taxonomies T1, T2 are inter-linked via a set of input articulations A,
defined as RCC-5 relations, to yield a “merged” taxonomy T3.
• Euler/X
Articulations – a constraint or rule that defines a relationship (a set
constraint) between two concepts from different taxonomies.
Region Connection Calculus (RCC-5)
Possible Worlds – When encoding and solving TAPs via ASP, the
different answer sets represent alternative taxonomy merge solutions
or possible worlds (PWs).
INTRODUCTION
Tina: Hey Amy, can you recommend a signature dish from where you
live?
Amy: Oh, definitely the half-smokes from the Northeast! They are
these tasty half-pork and half-beef sausages.
Tina: What a coincidence! We have half-smokes in the South, too!
Where do you live in the Northeast? New York? Boston?
Amy: Wrong guesses! Where do you live in the South?
Tina and Amy together: Washington, D.C.
[The two of them look at each other, confused.]
“In the face of incompatible information or data structures among
users or among those specifying the system, attempts to create
unitary knowledge categories are futile. Rather, parallel or multiple
representational forms are required…” (Bowker & Star, 2000).
CASE 1 RESULTS: CEN vs. NDC
• State-level alignments are all congruent (Bottom-up)
• Inferred new articulations for regional-level alignments
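The bottom-up result can be illustrated with plain set operations. The sketch below assumes that every state-level concept is congruent across the two maps, so each region can be treated as a set of state codes; the memberships are copied from the poster's Euler/X input for CEN and NDC. It is a set-based illustration only, not the logic-based reasoning that Euler/X performs, and the function and variable names are hypothetical.

```python
def rcc5(x, y):
    """Return the RCC-5 relation symbol between two non-empty sets."""
    if x == y:
        return "=="   # congruence
    if x > y:
        return ">"    # x properly includes y
    if x < y:
        return "<"    # x is properly included in y
    if x & y:
        return "><"   # partial overlap
    return "!"        # disjointness

def regions(table):
    """Turn {'Region': 'AA BB ...'} into {'Region': {'AA', 'BB', ...}}."""
    return {name: set(states.split()) for name, states in table.items()}

CEN = regions({
    "Northeast": "CT MA ME NH NJ NY PA RI VT",
    "Midwest": "IL IN IA KS MI MN MO NE ND OH SD WI",
    "South": "AL AR DE DC FL GA KY LA MD MS NC OK SC TN TX VA WV",
    "West": "AZ CA CO ID MT NV NM OR UT WA WY",
})

NDC = regions({
    "Northeast": "CT DC DE MD MA ME NH NJ NY PA RI VT",
    "Midwest": "IA IL IN KS MI MN MO ND NE OH SD WI",
    "Southeast": "AL AR FL GA KY LA MS NC SC TN VA WV",
    "Southwest": "AZ NM OK TX",
    "West": "CA CO ID MT NV OR WA WY UT",
})

# Region-level relation for every CEN/NDC pair, derived from state membership.
ALIGNMENT = {(c, n): rcc5(CEN[c], NDC[n]) for c in CEN for n in NDC}

for (c, n), rel in sorted(ALIGNMENT.items()):
    if rel != "!":
        print("CEN.%s %s NDC.%s" % (c, rel, n))
```

Under these assumptions the regional relations follow entirely from the state-level congruences, which is the sense in which the bottom-up approach infers higher-level articulations from leaf-level ones.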
CASE 2 RESULTS: CEN vs. TZ
Figure 3. (Left) CEN-NDC taxonomy alignment problem with 49 input articulations between TCEN and TNDC
Figure 4. (Right) The unique possible world (PW) T3 reconciling TCEN and TNDC via inferred relationships
Figure 1. National Diversity Council map (NDC) vs. Census Bureau map (CEN)
• Github link:
https://github.com/EulerProject/ASIST17
• Email: yiyunyc2@illinois.edu
[Map labels. NDC: West, Southwest, Southeast, Midwest, Northeast. CEN: West, South, Midwest, Northeast. TZ: Pacific, Mountain, Central, Eastern.]
RESEARCH DESIGN
Step 1. Supply input taxonomies T1 and T2
Step 2. Formulate RCC-5 articulations between T1 and T2
Step 3. Iteratively edit articulations in Euler/X
RCC-5 relations between two regions X and Y:
Congruence: X == Y
Inclusion: X > Y
Inverse Inclusion: X < Y
Overlap: X >< Y
Disjointness: X ! Y
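[Figure 2 diagram: taxonomies T1 and T2 plus articulations A go into Euler/X, which returns N possible worlds. N=1 yields the merged taxonomy T3; N=0 (inconsistent) or N>1 (ambiguous) leads back to Add/Edit Articulations A.]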
[Graph: combined concepts for CEN and TZ, with newly combined nodes such as CEN.South*TZ.Central. Nodes: CEN 4, newComb 18, comb 1, TZ 4. Edges: input 6, inferred 37.]
[Figure 3 graph: CEN and NDC input taxonomies linked by state-level congruence (==) articulations. Nodes: CEN 54, NDC 55. Edges: isa_CEN 53, isa_NDC 54, Art. 49.]
[Figure 4 graph: merged CEN/NDC taxonomy. Nodes: CEN 3, NDC 4, comb 51. Edges: input 61, inferred 3, overlapsinferred 3.]
[Figure 5 graph: CEN and TZ regions linked by input articulations (==, <, ><, !). Nodes: CEN 5, TZ 5. Edges: isa_CEN 4, isa_TZ 4, Art. 12.]
[Figure 6 graph: merged CEN/TZ taxonomy. Nodes: CEN 4, comb 1, TZ 4. Edges: input 7, overlapsinput 6, overlapsinferred 1.]
Figure 2. The process of aligning taxonomies T1 and T2 with Euler/X
Figure 5. Top-down input alignments between TCEN and TTZ
Figure 6. The unique PW for the TCEN with TTZ alignment
Figure 10. Combined concepts solution for TCEN and TTZ
taxonomy CEN Census_Regions
(USA Northeast Midwest South West)
(Northeast CT MA ME NH NJ NY PA RI VT)
(Midwest IL IN IA KS MI MN MO NE ND OH SD WI)
(South AL AR DE DC FL GA KY LA MD MS NC OK SC TN TX VA WV)
(West AZ CA CO ID MT NV NM OR UT WA WY)
taxonomy NDC National_Diversity_Council
(USA Midwest Northeast Southeast Southwest West)
(Northeast CT DC DE MD MA ME NH NJ NY PA RI VT)
(Midwest IA IL IN KS MI MN MO ND NE OH SD WI)
(Southeast AL AR FL GA KY LA MS NC SC TN VA WV)
(Southwest AZ NM OK TX)
(West CA CO ID MT NV OR WA WY UT)
articulations CEN NDC
[CEN.AL equals NDC.AL]
[CEN.AR equals NDC.AR]
[CEN.AZ equals NDC.AZ]
[CEN.CA equals NDC.CA]
[CEN.CO equals NDC.CO]
[CEN.CT equals NDC.CT]
[CEN.DC equals NDC.DC]
[CEN.DE equals NDC.DE]
[CEN.FL equals NDC.FL]
[CEN.GA equals NDC.GA]
[CEN.IA equals NDC.IA]
[CEN.ID equals NDC.ID]
[CEN.IL equals NDC.IL]
[CEN.IN equals NDC.IN]
[CEN.KS equals NDC.KS]
[CEN.KY equals NDC.KY]
[CEN.LA equals NDC.LA]
[CEN.MA equals NDC.MA]
[CEN.MD equals NDC.MD]
[CEN.ME equals NDC.ME]
[CEN.MI equals NDC.MI]
[CEN.MN equals NDC.MN]
...
taxonomy CEN Census_Regions
(USA Midwest South West Northeast)
taxonomy TZ Time_Zone
(USA Pacific Mountain Central Eastern)
articulations CEN TZ
[CEN.Midwest disjoint TZ.Pacific]
[CEN.Midwest overlaps TZ.Eastern]
[CEN.Midwest overlaps TZ.Mountain]
[CEN.Northeast is_included_in TZ.Eastern]
[CEN.South disjoint TZ.Pacific]
[CEN.South overlaps TZ.Central]
[CEN.South overlaps TZ.Eastern]
[CEN.South overlaps TZ.Mountain]
[CEN.USA equals TZ.USA]
[CEN.West disjoint TZ.Central]
[CEN.West disjoint TZ.Eastern]
[CEN.West overlaps TZ.Mountain]
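The articulation lines above follow a simple `[left relation right]` pattern, so they can be read into a structured form with a few lines of code. The sketch below is a hypothetical reader written for this listing, not the Euler/X parser; the mapping from keywords to RCC-5 symbols follows the relation legend on the poster, and the `includes` keyword does not appear in the listing and is assumed by symmetry with `is_included_in`.

```python
import re
from collections import Counter

# One articulation per line: [Taxonomy.Concept relation Taxonomy.Concept]
ARTICULATION = re.compile(r"^\[(\S+)\s+(\S+)\s+(\S+)\]$")

# Keyword-to-symbol mapping; 'includes' is an assumption (not in the listing).
SYMBOLS = {
    "equals": "==",
    "includes": ">",
    "is_included_in": "<",
    "overlaps": "><",
    "disjoint": "!",
}

CEN_TZ_INPUT = """
[CEN.Midwest disjoint TZ.Pacific]
[CEN.Midwest overlaps TZ.Eastern]
[CEN.Midwest overlaps TZ.Mountain]
[CEN.Northeast is_included_in TZ.Eastern]
[CEN.South disjoint TZ.Pacific]
[CEN.South overlaps TZ.Central]
[CEN.South overlaps TZ.Eastern]
[CEN.South overlaps TZ.Mountain]
[CEN.USA equals TZ.USA]
[CEN.West disjoint TZ.Central]
[CEN.West disjoint TZ.Eastern]
[CEN.West overlaps TZ.Mountain]
"""

def parse_articulations(text):
    """Return a list of (left, symbol, right) triples; reject malformed lines."""
    triples = []
    for line in text.strip().splitlines():
        m = ARTICULATION.match(line.strip())
        if m is None or m.group(2) not in SYMBOLS:
            raise ValueError("not an articulation: %r" % line)
        triples.append((m.group(1), SYMBOLS[m.group(2)], m.group(3)))
    return triples

ARTICULATIONS = parse_articulations(CEN_TZ_INPUT)
COUNTS = Counter(symbol for _, symbol, _ in ARTICULATIONS)
print(len(ARTICULATIONS), dict(COUNTS))
```

The tally gives 12 input articulations, 6 of them overlaps, which agrees with the "Art. 12" and "overlapsinput 6" counts reported for the CEN/TZ figures.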
66. Two Taxonomies: NDC vs CEN
“…in the face of incompatible information or data structures among users or among those
specifying the system, attempts to create unitary knowledge categories are futile. Rather, parallel
or multiple representational forms are required” [Bowker & Star, 2000, p.159]
National Diversity Council map (NDC) regions: West, Southwest, Southeast, Midwest, Northeast
US Census Bureau map (CEN) regions: West, South, Midwest, Northeast
Source: Yi-Yun (Jessica) Cheng (PhD student, iSchool @ Illinois)
72. How we align two taxonomies T1 and T2
• Step 1. Supply input taxonomies T1
and T2
• Step 2. Describe the relationships
between T1 and T2
• Step 3. Iteratively edit articulations
in Euler/X
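[Diagram: taxonomies T1 and T2 plus articulations A go into Euler/X, which returns N possible worlds. N=1 yields the merged taxonomy T3; N=0 (inconsistent) or N>1 (ambiguous) leads back to Add/Edit Articulations A.]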
• … but where do the articulations
come from??
– expert opinion
– automatically derived from data
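The decision at the heart of the iterative step depends only on how many possible worlds the reasoner reports. A minimal sketch of that decision, assuming the count N has already been obtained from a reasoner run; the function name and the wording of the returned messages are hypothetical.

```python
def diagnose(n_worlds):
    """Classify an alignment attempt by its number of possible worlds N."""
    if n_worlds < 0:
        raise ValueError("number of possible worlds cannot be negative")
    if n_worlds == 0:
        return "inconsistent: edit the articulations"
    if n_worlds == 1:
        return "unique: accept the merged taxonomy T3"
    return "ambiguous: add or edit articulations"

for n in (0, 1, 4):
    print(n, "->", diagnose(n))
```

In practice this check would sit inside a loop: revise the articulations, rerun the reasoner, and stop once exactly one possible world remains.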
84. Implications
• Logic-based taxonomy alignment approach
– Disambiguate name-based taxonomy alignment over time
• 40% of the concepts in biological taxonomies undergo
name changes over time (Franz et al., 2016)
– May mitigate problems in equivalent crosswalking
• Membership condition problem that was often criticized in
crosswalking
– Preserves the original taxonomies while providing an
alignment view
• Solve data integration problems that happen in the more
coarse-grained relative crosswalking
11/01/17
Cheng
https://github.com/EulerProject/ASIST17
yiyunyc2@illinois.edu
85. Some History
• … Aristotle …
• … Euler …
• …
• … Greg Whitbread …
• [BPB93] J. H. Beach, S. Pramanik, and J. H. Beaman. Hierarchic taxonomic databases. Advances in Computer Methods for Systematic Biology: Artificial Intelligence, Databases, Computer Vision, 1993.
• [Ber95] Walter G. Berendsohn. The concept of “potential taxa” in databases. Taxon, 44:207–212, 1995.
• [Ber03] Walter G. Berendsohn. MoReTax – Handling Factual Information Linked to Taxonomic Concepts in Biology. No. 39 in Schriftenreihe für Vegetationskunde. Bundesamt für Naturschutz, 2003.
• [GG03] M. Geoffroy and A. Güntsch. Assembling and navigating the potential taxon graph. In [Ber03], pages 71–82, 2003.
• [TL07] Thau, D., & Ludäscher, B. (2007). Reasoning about taxonomies in first-order logic. Ecological Informatics, 2(3), 195-209.
• [FP09] Franz, N. M., & Peet, R. K. (2009). Perspectives: towards a language for mapping relationships among taxonomic concepts. Systematics and Biodiversity, 7(1), 5-20.
• …