Darwin Core Archive (DwC-A) validation: A New Collaborative Effort
1. Darwin Core Archive (DwC-A) validation: A New Collaborative Effort
Christian Gendreau, Université de Montréal / Canadensys
David P. Shorthouse, Université de Montréal / Canadensys
Marie-Élise Lecoq, GBIF France
Tim Robertson, GBIF
2. Darwin Core Archive (DwC-A)
The Darwin Core standard does not impose strict
rules on the content associated with any
Darwin Core term.
3. Current GBIF DwC-A Validator
Original goal
“… test Darwin Core Archives as specified in the
Darwin Core Text Guide.”
http://tools.gbif.org/dwca-validator/
4. Current GBIF DwC-A Validator
Original target
DwC-As are simple and can be created with
basic custom scripts.
“… make sure GBIF and others can read the
information as expected.”
5. Current GBIF DwC-A Validator
• Validates archive structure
• Offers a web presence
– Report viewer
– API
6. Next GBIF DwC-A Validator?
New goal
Extend validation to the content of the archive
https://github.com/gbif/dwca-validator
7. Current content validators
• Atlas of Living Australia sandbox
• VertNet – Spatial quality
• GBIF Spain – Darwin Test
• Encyclopedia of Life – dwc-validator
• Scratchpads – dwca-validator
• GlobalNames – dwc-archive ruby gem
• … and many more
See Appendix 1 for links
8. What do we need?
• Accommodate different scopes
• Configuration/customizations
– Use more knowledge when available
• Web access (page and API)
9. Scopes
• Data entry
• Desktop software
– Scientific workflow
– Statistical software
• Integrated Publishing Toolkit (IPT)
• National nodes
• Aggregators
10. Configuration/Customization
• Where will the validator be used?
• Can we provide more information?
– e.g. I know all the dates in my file should be ISO 8601
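A minimal sketch of how such a hint could be used, assuming dates are expected as ISO 8601 calendar dates (YYYY-MM-DD) in `eventDate`. The helper `is_iso_date` and the sample rows are hypothetical and not part of the validator.

```python
import re
from datetime import date

# Assumption: only the extended calendar form YYYY-MM-DD is accepted.
ISO_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

def is_iso_date(value):
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    match = ISO_DATE.fullmatch(value)
    if match is None:
        return False
    try:
        date(*map(int, match.groups()))  # rejects impossible dates such as 2014-02-30
    except ValueError:
        return False
    return True

rows = [
    {"occurrenceID": "1", "eventDate": "2014-10-20"},
    {"occurrenceID": "2", "eventDate": "20/10/2014"},
    {"occurrenceID": "3", "eventDate": "2014-02-30"},
]
# IDs of the rows that break the declared date format
invalid = [row["occurrenceID"] for row in rows if not is_iso_date(row["eventDate"])]
```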
15. Internals
• Validation types
– Structure
• Metadata
– Records : Rows
• Fields data (e.g. date, coordinates)
– Records : Columns
• ID uniqueness
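The difference between row-level and column-level checks can be sketched as follows: a field check looks at one record at a time, while ID uniqueness needs the whole column. The function name `duplicated_ids` and the choice of `occurrenceID` as the identifier are assumptions made for illustration.

```python
from collections import Counter

def duplicated_ids(records, id_term="occurrenceID"):
    """Column-level check: return ID values that occur more than once, in first-seen order."""
    counts = Counter(record.get(id_term) for record in records)
    return [value for value, n in counts.items() if n > 1]

records = [
    {"occurrenceID": "a"},
    {"occurrenceID": "b"},
    {"occurrenceID": "a"},
]
duplicates = duplicated_ids(records)
```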
16. Internals – Record level
• Validation chain
– Composed of chain elements
– Possible parallelism
17. Internals – Record level
• Immutable Chain element
– Self contained
• Never relies on another chain element
– Ordering independent
• Same behaviour wherever the element is used in the
chain
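A rough sketch of the record-level chain described above, assuming each element is an immutable object wrapping a function from one record to an optional error message. `ChainElement`, `validate` and `validate_all` are hypothetical names, not the project's API.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class ChainElement:
    """Immutable, self-contained check: it reads one record and nothing else."""
    name: str
    check: Callable[[dict], Optional[str]]  # returns an error message or None

def validate(record, chain):
    """Run every element on the record; elements never see each other's results."""
    messages = [element.check(record) for element in chain]
    return sorted(m for m in messages if m is not None)

def validate_all(records, chain, workers=4):
    """Records are independent of each other, so they can be validated in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda record: validate(record, chain), records))

has_name = ChainElement(
    "has_name",
    lambda r: None if r.get("scientificName") else "scientificName is missing",
)
has_country = ChainElement(
    "has_country",
    lambda r: None if r.get("country") else "country is missing",
)

records = [{"scientificName": "Acer rubrum"}, {"country": "Canada"}]
forward = validate_all(records, (has_name, has_country))
backward = validate_all(records, (has_country, has_name))  # same findings, other order
```

Because no element depends on another, reordering the chain changes nothing about what is reported.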
But what if I really need ordering?
19. Composition example
• Mandatory Latitude/Longitude
– Check record completion on lat/long
– Check decimal lat/long value
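The composition above might look like this in a simplified form: two independent checks combined into one "mandatory latitude/longitude" rule. All function names are hypothetical, and the ranges used are the usual ±90 and ±180 degrees.

```python
COORDINATE_LIMITS = {"decimalLatitude": 90.0, "decimalLongitude": 180.0}

def _text(record, term):
    """Return the field as stripped text, or an empty string if it is absent."""
    value = record.get(term)
    return "" if value is None else str(value).strip()

def check_completion(record):
    """Both coordinates must be present and non-empty."""
    return [f"{term} is missing" for term in COORDINATE_LIMITS if not _text(record, term)]

def check_decimal(record):
    """Each provided coordinate must be a decimal number within its valid range."""
    errors = []
    for term, limit in COORDINATE_LIMITS.items():
        raw = _text(record, term)
        if not raw:
            continue  # absence is reported by check_completion, not here
        try:
            value = float(raw)
        except ValueError:
            errors.append(f"{term} is not a decimal number")
            continue
        if not -limit <= value <= limit:
            errors.append(f"{term} is out of range")
    return errors

def mandatory_lat_long(record):
    """Composite rule: completion check plus decimal value check."""
    return check_completion(record) + check_decimal(record)
```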
20. Configuration example
• Select mandatory Darwin Core terms
– scientificName must be provided
• Restrict bounding box
– decimalLatitude and decimalLongitude must be
between
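One way to picture such a configuration, as a sketch only: the `Config` class is hypothetical, and the bounding box values are illustrative placeholders, since the slide does not state the actual limits.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    """Hypothetical configuration object; not the validator's real format."""
    mandatory_terms: tuple = ("scientificName",)
    # (min_lat, max_lat, min_lon, max_lon): illustrative values, not from the slides
    bounding_box: tuple = (41.0, 84.0, -141.0, -52.0)

def validate(record, config):
    """Apply the configured mandatory terms and bounding box to one record."""
    errors = [
        f"{term} must be provided"
        for term in config.mandatory_terms
        if not record.get(term)
    ]
    min_lat, max_lat, min_lon, max_lon = config.bounding_box
    try:
        lat = float(record["decimalLatitude"])
        lon = float(record["decimalLongitude"])
    except (KeyError, TypeError, ValueError):
        errors.append("coordinates are missing or not decimal")
    else:
        if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
            errors.append("coordinates fall outside the configured bounding box")
    return errors
```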
21. Customization example
• Apply your own controlled vocabulary
– Use your own dictionary for a term
– ControlledVocabularyEvaluationRule
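The idea behind a controlled vocabulary rule can be sketched as below. This is a Python analogue written for illustration, not the `ControlledVocabularyEvaluationRule` class itself; the function name and the case-insensitive matching are assumptions.

```python
def controlled_vocabulary_rule(term, vocabulary):
    """Build a rule that accepts only values listed in the supplied dictionary."""
    allowed = frozenset(value.casefold() for value in vocabulary)

    def rule(record):
        value = record.get(term) or ""
        if value.strip().casefold() in allowed:
            return None  # value is in the vocabulary
        return f"{term}: {value!r} is not in the controlled vocabulary"

    return rule

# Usage with a user-supplied dictionary for basisOfRecord
basis_rule = controlled_vocabulary_rule(
    "basisOfRecord", ["PreservedSpecimen", "HumanObservation"]
)
```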
22. Extension Example
• Suggester, link to narwhal-processor
– Suède –> ISO 3166-2:SE
– URI –> http://sws.geonames.org/2661886
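A toy suggester along these lines, assuming a simple accent-insensitive dictionary lookup. It does not use or reproduce the narwhal-processor API; `normalize`, `suggest_country` and the one-entry dictionary are hypothetical, with the entry taken from the example above.

```python
import unicodedata

def normalize(text):
    """Lower-case and strip accents so 'Suède' and 'SUEDE' share one key."""
    decomposed = unicodedata.normalize("NFKD", text.strip().casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Tiny illustrative dictionary; the single entry mirrors the slide's example.
COUNTRIES = {
    normalize("Suède"): {
        "iso": "ISO 3166-2:SE",
        "uri": "http://sws.geonames.org/2661886",
    },
}

def suggest_country(raw):
    """Return a suggestion for a raw country value, or None if nothing matches."""
    return COUNTRIES.get(normalize(raw))
```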
23. Collaborative
• Share configuration
• Share customization (dictionary)
• Implement new reusable component
– e.g. validation on specific DwC-A extension
24. Collaboration
• Where to go?
– https://github.com/gbif/dwca-validator
• Who can contribute?
– Everyone
• What is needed?
– Ideas, constructive comments
– Code review, feedback
25. Project status
• Not yet released
• Command line interface available
Follow the project on GitHub
27. Special thanks
• SiB Colombia
• SiB Brazil
• Peter Desmet
• John Wieczorek
• Dag Endresen
• …
28. Appendix 1
DwC Content validators
Atlas of Living Australia sandbox
http://sandbox.ala.org.au/datacheck/
VertNet – Spatial quality
Displayed on occurrence pages at
http://portal.vertnet.org/search
GBIF Spain – Darwin Test
http://www.gbif.es/darwin_test/Darwin_Test_in.php
Encyclopedia of Life – dwc-validator
http://services.eol.org/dwc_validator/
----- Meeting notes (2014-10-20 14:54) -----
examples
suggester: explain it
collaboration received
where to go
current state, timeline
current challenges, collaboration is needed