3. Why Digitise?
New technology offers many new possibilities:
• improves collection management
• opens up new avenues of research
• provides digital access to the collection
4. Digitisation at Naturalis
• goal: 7 million objects (out of 37 million) digitised by mid-2015, plus a robust infrastructure for continued digitisation
  • 3 million within the Naturalis digitisation streets
  • 4 million elsewhere
• the other 30 million objects will be digitised at a less detailed level
10. • Leposoma Guianense, Sipaliwini, 4 km e. of airport, near base camp, forest ground, among leaves, 28-VIII-1968, 12.45 u. reg. nr. 13879
• ask a computer to learn to segment and classify text snippets
11. • Manually annotate 500 text snippets (~3h)
• 300 for training
• 200 for testing
12. • 49,688 new database records (547,528 database cells) at ~84.57% accuracy
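The evaluation setup described above (500 annotated snippets, 300 for training, 200 for testing, accuracy counted per database cell) can be sketched as follows. This is a minimal illustration only: the function names, the placeholder snippets, and the two-cell example record are assumptions, not the project's actual code or data.

```python
import random


def split_annotations(snippets, n_train=300, seed=0):
    """Shuffle annotated snippets and split them into training and test sets."""
    shuffled = list(snippets)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def cell_accuracy(predicted, gold):
    """Fraction of database cells whose predicted value equals the annotation.

    Both arguments are lists of records; each record maps column name -> value.
    """
    total = correct = 0
    for pred_record, gold_record in zip(predicted, gold):
        for column, gold_value in gold_record.items():
            total += 1
            correct += pred_record.get(column) == gold_value
    return correct / total if total else 0.0


# Hypothetical stand-ins for the 500 manually annotated snippets.
annotated = [{"id": i} for i in range(500)]
train, test = split_annotations(annotated)

# Made-up example: one record with two cells, one of them predicted wrongly.
gold = [{"genus": "Leposoma", "country": "Suriname"}]
pred = [{"genus": "Leposoma", "country": "Surinam"}]
accuracy = cell_accuracy(pred, gold)
```

Counting per cell rather than per record means a record with one wrong field still contributes its correct fields to the score.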
13. The Manually Created Reptiles and Amphibians Database
• 16,870 records describing characteristics and history of animal specimens in a natural history database
• 39 columns
• Dutch, English, German and Portuguese
• numeric and textual values (both atomic and elaborate)
14.
column name     | value
order           | Anura
genus           | Megophrys
country         | Indonesia
biotope         | in rain near road
collection date | 01.02.1888
type            | holotype
determinator    | A. Dubois
defined by      | (Linnaeus, 1758)
special remarks | in bad condition, was eaten by Leptodactylus rugosus (3023) at night and thrown up again the next morning when killed, partly digested
16. • a database provides structure
• computers are good at comparing values
• statistical methods can detect inconsistencies
17.
author         | determinator   | family      | genus   | country   | preservation method
(Daudin, 1802) | ------         | Bataguridae | Anolis  | Cambodja  | (shield, dry)
(Schlegel)     | G. vd. Boog    | Colubridae  | Geophis | Indonesia | -----
Schneider      | M. S. Hoogmoed | ------      | Bufo    | Suriname  | -----
(Horst, 1883)  | Tyler, M J     | Hylidae     | Litoria | ------    | alcohol
19.
author         | determinator   | family      | genus   | country   | preservation method
(Daudin, 1802) | ------         | Bataguridae | Anolis  | Cambodja  | (shield, dry)
(Schlegel)     | G. vd. Boog    | Colubridae  | ?       | Indonesia | -----
Schneider      | M. S. Hoogmoed | ------      | Bufo    | Suriname  | -----
(Horst, 1883)  | Tyler, M J     | Hylidae     | Litoria | ------    | alcohol
actual value: Geophis
24.
author         | determinator   | family      | genus   | country   | preservation method
(Daudin, 1802) | ------         | Bataguridae | Anolis  | Cambodja  | (shield, dry)
(Schlegel)     | G. vd. Boog    | Colubridae  | ?       | Indonesia | -----
Schneider      | M. S. Hoogmoed | ------      | Bufo    | Suriname  | -----
(Horst, 1883)  | Tyler, M J     | Hylidae     | Litoria | ------    | alcohol
actual value: Geophis
predicted value: Rhapdophis
25. • fewer than 100 cells to check per column instead of 16,870
• recall (estimate): 90-100%
• one-size-fits-all
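The idea behind the preceding slides (hide a cell, predict its value from the rest of the database, and flag it for checking when prediction and actual value disagree) can be sketched as below. The overlap-weighted vote is a simple stand-in chosen for illustration, not necessarily the statistical method used in the project, and the rows are made-up toy data.

```python
from collections import Counter


def predict_cell(rows, index, target, features):
    """Predict rows[index][target] by voting among all other rows.

    Each other row votes for its own target value, weighted by the number
    of feature columns it shares with the row being checked.
    """
    query = rows[index]
    votes = Counter()
    for i, row in enumerate(rows):
        if i == index or row.get(target) is None:
            continue
        overlap = sum(
            1 for f in features
            if query.get(f) is not None and row.get(f) == query.get(f)
        )
        if overlap:
            votes[row[target]] += overlap
    return votes.most_common(1)[0][0] if votes else None


def suspicious_cells(rows, target, features):
    """Return (row index, actual, predicted) wherever the two disagree."""
    flagged = []
    for i, row in enumerate(rows):
        predicted = predict_cell(rows, i, target, features)
        if predicted is not None and row.get(target) != predicted:
            flagged.append((i, row.get(target), predicted))
    return flagged


# Toy rows (assumed data); row 3 has a genus that does not fit its family.
rows = [
    {"family": "Hylidae", "country": "Suriname", "genus": "Hyla"},
    {"family": "Hylidae", "country": "Suriname", "genus": "Hyla"},
    {"family": "Hylidae", "country": "Suriname", "genus": "Hyla"},
    {"family": "Hylidae", "country": "Suriname", "genus": "Bufo"},
    {"family": "Bufonidae", "country": "Indonesia", "genus": "Bufo"},
    {"family": "Bufonidae", "country": "Indonesia", "genus": "Bufo"},
]
flagged = suspicious_cells(rows, "genus", ["family", "country"])
```

Only the flagged cells go to a human for review, which is where the reduction in checking effort per column comes from.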
30.
Challenge                        | Example
Ambiguous location name          | Amsterdam
Two or more location descriptors | Wakarusa, 24mi WSW of Lawrence
Topological nesting              | Moccassin Creek on Hog Island
Complex description              | Bupo [?Buso] River, 15 miles [24km] E of Lae
Linear feature measurement       | 16km (by road) N of Murtoa
Linear ambiguity                 | On the road between Sydney and Bathurst
Vague localities                 | Southeast Michigan
Changed political borders        | Yugoslavia
Historical place names           | British North Borneo
31. • Annotated geographical information in 200 randomly selected database records
• 50 records for development, 150 for testing
32. Knowledge-driven Georeferencing
• Record retrieval
• Text parsing
• Gazetteer lookup
• Offset calculation
• Disambiguation heuristics
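The text parsing, gazetteer lookup and offset calculation steps can be illustrated with the "24mi WSW of Lawrence" example from the challenge table. This is a rough sketch under stated assumptions: the regular expression, the one-entry gazetteer with approximate coordinates for Lawrence, and the small-distance offset formula are all illustrative choices, not the pipeline's actual implementation.

```python
import math
import re

# 16-point compass rose mapped to bearings in degrees (N = 0, clockwise).
POINTS = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
          "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]
BEARINGS = {name: i * 22.5 for i, name in enumerate(POINTS)}
KM_PER_UNIT = {"km": 1.0, "mi": 1.609344, "miles": 1.609344}
EARTH_RADIUS_KM = 6371.0

# "<distance><unit> ... <direction> of <place>", e.g. "24mi WSW of Lawrence".
OFFSET = re.compile(
    r"(\d+(?:\.\d+)?)\s*(km|miles|mi)\b.*?\b([NSEW]{1,3})\s+of\s+(.+)"
)


def parse_offset(text):
    """Return (distance in km, bearing in degrees, anchor place) or None."""
    match = OFFSET.search(text)
    if match is None or match.group(3) not in BEARINGS:
        return None
    distance_km = float(match.group(1)) * KM_PER_UNIT[match.group(2)]
    return distance_km, BEARINGS[match.group(3)], match.group(4).strip()


def apply_offset(lat, lon, distance_km, bearing_deg):
    """Shift a coordinate along a bearing (approximation for short distances)."""
    angular = distance_km / EARTH_RADIUS_KM
    bearing = math.radians(bearing_deg)
    dlat = angular * math.cos(bearing)
    dlon = angular * math.sin(bearing) / math.cos(math.radians(lat))
    return lat + math.degrees(dlat), lon + math.degrees(dlon)


# Hypothetical gazetteer with approximate coordinates (assumed values).
GAZETTEER = {"Lawrence": (38.97, -95.24)}

distance_km, bearing_deg, anchor = parse_offset("Wakarusa, 24mi WSW of Lawrence")
lat, lon = apply_offset(*GAZETTEER[anchor], distance_km, bearing_deg)
```

A description such as "16km (by road) N of Murtoa" parses with the same pattern, but the straight-line offset then only approximates a distance that was measured along a road.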
34. Disambiguation Heuristics
• Spatial minimality
  • if Amsterdam and Utrecht are mentioned in the same record, then Amsterdam, NL is more likely than Amsterdam, NY, USA
• Expedition clusters
  • it is unlikely that a collector was collecting in Europe on Monday and in the US on Tuesday
• Species occurrence data
  • GBIF can tell us where a certain species does or does not occur
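The spatial minimality heuristic can be sketched as picking, among all combinations of candidate locations for the place names in one record, the combination whose members lie closest together. The candidate lists and their approximate coordinates below are assumptions for illustration; a real system would draw them from a gazetteer.

```python
import math
from itertools import product


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def spatially_minimal(candidates):
    """Choose one candidate per place name, minimising total pairwise distance.

    candidates maps a place name to a list of (label, (lat, lon)) tuples.
    """
    names = list(candidates)
    best, best_cost = None, math.inf
    for combo in product(*(candidates[name] for name in names)):
        cost = sum(
            haversine_km(p[1], q[1])
            for i, p in enumerate(combo)
            for q in combo[i + 1:]
        )
        if cost < best_cost:
            best, best_cost = combo, cost
    return {name: label for name, (label, _) in zip(names, best)}


# Hypothetical gazetteer candidates with approximate coordinates.
candidates = {
    "Amsterdam": [
        ("Amsterdam, NL", (52.37, 4.90)),
        ("Amsterdam, NY, USA", (42.94, -74.19)),
    ],
    "Utrecht": [("Utrecht, NL", (52.09, 5.12))],
}
resolved = spatially_minimal(candidates)
```

The exhaustive search over combinations is fine for the handful of place names in a single record; the expedition-cluster and species-occurrence heuristics would add further evidence on top of this choice.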