Digitising Natural History
Marieke van Erp
marieke.van.erp@vu.nl
1
2
Why Digitise?
New technology offers many new possibilities
• improves collection management
• opens up new avenues of research
• digital collection access
3
Digitisation at Naturalis
• goal is to have 7 million objects (out of 37 million) digitised by mid-2015, plus a robust infrastructure for continuing digitisation
• 3 million within Naturalis digitisation streets
• 4 million elsewhere
• the other 30 million objects will be digitised at a less detailed level
4
5
6
7
• Leposoma Guianense, Sipaliwini, 4 km e. of airport, near base camp, forest ground, among leaves, 28-VIII-1968, 12.45 u. reg. nr. 13879
8
But what you really want...
| Genus | Species | Region | Location | Biotope | Date | Time | Reg # |
|---|---|---|---|---|---|---|---|
| Leposoma | Guianense | Sipaliwini | 4 km e. of airport, near base camp | forest ground, among leaves | 28-08-1968 | 12:45 | 13879 |
9
• Leposoma Guianense, Sipaliwini, 4 km e. of airport, near base camp, forest ground, among leaves, 28-VIII-1968, 12.45 u. reg. nr. 13879
• ask a computer to learn to segment and classify text snippets
10
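The learned segmenter itself is not shown in the slides. Purely to illustrate the target output, the hand-written rules below (a stand-in, not the machine learning approach of the talk) split the example snippet on commas and recognise the date, time and registration number with regular expressions. The function name `segment`, the field names and both patterns are assumptions made for this sketch.

```python
import re

SNIPPET = ("Leposoma Guianense, Sipaliwini, 4 km e. of airport, near base camp, "
           "forest ground, among leaves, 28-VIII-1968, 12.45 u. reg. nr. 13879")

ROMAN_MONTHS = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
                "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12}

# Hypothetical patterns written for this one notation only.
DATE_RE = re.compile(r"^(\d{1,2})-([IVX]+)-(\d{4})$")
TIME_REG_RE = re.compile(r"^(\d{1,2})\.(\d{2}) u\. reg\. nr\. (\d+)$")


def segment(snippet):
    """Split a label snippet into named fields using fixed, hand-written rules."""
    parts = [part.strip() for part in snippet.split(",")]
    genus, species = parts[0].split(maxsplit=1)
    record = {"genus": genus, "species": species, "region": parts[1]}
    free_text = []
    for part in parts[2:]:
        date_match = DATE_RE.match(part)
        time_match = TIME_REG_RE.match(part)
        if date_match and date_match.group(2) in ROMAN_MONTHS:
            day, month, year = date_match.groups()
            record["date"] = f"{int(day):02d}-{ROMAN_MONTHS[month]:02d}-{year}"
        elif time_match:
            hour, minute, reg_nr = time_match.groups()
            record["time"] = f"{int(hour):02d}:{minute}"
            record["reg_nr"] = reg_nr
        else:
            # Location and biotope cannot be told apart by these simple rules.
            free_text.append(part)
    record["location_and_biotope"] = ", ".join(free_text)
    return record


print(segment(SNIPPET))
```

Rules like these stop working as soon as the field order or the notation changes, which is one reason to learn the segmentation from annotated examples instead.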
• Manually annotate 500 text snippets (~3h)
• 300 for training
• 200 for testing
11
• 49,688 new database records (547,528 database cells) at ~84.57% accuracy
12
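The slide does not say how the accuracy figure was measured. Assuming it is the percentage of database cells filled with the correct value, it could be computed as in this sketch; `cell_accuracy` and the toy rows are hypothetical. The last lines only divide the two reported totals, which comes to roughly 11 cells per record.

```python
def cell_accuracy(predicted_rows, gold_rows):
    """Percentage of gold cells whose predicted value matches exactly."""
    total = correct = 0
    for predicted, gold in zip(predicted_rows, gold_rows):
        for column, gold_value in gold.items():
            total += 1
            if predicted.get(column) == gold_value:
                correct += 1
    return 100.0 * correct / total


# Toy data (made up): one of the four cells is wrong.
gold = [{"genus": "Leposoma", "date": "28-08-1968"},
        {"genus": "Bufo", "date": "01-02-1888"}]
predicted = [{"genus": "Leposoma", "date": "28-08-1968"},
             {"genus": "Bufo", "date": "02-01-1888"}]
print(cell_accuracy(predicted, gold))  # 75.0

# Figures from the slide: cells divided by records.
cells_per_record = 547_528 / 49_688
print(round(cells_per_record, 2))  # 11.02
```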
The Manually Created Reptiles and Amphibians Database
• 16,870 records describing characteristics and history of animal specimens in a natural history database
• 39 columns
• Dutch, English, German and Portuguese
• numeric and textual values (both atomic and elaborate)
13
| column name | value |
|---|---|
| order | Anura |
| genus | Megophrys |
| country | Indonesia |
| biotope | in rain near road |
| collection date | 01.02.1888 |
| type | holotype |
| determinator | A. Dubois |
| defined by | (Linnaeus, 1758) |
| special remarks | in bad condition, was eaten by Leptodactylus rugosus (3023) at night and thrown up again the next morning when killed, partly digested |
14
15
• a database provides structure
• computers are good at comparing values
• statistical methods can detect inconsistencies
16
| author | determinator | family | genus | country | preservation method |
|---|---|---|---|---|---|
| (Daudin, 1802) | ------ | Bataguridae | Anolis | Cambodja | (shield, dry) |
| (Schlegel) | G. vd. Boog | Colubridae | Geophis | Indonesia | ----- |
| Schneider | M. S. Hoogmoed | ------ | Bufo | Suriname | ----- |
| (Horst, 1883) | Tyler, M J | Hylidae | Litoria | ------ | alcohol |
17
| author | determinator | family | genus | country | preservation method |
|---|---|---|---|---|---|
| (Daudin, 1802) | ------ | Bataguridae | Anolis | Cambodja | (shield, dry) |
| (Schlegel) | G. vd. Boog | Colubridae | ? | Indonesia | ----- |
| Schneider | M. S. Hoogmoed | ------ | Bufo | Suriname | ----- |
| (Horst, 1883) | Tyler, M J | Hylidae | Litoria | ------ | alcohol |
actual value: Geophis
19
| author | determinator | family | genus | country | preservation method |
|---|---|---|---|---|---|
| (Daudin, 1802) | ------ | Bataguridae | Anolis | Cambodja | (shield, dry) |
| (Schlegel) | G. vd. Boog | Colubridae | ? | Indonesia | ----- |
| Schneider | M. S. Hoogmoed | ------ | Bufo | Suriname | ----- |
| (Horst, 1883) | Tyler, M J | Hylidae | Litoria | ------ | alcohol |
actual value: Geophis
predicted value: Rhapdophis
24
• <100 cells to check per column instead of 16,870
• recall (estimate): 90-100%
• one-size-fits-all
25
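The table slides hide one cell, predict its value from the rest of the row, and compare the prediction with the actual value. A minimal sketch of that idea follows. It uses a plain nearest-neighbour vote with an overlap count as similarity, chosen because the closing slide points to a k-nearest neighbour video; whether this matches the actual setup is an assumption. The function names and the toy rows are hypothetical.

```python
from collections import Counter


def overlap(row_a, row_b, target):
    """Number of columns other than the target on which two rows agree."""
    return sum(1 for col, val in row_a.items()
               if col != target and row_b.get(col) == val)


def predict_cell(rows, index, target, k=3):
    """Predict rows[index][target] by majority vote of the k most similar other rows."""
    query = rows[index]
    others = [row for i, row in enumerate(rows) if i != index]
    neighbours = sorted(others, key=lambda row: overlap(query, row, target),
                        reverse=True)[:k]
    votes = Counter(row[target] for row in neighbours)
    return votes.most_common(1)[0][0]


def suspicious_cells(rows, target, k=3):
    """Return (row index, actual value, predicted value) wherever the two disagree."""
    flagged = []
    for i, row in enumerate(rows):
        predicted = predict_cell(rows, i, target, k)
        if predicted != row[target]:
            flagged.append((i, row[target], predicted))
    return flagged


# Toy rows (made up for illustration, not taken from the database).
ROWS = [
    {"family": "Colubridae", "country": "Indonesia", "genus": "Rhapdophis"},
    {"family": "Colubridae", "country": "Indonesia", "genus": "Rhapdophis"},
    {"family": "Colubridae", "country": "Indonesia", "genus": "Rhapdophis"},
    {"family": "Colubridae", "country": "Indonesia", "genus": "Geophis"},
    {"family": "Hylidae", "country": "Suriname", "genus": "Hyla"},
    {"family": "Hylidae", "country": "Suriname", "genus": "Hyla"},
    {"family": "Hylidae", "country": "Suriname", "genus": "Hyla"},
]

print(suspicious_cells(ROWS, "genus"))  # [(3, 'Geophis', 'Rhapdophis')]
```

Only the flagged cells need to go to a human expert. A flag means "unusual given the rest of the data", not "wrong".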
• Data-driven cleaning cannot detect systematic errors
• Maybe systematics can help?
26
| subject | relation | object |
|---|---|---|
| specimen collection | occurs before | entry in museum |
| species | has broader term | genus |
| city | falls within | country |
27
• detects inconsistencies in database usage
• small scope
• high recall and precision within scope
• needs adapting for each new domain
28
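Two of the relations from the subject/relation/object slide can be written as explicit checks. The sketch below is one possible encoding, not the system from the talk: the field names, the `check_record` function and the small city-to-country lookup are all assumptions.

```python
from datetime import date

# Hypothetical lookup; a real system would draw on a gazetteer or ontology.
CITY_COUNTRY = {"Paramaribo": "Suriname", "Leiden": "Netherlands"}


def check_record(record):
    """Return a list of rule violations found in one database record."""
    problems = []
    # Rule: specimen collection occurs before entry in museum.
    if record["collection_date"] > record["museum_entry_date"]:
        problems.append("collection date is after museum entry date")
    # Rule: city falls within country.
    expected = CITY_COUNTRY.get(record["city"])
    if expected is not None and expected != record["country"]:
        problems.append(f"{record['city']} does not fall within {record['country']}")
    return problems


record = {
    "collection_date": date(1968, 8, 28),
    "museum_entry_date": date(1960, 1, 1),
    "city": "Paramaribo",
    "country": "Netherlands",
}
for problem in check_record(record):
    print(problem)
```

Each rule is precise but covers only one kind of error, which fits the slide's "small scope" and "needs adapting for each new domain".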
29
Disambiguating Locations
| Challenge | Example |
|---|---|
| Ambiguous location name | Amsterdam |
| Two or more location descriptors | Wakarusa, 24mi WSW of Lawrence |
| Topological nesting | Moccassin Creek on Hog Island |
| Complex description | Bupo [?Buso] River, 15 miles [24km] E of Lae |
| Linear feature measurement | 16km (by road) N of Murtoa |
| Linear ambiguity | On the road between Sydney and Bathurst |
| Vague localities | Southeast Michigan |
| Changed political borders | Yugoslavia |
| Historical Place Names | British North Borneo |
30
• Annotated the geographical information in 200 randomly selected database records
• 50 records for development, 150 for testing
31
Knowledge-driven Georeferencing
• Record retrieval
• Text parsing
• Gazetteer lookup
• Offset calculation
• Disambiguation Heuristics
32
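For the gazetteer lookup step, the results slide lists a "fuzzy" variant but does not say which matching method was used. As a stand-in, this sketch uses `difflib.get_close_matches` from the Python standard library against a tiny, made-up list of place names; the `lookup` function and the cutoff value are assumptions.

```python
import difflib

# Hypothetical gazetteer: just a handful of names taken from the examples.
GAZETTEER = ["Buso", "Lae", "Murtoa", "Lawrence", "Wakarusa"]


def lookup(name, cutoff=0.7):
    """Return the exact gazetteer entry, else the closest fuzzy match, else None."""
    if name in GAZETTEER:
        return name
    close = difflib.get_close_matches(name, GAZETTEER, n=1, cutoff=cutoff)
    return close[0] if close else None


print(lookup("Lae"))   # Lae
print(lookup("Bupo"))  # Buso
```

A lower cutoff finds more names but also produces more false matches, so the threshold is worth tuning on the development records.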
Offset
33
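An offset such as "24mi WSW of Lawrence" can be resolved by parsing the distance, the compass direction and the anchor place, then moving from the anchor's coordinates along that bearing. The sketch below uses the standard great-circle destination formula on a spherical Earth. The regular expression, the function names and the approximate coordinates for Lawrence, Kansas are assumptions for illustration, not the parser from the talk.

```python
import math
import re

COMPASS = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
           "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]
BEARING = {name: i * 22.5 for i, name in enumerate(COMPASS)}  # degrees from north
KM_PER_UNIT = {"km": 1.0, "mi": 1.609344}
EARTH_RADIUS_KM = 6371.0

# Hypothetical pattern: "<number><km|mi> <direction> of <place>".
OFFSET_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(km|mi)\s+([NESW]{1,3})\s+of\s+(.+)")


def parse_offset(text):
    """Return (distance in km, bearing in degrees, anchor place) or None."""
    match = OFFSET_RE.search(text)
    if match is None or match.group(3) not in BEARING:
        return None
    distance, unit, direction, place = match.groups()
    return float(distance) * KM_PER_UNIT[unit], BEARING[direction], place.strip()


def destination(lat, lon, distance_km, bearing_deg):
    """Point reached from (lat, lon) after distance_km along bearing_deg."""
    phi1, lam1 = math.radians(lat), math.radians(lon)
    theta = math.radians(bearing_deg)
    delta = distance_km / EARTH_RADIUS_KM  # angular distance
    phi2 = math.asin(math.sin(phi1) * math.cos(delta)
                     + math.cos(phi1) * math.sin(delta) * math.cos(theta))
    lam2 = lam1 + math.atan2(math.sin(theta) * math.sin(delta) * math.cos(phi1),
                             math.cos(delta) - math.sin(phi1) * math.sin(phi2))
    return math.degrees(phi2), math.degrees(lam2)


LAWRENCE = (38.9717, -95.2353)  # approximate, assumed gazetteer entry

km, bearing, place = parse_offset("Wakarusa, 24mi WSW of Lawrence")
print(place, bearing, destination(*LAWRENCE, km, bearing))
```

The pattern deliberately handles only the simplest form. Descriptions such as "16km (by road) N of Murtoa" or "15 miles [24km] E of Lae" do not match and would need further rules.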
Disambiguation Heuristics
• Spatial Minimality
  • if Amsterdam and Utrecht are mentioned in the same record, then Amsterdam, NL is more likely than Amsterdam, NY, USA
• Expedition clusters
  • It is unlikely that a collector was collecting in Europe on Monday and in the US on Tuesday
• Species occurrence data
  • GBIF can tell us where a certain species does or does not occur
34
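Spatial minimality can be sketched as choosing, among all combinations of candidate places for the names in one record, the combination whose points lie closest together. The example below is an illustration under assumptions: the candidate list, the approximate coordinates and the function names are made up, and the exhaustive search only suits records with a few names.

```python
import itertools
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def spatially_minimal(candidates):
    """Pick one candidate per name so that the summed pairwise distance is smallest."""
    names = list(candidates)
    best, best_spread = None, float("inf")
    for combo in itertools.product(*(candidates[name] for name in names)):
        spread = sum(haversine_km(p[1], q[1])
                     for p, q in itertools.combinations(combo, 2))
        if spread < best_spread:
            best, best_spread = combo, spread
    return {name: label for name, (label, _) in zip(names, best)}


# Approximate coordinates, assumed for illustration.
CANDIDATES = {
    "Amsterdam": [("Amsterdam, NY, USA", (42.94, -74.19)),
                  ("Amsterdam, NL", (52.37, 4.90))],
    "Utrecht": [("Utrecht, NL", (52.09, 5.12))],
}

print(spatially_minimal(CANDIDATES))
# {'Amsterdam': 'Amsterdam, NL', 'Utrecht': 'Utrecht, NL'}
```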
Species Occurrence Data
35
Results
| | Correct @5km | Correct @25km | Correct @100km | Mean distance off | Not Found |
|---|---|---|---|---|---|
| Baseline | 38.9 | 47.0 | 58.4 | 251.1 | 26.2 |
| + Google maps + fuzzy | 53.0 | 65.1 | 74.5 | 244.1 | 8.7 |
| + Spatial minimality | 59.1 | 71.8 | 77.2 | 171.1 | 7.4 |
| + Expedition | 59.1 | 71.8 | 77.2 | 171.1 | 7.4 |
| + GBIF | 61.7 | 74.5 | 79.9 | 114.5 | 7.4 |
36
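How the figures on this slide were computed is not stated. The sketch below shows one plausible way to derive such columns from per-record errors, assuming that the "Correct" and "Not Found" values are percentages of all test records and that the mean distance is taken over the found records only. The function name `evaluate` and the sample errors are hypothetical.

```python
def evaluate(errors_km, thresholds=(5, 25, 100)):
    """Summarise georeferencing errors; None marks a record with no location found."""
    total = len(errors_km)
    found = [error for error in errors_km if error is not None]
    report = {
        f"correct@{t}km": 100.0 * sum(error <= t for error in found) / total
        for t in thresholds
    }
    report["mean_distance_off"] = sum(found) / len(found) if found else None
    report["not_found"] = 100.0 * (total - len(found)) / total
    return report


# Made-up errors for four records, one of which was not found.
print(evaluate([1.0, 10.0, 50.0, None]))
```

Reporting several thresholds alongside the mean is useful because a few records resolved to the wrong continent can dominate the mean distance.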
Confidence
37
Generating Stories
38
Image source: http://www.gungeralv.org/dg/images/chapter1.JPG
Work in Progress
39
General Conclusions
• data cleaning is essential
• “digitising” a heritage collection is complicated
• don’t try to tame text
40
Thank you for your attention!
41
• CATCH: http://www.nwo.nl/catch
• MITCH: http://ilk.uvt.nl/mitch
• NewsReader: http://www.newsreader-project.eu
42
• More information about machine learning
  • Video explaining k-nearest neighbour algorithm: http://videolectures.net/aaai07_bosch_knnc/
  • Weka Toolkit: http://www.cs.waikato.ac.nz/ml/weka/
43

Orientation EBC 2013: Digitising Natural History
