Franco Niccolucci: Example of an EOSCpilot Science Demonstrator - TextCrowd
1. Franco Niccolucci & Achille Felicetti
(PIN, University of Florence, Italy)
EOSC-hub Week 2018
Malaga, 16/4/2018
2. EOSCpilot is a project funded by the EC H2020 programme
Domain: Archaeology
Goal: semantic enrichment of texts
Archaeological documentation largely based on texts
◦ Excavation diaries, reports, surveys, grey literature
◦ Literary/historical sources, research articles, monographs …
◦ Huge number of small (<100 KB) files in different languages
Registry of 2,000,000 archaeological datasets (70% texts) in ARIADNE
ARIADNE’s data infrastructure popular among archaeologists
◦ ARIADNE users in 2016: 25-30% of the European research community
◦ Strong support by
Professional associations (EAA, EAC) & national archaeological/cultural heritage authorities
National research institutions (CNR, CNRS, CAS, ÖAW, KNAW, BAS, ATHENA RC, FORTH)
International recognition (USA, Mexico, Japan, Argentina)
Needed for cloud-based data infrastructure to be developed in ARIADNEplus
◦ Deeper integration between texts, databases, GIS etc.
◦ Advanced services & VREs for data-centric archaeological research
3.
NLP & NER OS engine
Syntactic rules (tailored to specific writing style)
Texts stating facts, not stories
◦ Data fuzziness, provenance, reliability, reasoning
Domain ontology: CIDOC CRM (ISO 21127:2006)
◦ ... and not TEI
Terminology
◦ Specialized vocabularies
Terra sigillata is not just “sealed earth”
◦ Gazetteers for modern (Geonames) and ancient (Pleiades) place names
Málaga (modern) vs Màlaka (Phoenician) vs Màlaca (Roman)
◦ Named time period management
Bronze Age (∼ 3200-600 BC), Recent Orientalizing Period (∼ 630-570 BC)
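As a minimal sketch of what named time period management can involve, the snippet below maps the two period names from this slide to their approximate year ranges and answers simple containment and overlap questions. The dictionary layout and the function names are illustrative assumptions, not TextCrowd's actual data model.

```python
from typing import Dict, List, Tuple

# Approximate bounds taken from the slide; negative years mean BC.
# The dictionary layout is an illustrative assumption.
PERIODS: Dict[str, Tuple[int, int]] = {
    "Bronze Age": (-3200, -600),
    "Recent Orientalizing Period": (-630, -570),
}


def periods_containing(year: int) -> List[str]:
    """Return the names of all periods whose range includes the given year."""
    return [name for name, (start, end) in PERIODS.items() if start <= year <= end]


def overlap(a: str, b: str) -> bool:
    """Return True when two named periods share at least one year."""
    a_start, a_end = PERIODS[a]
    b_start, b_end = PERIODS[b]
    return a_start <= b_end and b_start <= a_end


if __name__ == "__main__":
    print(periods_containing(-1000))
    print(overlap("Bronze Age", "Recent Orientalizing Period"))
```

Because named periods can overlap, a single year may resolve to more than one period, which is why the lookup returns a list rather than a single name.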
4.
Modular framework based on GATE toolchain: https://gate.ac.uk
◦ Advanced stemming/lemmatization components
OpenNLP (https://opennlp.apache.org): sentence segmentation and part-of-speech (POS) tagging
OpeNER (http://www.opener-project.eu): neural network for advanced named entity recognition (NER), developed in the OpeNER FP7 project
◦ Machine learning framework for self-learning
Annotated corpus required
Ontology: CRMarcheo (CRM extension for archaeology)
Vocabularies, gazetteers and terminological tools
◦ ICCD vocabularies for Italian archaeology, augmented with purpose-built term lists
◦ Geonames (modern places), Pleiades (historical places)
◦ Timespan and named period component based on PeriodO
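A gazetteer lookup of the kind listed above can be sketched as follows. This is a toy example under stated assumptions: the three entries are the place-name variants quoted earlier in the slides, the in-memory dictionary stands in for real Geonames or Pleiades data, and `normalise` and `find_places` are hypothetical helper names.

```python
import unicodedata
from typing import Dict, List, Tuple


def normalise(name: str) -> str:
    """Casefold and strip diacritics so accented and plain spellings compare equal."""
    decomposed = unicodedata.normalize("NFD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# Toy gazetteer: normalised variant -> (canonical name, label).
# Real entries would come from Geonames or Pleiades; identifiers are omitted.
GAZETTEER: Dict[str, Tuple[str, str]] = {
    normalise("Málaga"): ("Málaga", "modern"),
    normalise("Màlaka"): ("Màlaka", "Phoenician"),
    normalise("Màlaca"): ("Màlaca", "Roman"),
}


def find_places(text: str) -> List[Tuple[str, str, str]]:
    """Return (surface form, canonical name, label) for each token in the gazetteer."""
    hits = []
    for token in text.split():
        surface = token.strip(".,;:()\"'")
        entry = GAZETTEER.get(normalise(surface))
        if entry:
            hits.append((surface, entry[0], entry[1]))
    return hits


if __name__ == "__main__":
    for hit in find_places("The port of Malaka, later Malaca."):
        print(hit)
```

The sketch only handles single-token names; multi-word place names would need a phrase matcher, which a GATE-style gazetteer component typically provides.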
5.
TextCrowd detects:
◦ Artefacts
◦ Colours
◦ Materials
◦ Time periods
◦ Persons
◦ Places
◦ Sites
◦ Time spans
◦ Techniques
Target output formats:
◦ Textual documents automatically annotated and enriched
◦ CIDOC CRM semantic triples (RDF)
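To illustrate the RDF output format, the sketch below serialises one detected artefact and its material as N-Triples using CIDOC CRM terms. The slides do not show TextCrowd's actual mapping, so everything here is an assumption: the `http://example.org/textcrowd/` namespace is hypothetical, the class names follow CRM 6.x naming (`E22_Man-Made_Object`, `E57_Material`), and `P45_consists_of` is chosen only as a plausible artefact-to-material property.

```python
from typing import List

CRM = "http://www.cidoc-crm.org/cidoc-crm/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
BASE = "http://example.org/textcrowd/"  # hypothetical namespace for minted URIs


def literal(value: str) -> str:
    """Quote a plain string literal, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def triple(subject: str, predicate: str, obj: str) -> str:
    """Format one N-Triples line; obj must already be a <uri> or a quoted literal."""
    return f"<{subject}> <{predicate}> {obj} ."


def artefact_triples(artefact_id: str, label: str,
                     material_id: str, material_label: str) -> List[str]:
    """Describe an artefact, its material, and the link between them."""
    artefact = BASE + artefact_id
    material = BASE + material_id
    return [
        triple(artefact, RDF_TYPE, f"<{CRM}E22_Man-Made_Object>"),
        triple(artefact, RDFS_LABEL, literal(label)),
        triple(material, RDF_TYPE, f"<{CRM}E57_Material>"),
        triple(material, RDFS_LABEL, literal(material_label)),
        triple(artefact, f"{CRM}P45_consists_of", f"<{material}>"),
    ]


if __name__ == "__main__":
    for line in artefact_triples("artefact/1", "bowl", "material/1", "terra sigillata"):
        print(line)
```

Plain string formatting keeps the example dependency-free; a production pipeline would more likely build the graph with an RDF library and let it handle serialisation.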
6.
No annotated text corpora available in Italian to be used as training data for machine learning algorithms
◦ Manual annotation of 400 pages of Italian archaeology reports (< 1 Person-Month)
Preparation and adaptation of vocabularies
Availability of user-friendly cloud-based environments and of the necessary tools to migrate the standalone prototype to the cloud
◦ Several cloud solutions tested in early development; limited support provided except in D4Science
◦ Implementation in D4Science infrastructure, but portable to other cloud services if support and required modules are available
Authentication and Authorization
◦ No access control to metadata/data implemented so far
◦ Demonstrator focused on freely accessible textual documents
◦ Used Fasti Online (http://www.fastionline.org), an Open Access collection of archaeological reports
7.
Operated and maintained by CNR-ISTI on the D4Science platform
https://www.d4science.org
Modular engine based on GATE toolchain + OpenNLP-OpeNER modules, natively provided by D4Science
Web-based user interface for
◦ User and access management
◦ Cloud storage (private and shared files)
◦ Results available for other Virtual Research Environments (VRE) within D4Science
Released for open use, for tests & comments
No elaborate interface produced, so that it can adapt to any look and feel
8.
Machine-readable results: RDF encoding produced
Human-readable results: colour-coded text (for testing)
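One way to produce a colour-coded view of annotated text is sketched below: each annotated span is wrapped in an HTML `<span>` with a background colour chosen by entity type. The colour palette, the `(start, end, type)` annotation tuple, and the `highlight` function are all illustrative assumptions, not the format TextCrowd itself emits.

```python
import html
from typing import Dict, List, Tuple

# Hypothetical palette: one background colour per entity type.
COLOURS: Dict[str, str] = {
    "Artefact": "#ffd54f",
    "Material": "#a5d6a7",
    "Place": "#90caf9",
}

Annotation = Tuple[int, int, str]  # (start offset, end offset, entity type)


def highlight(text: str, annotations: List[Annotation]) -> str:
    """Wrap each non-overlapping annotated span in a coloured <span>."""
    parts = []
    cursor = 0
    for start, end, kind in sorted(annotations):
        if start < cursor:
            continue  # skip overlapping spans in this simple sketch
        parts.append(html.escape(text[cursor:start]))
        colour = COLOURS.get(kind, "#e0e0e0")
        parts.append(
            f'<span style="background:{colour}" title="{html.escape(kind)}">'
            f"{html.escape(text[start:end])}</span>"
        )
        cursor = end
    parts.append(html.escape(text[cursor:]))
    return "".join(parts)


if __name__ == "__main__":
    sample = "A bowl of terra sigillata"
    print(highlight(sample, [(2, 6, "Artefact"), (10, 25, "Material")]))
```

Escaping the text before inserting it into markup keeps characters such as `<` or `&` in the source report from breaking the rendered page.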
Interoperability of extracted knowledge
◦ Semantic information in CRM format: full integration and interoperability with other archaeological semantic data (to be fully implemented in ARIADNEplus)
Supporting FAIR Principles implementation
◦ Metadata to be stored in various registries for easy findability and accessibility
◦ Results ready to be reused within the same environment or consumed by other services and/or in different scenarios
9.
TEXTCROWD has proved useful for its main purpose: to demonstrate the importance and usefulness of EOSC for scientific research in the cultural heritage domain
Adoption by other research teams in the EOSCpilot framework
◦ Integration of TEXTCROWD with the new VisualMedia Demonstrator, a service for sharing and visualizing visual media files on the web: automatic metadata extraction from controlled lists or textual documents for 2D and 3D models
Testing on real use cases in progress
◦ Open Access papers of the Italian journal Archeologia e Calcolatori (ongoing)
Clean visualization
Language extension
◦ English, Dutch: from standalone to cloud-based (annotated corpora available)
◦ French, Spanish, German: new from scratch (annotated corpora to be prepared)
◦ Other EU languages: OpeNER extension required
Additional work required to suit it to everyday use – but not too much
10.
TEXTCROWD Official Pages:
https://eoscpilot.eu/science-demos/textcrowd
https://textcrowd.d4science.org
TEXTCROWD Pilot:
https://services.d4science.org/group/textcrowd/data-miner
(registration required)
11.
1. Upload the file(s) to analyze
2. Launch TextCrowd
3. Select the file(s) to process
4. Collect the results
18.
Franco Niccolucci: franco.niccolucci@gmail.com – Achille Felicetti: achille.felicetti@pin.unifi.it