Building a collaborative Machine Learning platform for the Dataverse network. Lecture by Slava Tykhonov (DANS-KNAW, the Netherlands), DANS seminar series, 29.03.2022
1. Metaverse for Dataverse
Building a collaborative Machine Learning platform for the
Dataverse network
Slava Tykhonov, R&D
(DANS-KNAW)
DANS seminar, 29.03.2022
2. What is Metaverse?
“A metaverse is a network of 3D virtual worlds focused on social connection. In
futurism and science fiction, it is often described as a hypothetical iteration of the
Internet as a single, universal virtual world that is facilitated by the use of virtual
and augmented reality headsets.”
“Access points for metaverses include general-purpose computers and
smartphones, in addition to augmented reality (AR), mixed reality, virtual reality
(VR), and virtual world technologies.”
Wikipedia
Where does Open Science fit into this vision?
3. Moving towards Open Science
Source: Citizen Science and Open Science Core Concepts and Areas of Synergy (Vohland and Göbel, 2017)
4. Time Machine project
● An international collaboration to bring 5000 years of European history to
life
● Digitising millions of historical documents, paintings and monuments
● The largest computer simulation ever developed
● An open access, interactive resource
● 600+ consortium members from European countries
● Top academic and research institutions
● Private sector partners from SMEs to international companies
“Our focus is on joint efforts on Big Data, artificial intelligence,
augmented reality and 3D, and the development of European
platforms in line with European values”.
Visit http://timemachine.eu
5. (Semantic) Web 3.0 as a new Internet
● Web 3.0 is democratized – it will be built on a decentralized blockchain protocol where there
is no centralized ownership of content, services, or platforms.
● It is semantic – the Semantic Web is not identical to Web 3.0, but is an underlying technology
for the third generation of the internet. It would allow multiple internet pages to be correlated
using a semantic protocol so that the relationships between pages are apparent, indexed, and
searchable.
● It may be spatial in nature – while Web 1.0 and Web 2.0 offered two-dimensional
experiences, Web 3.0 may be immersive and offer spatial experiences similar to the real
world. To achieve this, there will be a spatially interactive layer on top of the digital
information layer, which uses sensory triggers and controls like voice, gesture, biometric
commands, and others.
Source: XR Today
6. NFT (a non-fungible token) in Metaverse
“The metaverse is a future evolution of the Internet based on persistent, shared
virtual worlds in which people interact as 3D avatars.
Blockchain technology may provide the backbone of the metaverse, with
interoperable NFT assets that can be used across different metaverse spaces.”
Source: Decrypt
Open questions: can a Metaverse be created without common FAIR principles?
What should the semantic interoperability layer look like?
7. Vision: Semantic interoperability on the infrastructure level
We envision a situation where thousands of Dataverse instances on the web (due to
EOSC) can be searched for data simultaneously and will form a Data Lake.
The old dream of federated search / a universal catalogue can only be realised if:
(1) Cross-walks (mappings across different metadata schemes) are implemented
(2) In metadata schemes, we seek ways to enrich indexes with values from controlled
vocabularies
Standard response (centralized) = standardisation and harmonisation = repository
software, certain metadata standards, or certain controlled vocabularies
New response (distributed) = explore agile solutions (Proof of Concepts) which can be
implemented by different communities (even smaller ones), so we keep variety and still
enable integration by applying Linked Data technologies.
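The cross-walk idea in requirement (1) can be sketched in a few lines of Python. The scheme prefixes and field names below are purely illustrative (Dublin-Core-like source keys mapped onto DDI-like target keys), not an actual Dataverse crosswalk:

```python
# Minimal sketch of a metadata cross-walk between two schemes.
# The field names are illustrative only, not a real Dataverse mapping.
CROSSWALK = {
    "dc:title": "ddi:titl",
    "dc:creator": "ddi:AuthEnty",
    "dc:date": "ddi:distDate",
}

def map_record(record: dict) -> dict:
    """Translate a record from the source scheme to the target scheme,
    keeping unmapped fields under their original keys."""
    return {CROSSWALK.get(key, key): value for key, value in record.items()}

record = {"dc:title": "Clio Infra dataset", "dc:creator": "DANS"}
print(map_record(record))  # {'ddi:titl': 'Clio Infra dataset', 'ddi:AuthEnty': 'DANS'}
```

In the distributed approach, each community would publish such a mapping for its own scheme, and the Linked Data layer keeps the variety while enabling integration.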
8. Data Stations - Future Data Services
Dataverse is an API-based data platform and a key framework for Open Innovation!
9. Conceptual approach: building common infrastructure components
Dataverse Semantic API in release 5.6: https://github.com/IQSS/dataverse/releases/tag/v5.6
“Dataset metadata can be retrieved, set, and updated using a new, flatter JSON-LD format -
following the format of an OAI-ORE export (RDA-conformant Bags), allowing for easier transfer of
metadata to/from other systems (i.e. without needing to know Dataverse's metadata block and
field storage architecture). This new API also allows for the update of terms metadata“.
External controlled vocabulary support, developed by DANS in the SSHOC project, is
already integrated into the Dataverse core in release 5.7.
Proposal:
https://docs.google.com/document/d/1txdcFuxskRx_tLsDQ7KKLFTMR_r9IBhorDu3V_r445w/
Interfaces: http://github.com/gdcc/dataverse-external-vocab-support
Integrations: Wikidata, ORCID, MeSH, Skosmos vocabularies
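A hedged sketch of calling the Semantic API from Python: the host, DOI and token below are placeholders, and the endpoint path follows the `:persistentId` pattern used by the Dataverse native API (verify against the 5.6 release notes before relying on it):

```python
# Sketch: retrieve a dataset's metadata as flat JSON-LD via the Dataverse
# Semantic API (introduced in release 5.6). Host, DOI and token are placeholders.
import json
import urllib.request

def semantic_metadata_url(base_url: str, doi: str) -> str:
    """Build the Semantic API URL for a dataset identified by its persistent ID."""
    return f"{base_url}/api/datasets/:persistentId/metadata?persistentId={doi}"

def fetch_jsonld(base_url: str, doi: str, api_token: str) -> dict:
    """Fetch the JSON-LD metadata export for one dataset (needs a token for drafts)."""
    req = urllib.request.Request(
        semantic_metadata_url(base_url, doi),
        headers={"X-Dataverse-key": api_token},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example call (not executed here):
# fetch_jsonld("https://demo.dataverse.org", "doi:10.5072/FK2/EXAMPLE", "xxxx")
print(semantic_metadata_url("https://demo.dataverse.org", "doi:10.5072/FK2/EXAMPLE"))
```

Because the export is JSON-LD, the same response can be loaded directly into a triple store, which is what makes the format useful for transfer between systems.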
10. External controlled vocabularies in Dataverse
Any research community can run the same Dataverse instance with its own controlled vocabularies, linked in a FAIR way!
11. “Archive in a box” - SSHOC Dataverse
● fully automatic Dataverse deployment with a Traefik proxy
● Dataverse configuration managed through an environment file (.env)
● different Dataverse distributions with services of your choice, suitable for different
use cases
● external controlled vocabularies support (demo of CESSDA CMM metadata fields
connected to the Skosmos framework)
● MinIO support for cloud storage
● data previewers integrated into the distribution
● startup process managed through scripts located in the init.d folder
● automatic Solr reindexing
● external services integration via PostgreSQL triggers
● support for custom metadata schemes (CESSDA CMM, CLARIN CMDI, ...)
● built-in web interface localization using Dataverse language packs to support multiple
languages out of the box
https://github.com/IQSS/dataverse-docker
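The deployment is driven by that single environment file. A minimal sketch of what such a `.env` might contain; the variable names here are illustrative, and the repository ships the authoritative template:

```
# Hypothetical .env fragment for an "archive in a box" deployment
traefikhost=dataverse.example.org    # public hostname routed through the Traefik proxy
DATAVERSE_DB_USER=dataverse          # PostgreSQL credentials used by the containers
DATAVERSE_DB_PASSWORD=changeme
LANGUAGE=en                          # language pack selection for the web interface
```

Changing the distribution then comes down to editing this file and re-running the compose setup, rather than reconfiguring Dataverse by hand.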
12. Building Dataverse distributions
“Software distribution is the process of delivering software to the end user.
A distro is a collection of software components built, assembled and configured so that it can
essentially be used "as is". It is often the closest thing to turnkey form of free software. A distro may
take the form of a binary distribution, with an executable installer which can be downloaded from the
Internet. Examples range from whole operating system distributions to server and interpreter
distributions (for example WAMP installers).
Examples of software distributions include BSD-based distros (such as FreeBSD, NetBSD, OpenBSD,
and DragonflyBSD) and Linux-based distros (such as openSUSE, Ubuntu, and Fedora). “
Source: Wikipedia
We can build Dataverse-based distributions for research communities and link them into a distributed
data network to solve interoperability problems! EOSC, CESSDA, CLARIAH, DARIAH, ODISSEI,
… will have their own metadata schemes but use the same Dataverse technology.
13. Benefits of the Common Data Infrastructure (Distributions)
● maintenance costs will drop massively: the more organizations join, the less
expensive support becomes
● it’s distributed and sustainable, suitable for the future
● maintenance costs could be reallocated to training and further
development of new (common) features
● reuse of the same infrastructure components will improve the quality and
speed of knowledge exchange
● building multidisciplinary teams reusing the same infrastructure can bring new
insights and unexpected views
● the Common Data Infrastructure plays the role of a universal gravitation layer
for Data Science projects
(and so on…)
14. Historically, most datasets have been preserved in data silos (archives), not interlinked and
lacking standardization. There are cultural, structural and technological
challenges.
Solutions:
● Integrating Linked Data and Semantic Web technologies, pushing research
communities to share data and add more interoperability following the FAIR
principles
● Creating a standardized (meta)data layer for large-scale projects like Time
Machine and CoronaWhy
● Working on automatic metadata linkage to ontologies and external
controlled vocabularies in order to get it linked into the Linked Open Data Cloud
● Using knowledge graphs for Machine Learning
Supporting Semantic Web for Data
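The automatic-linkage step can be illustrated with Wikidata's public entity-search endpoint (the MediaWiki `wbsearchentities` action): a free-text metadata value is looked up and can then be stored as a Linked Open Data URI instead of a plain string. Only the request is constructed here; no network call is made:

```python
# Sketch of metadata linkage: build a Wikidata entity-search request for a
# free-text keyword, so the value can be replaced by a QID/URI after review.
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def wikidata_search_url(term: str, language: str = "en") -> str:
    """Build a wbsearchentities request returning candidate entities for a term."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"

print(wikidata_search_url("pandemic"))
```

In a real pipeline the returned candidates would go through a human-in-the-loop review before the link is written back into the metadata record.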
15. Why Artificial Intelligence?
Human resources are very expensive and scarce; it’s difficult to find
appropriate expertise in-house.
Solution:
● Building AI/ML pipelines for automatic metadata enrichment and
linkage prediction
● applying NLP for NER, data mining, topic classification, etc.
● building multidisciplinary knowledge graphs should facilitate the
development of new projects with economic and social scientists; they will
take ownership of their own data if they see value (Clio Infra)
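To make the NER step concrete, here is a toy, dictionary-based tagger over free-text metadata. A real pipeline would use a trained NER model; the gazetteer and labels below are illustrative only:

```python
# Toy sketch of the NER step in a metadata-enrichment pipeline:
# dictionary (gazetteer) lookup instead of a trained model.
GAZETTEER = {
    "dans": "ORG",
    "dataverse": "SOFTWARE",
    "netherlands": "LOC",
}

def tag_entities(text: str) -> list:
    """Return (token, label) pairs for tokens found in the gazetteer."""
    entities = []
    for token in text.split():
        clean = token.strip(".,;:")          # drop trailing punctuation
        label = GAZETTEER.get(clean.lower())
        if label:
            entities.append((clean, label))
    return entities

print(tag_entities("DANS runs Dataverse in the Netherlands."))
# [('DANS', 'ORG'), ('Dataverse', 'SOFTWARE'), ('Netherlands', 'LOC')]
```

The recognized entities are exactly what the previous slide's linkage step would then try to resolve against external vocabularies.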
16. How to control Artificial Intelligence
Problem:
It’s naive to fully trust Machine Learning and AI; we need to support “human
in the loop” processes to keep control over automatic workflows. Ethics is
also important, for example the fake-detection problem.
Solution:
Many “human in the loop” tools have already been developed in research
projects; we need to support the best ones for different use cases, add the
appropriate maturity (for example, with CI/CD) and introduce them to research
communities.
17. Human-in-the-Loop for Machine Learning
“Computers are incredibly fast, accurate
and stupid; humans are incredibly slow,
inaccurate and brilliant; together they
are powerful beyond imagination."
Albert Einstein
“A combination of AI and Human
Intelligence gives rise to an extremely
high level of accuracy and intelligence
(Super Intelligence)”
Source: Hackernoon.com
18. Human in the loop explained
General blueprint for a human-in-the-loop interactive AI system. Credits: Stanford University HAI
The question shifts from “how do we build a smarter system?” to “how do we incorporate useful,
meaningful human interaction into the system?”
19. Hypothes.is annotations as a peer review service
1. The AI pipeline performs domain-specific entity extraction and ranking of relevant papers.
2. Entities and statements are added automatically; important fragments should be highlighted.
3. Human annotators verify the results and validate all statements.
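The human-verification step can build on the public Hypothes.is API: annotations anchored to a paper are retrieved through its `/api/search` endpoint with a `uri` parameter. Only the request is constructed here; the document URI is a placeholder:

```python
# Sketch: build a Hypothes.is API search request for all annotations
# anchored to one document, for human annotators to review.
from urllib.parse import urlencode

HYPOTHESIS_API = "https://api.hypothes.is/api/search"

def annotation_search_url(document_uri: str, limit: int = 20) -> str:
    """Build a search request for annotations on the given document URI."""
    return f"{HYPOTHESIS_API}?{urlencode({'uri': document_uri, 'limit': limit})}"

print(annotation_search_url("https://doi.org/10.5072/FK2/EXAMPLE"))
```

The JSON response lists each annotation's text, selectors and author, which is the raw material for validating the automatically added statements.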
21. Dataverse network as solution for Metaverse
- persistency: how to archive and move an object (artifact) from one
metaverse space to another virtual world
- how persistent identifiers (PIDs) in Dataverse can solve the problem of PIDs for
files (from archiving to live representation)
- the interoperability layer should allow building smart contracts and tracking their
usage
Problem: this requires too many human resources, and adoption time is about 3-5 years.
How to speed up?
28. Building Metaverse: COVID-19 Museum
● The idea to create a ‘museum’ of COVID-19 objects emerged at the
Université de Paris, France, and is directed by Prof. Yves Rozenholc
● the COVID-19 Museum is envisioned as a public service and will respect
the ownership of all digital artefacts
● the aim is to create a virtual space where all metadata related to the
pandemic is collected and interlinked
● while it starts as a French initiative, the vision of its makers is to turn it
into an international effort in multiple languages
● the museum should bring together researchers and artists, curious and
critical, amateurs and professionals, around their objects of care, study
and astonishment
29. COVID-19 Museum demonstrator
Instagram workflow:
create a subcollection of images related to a few COVID-19 related keywords like ‘nurse’,
‘art’, ‘museum’, etc. Extraction and storage of metadata in Dataverse collections should be
done without violating copyright, by keeping the original references to the artifacts.
COVID-19 pandemic coverage in newspapers:
collect news articles from the French news media;
for each article, extract important public attributes such as the title, date, summary,
illustrating image and a few lines of publicly available text.
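Many news sites expose exactly these public attributes through standard Open Graph `<meta>` tags, which can be harvested with the standard library alone. The sample HTML below is invented for illustration, and real sites vary in which tags they provide:

```python
# Sketch: extract public article attributes (title, image, ...) from
# Open Graph <meta property="og:..."> tags in a news page's HTML.
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            if a.get("property", "").startswith("og:"):
                self.meta[a["property"]] = a.get("content", "")

html = """<html><head>
<meta property="og:title" content="Covid-19: les soignants en premiere ligne">
<meta property="og:image" content="https://example.org/nurse.jpg">
</head></html>"""

parser = MetaExtractor()
parser.feed(html)
print(parser.meta["og:title"])
```

Keeping only these public attributes, plus a reference back to the original article, is what lets the workflow respect copyright while still populating the Dataverse collection.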
Corona pages in Wikipedia:
collect attributes from wiki page revisions: date of change, author, and contribution by number of
characters;
realize a 2D timeline of this set of Wikipedia pages with respect to creation time and revision
history, offering a view of the historical evolution of every page.
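The per-revision attributes map directly onto the MediaWiki API (`prop=revisions` with `rvprop=timestamp|user|size`, which reports absolute page sizes). The sketch below turns such a revision list into per-revision character contributions for the 2D timeline; the sample data is invented for illustration:

```python
# Sketch of the Wikipedia workflow: convert absolute page sizes from a
# revision list (oldest-first) into per-revision character deltas.
def contribution_deltas(revisions: list) -> list:
    """Compute per-revision character deltas from absolute page sizes."""
    deltas, previous_size = [], 0
    for rev in revisions:
        deltas.append({
            "timestamp": rev["timestamp"],
            "user": rev["user"],
            "delta": rev["size"] - previous_size,  # characters added or removed
        })
        previous_size = rev["size"]
    return deltas

sample = [
    {"timestamp": "2020-01-05T10:00:00Z", "user": "A", "size": 1200},
    {"timestamp": "2020-02-01T09:30:00Z", "user": "B", "size": 1500},
]
print(contribution_deltas(sample))
```

Plotting `timestamp` against `delta` for every page in the set yields the envisioned 2D timeline of each page's evolution.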