
5 years of Dataverse evolution


1. 5 years of Dataverse evolution
   Slava Tykhonov, Senior Information Scientist, Research & Innovation meeting (DANS-KNAW), 26.01.2021
2. Dataverse-based Clio Infra collaboration platform (2015)
   Clio Infra functionality built on the Dataverse solution:
   - teams curate, share and analyze research datasets collaboratively
   - team members can share the responsibility to collect data on specific variables (for example, countries) and inform each other about changes and additions
   - the dataset version control system tracks changes in datasets
   - other researchers can download their own copy of the data if the dataset is published as Open Data
   Dataverse serves as a flexible metadata store, connected to the research dataset storage by a data processing engine.
3. Interactive Clio Infra Dashboard with data in Dataverse (2015)
4. DANS Dataverse 3.x migration (2016)
   Basic DataverseNL services:
   • Federated login for Netherlands institutions
   • Persistent Identifier services (DOI and Handle)
   • Integration with archival systems
   Applications:
   • Modern and historical world map visualisations
   • Data API and Geo API services for projects with data
   • Panel dataset constructor
   • Time series plots
   • Treemaps
   • Pie and chart visualisations
   • Descriptive statistics tools
5. Major challenges in providing services for researchers
   ● Maintenance concerns: who will be in charge after the project is finished?
   ● Infrastructure problems: how to install and run tools for researchers?
   ● Various interoperability issues: how to leverage data exchange between different systems and services?
   Plus software updates and bug fixing, licences, technical staff training, legal aspects and so on...
6. The influence of API standards on innovation
   Source: V. Tykhonov, “API Economy”
7. Interoperability in EOSC
   ● Technical interoperability is defined as the “ability of different information technology systems and software applications to communicate and exchange data”. It should allow systems “to accept data from each other and perform a given task in an appropriate and satisfactory manner without the need for extra operator intervention”.
   ● Semantic interoperability is “the ability of computer systems to transmit data with unambiguous, shared meaning. Semantic interoperability is a requirement to enable machine computable logic, inferencing, knowledge discovery, and data”.
   ● Organisational interoperability refers to the “way in which organisations align their business processes, responsibilities and expectations to achieve commonly agreed and mutually beneficial goals. Focus on the requirements of the user community by making services available, easily identifiable, accessible and user-focused”.
   ● Legal interoperability covers “the broader environment of laws, policies, procedures and cooperation agreements”.
   Source: EOSC Interoperability Framework v1.0
8. Open vs Closed Innovation
9. DANS Data Stations - Future Data Services
   Dataverse is an API-based data platform and a key framework for Open Innovation!
10. Dataverse architecture in a nutshell
    Basic components: database (PostgreSQL), search index (Solr) and web application (Glassfish/Payara).
    Simple but powerful! How about maintenance?
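As a minimal illustration of how small this stack's surface is, the sketch below (Python, assuming the public demo installation at demo.dataverse.org) talks to the web application through the Native API and to the Solr-backed Search API:

```python
import requests

BASE_URL = "https://demo.dataverse.org"  # assumed demo installation

# The web application (Payara) answers the Native API directly.
version = requests.get(f"{BASE_URL}/api/info/version").json()
print("Dataverse version:", version["data"]["version"])

# The Search API is backed by the Solr index.
results = requests.get(f"{BASE_URL}/api/search", params={"q": "climate"}).json()
print("Hits:", results["data"]["total_count"])
```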
11. Dataverse Docker module (CESSDA Dataverse, 2018)
    Source: https://github.com/IQSS/dataverse-docker
12. The Cathedral and the Bazaar
    “The Cathedral and the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary (abbreviated CatB) is an essay, and later a book, by Eric S. Raymond on software engineering methods, based on his observations of the Linux kernel development process and his experiences managing an open source project, fetchmail. It examines the struggle between top-down and bottom-up design.” (Wikipedia)
    Some important points:
    ● Smart data structures and dumb code works a lot better than the other way around
    ● When writing gateway software of any kind, take pains to disturb the data stream as little as possible—and never throw away information unless the recipient forces you to!
    ● Any tool should be useful in the expected way, but a truly great tool lends itself to uses you never expected
13. Principle of good enough
    The principle of good enough, or "good enough" principle, is a rule in software and systems design. It indicates that consumers will use products that are good enough for their requirements, despite the availability of more advanced technology. (Wikipedia)
    The KISS principle ("Keep It Simple, Stupid") provides a series of design rules, among them:
    ● Separate mechanisms from policy
    ● Write simple programs
    ● Write transparent programs
    ● Value developer time over machine time
    ● Make data complicated when required, not the program
    ● Build on potential users' expected knowledge
    ● Write programs which fail in a way that is easy to diagnose
    ● Prototype software before polishing it
    ● Make the program and protocols extensible
14. What should be simplified to make Dataverse “good enough”?
    “One-liner” installation requirements include:
    ● even users without any technical knowledge should be able to install it
    ● a simple, clear and transparent infrastructure ready for integration (Docker-based)
    ● a reverse proxy and load balancer (Nginx/Traefik) that can be set up both locally and on a remote host to run the Dataverse website
    Q: How do we cross the chasm? A: Let’s try to capture the mainstream!
15. Using Dataverse to fight against COVID-19
    1300+ people registered in the organization
16. Jupyter integration: dataset conversion to a pandas dataframe
    Can AI researchers read and reuse data directly from Dataverse in a collaborative way?
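A minimal sketch of that workflow: the Data Access API streams a tabular file and pandas takes it from there. The installation URL and the numeric datafile id are hypothetical placeholders:

```python
import io

import pandas as pd
import requests

BASE_URL = "https://demo.dataverse.org"  # hypothetical installation
FILE_ID = 12345                          # hypothetical datafile id

# The Data Access API streams the raw file content;
# format=original asks for the originally uploaded format.
resp = requests.get(f"{BASE_URL}/api/access/datafile/{FILE_ID}",
                    params={"format": "original"})
resp.raise_for_status()

# Load the CSV bytes straight into a pandas dataframe.
df = pd.read_csv(io.BytesIO(resp.content))
print(df.describe())
```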
17. Crossing the chasm...
    Technology adoption requires further automation of all processes. Our goal is to deliver a production-ready Dataverse for the European Open Science Cloud (EOSC):
    ● SSHOC project: Docker/Kubernetes, common CI/CD pipeline, integration tests, previewers, language localization, external tools
    ● EOSC Synergy: Software Quality Assurance as a Service (SQAaaS) pipeline integration
    ● CLARIAH: leveraging the metadata schema with the CLARIN community, CLARIN tools integration, development of common pipelines
    ● FAIRsFAIR: enabling FAIR Data Points (FDP) in Dataverse
    ● ODISSEI: using Dataverse as a data registry
18. Services in the European Open Science Cloud (EOSC)
    ● EOSC requires at least maturity level 8 (TRL 8)
    ● we need the highest quality of software to be accepted as a service
    ● clear and transparent evaluation of services is essential
    ● evidence of technical maturity is the key to success
    ● a limited warranty will allow out-of-warranty services to be stopped
19. Running Dataverse in production on Cloud
    [Architecture diagram: users reach an HTTP(S) load balancer in front of a Kubernetes Engine cluster; the cluster nodes run Dataverse, PostgreSQL, Solr, Certbot (cronjob) and email relay deployments with matching services, with container images pulled from a container registry]
20. Dataverse Kubernetes
    Project maintained by Oliver Bertuch (FZ Jülich) and available in the Global Dataverse Community Consortium (GDCC) GitHub.
    Google Cloud, Amazon AWS and Microsoft Azure platforms are supported.
    Open Source, community pull requests are welcome: http://github.com/IQSS/dataverse-kubernetes
21. SQA process with Selenium tests for Dataverse
    Selenium IDE allows you to create and replay all UI tests in your browser. Shared tests can be reused by the community to increase reproducibility.
    SQA for service maturity = unit tests + integration tests
    Source: SSHOC project, data repositories task WP5.2
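The kind of test Selenium IDE records can also be replayed with the Selenium Python bindings. A minimal sketch follows; the instance URL and the name of the search input ("q") are assumptions for illustration, not taken from the Dataverse codebase:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Minimal UI smoke test: open a Dataverse landing page and run a search.
driver = webdriver.Firefox()
try:
    driver.get("https://demo.dataverse.org")  # hypothetical instance
    box = driver.find_element(By.NAME, "q")   # assumed name of the search input
    box.send_keys("covid")
    box.submit()
    assert "covid" in driver.page_source.lower()
finally:
    driver.quit()
```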
22. CI/CD pipeline with SQAaaS
    [Diagram: a GitHub webhook triggers a Jenkins pipeline (Jenkinsfile) that runs the SQAaaS checks, builds a Docker image, pushes it to the GCP container registry and updates the Kubernetes deployment]
    1. Developer pushes code to GitHub
    2. Jenkins receives a notification - build trigger
    3. Jenkins clones the workspace
    4. (S) Runs SQA tests and does a FAIRness check
    5. (S) Issues a digital badge according to the results
    6. (S) SQAaaS API triggers the appropriate workflow
    7. Creates a Docker image on success
    8. Pushes the new Docker image to the container registry
    9. Updates the Kubernetes deployment
    Source: EOSC Synergy project
23. Data Commons is essential for integrations
    Source: Mercè Crosas, “Harvard Data Commons”
24. FAIR Dataverse
    Source: Mercè Crosas, “FAIR principles and beyond: implementation in Dataverse”
25. Our goals to increase Dataverse interoperability
    Provide a custom FAIR metadata schema for European research communities:
    ● CESSDA metadata (Consortium of European Social Science Data Archives)
    ● Component MetaData Infrastructure (CMDI) metadata from the CLARIN linguistics community
    Connect metadata to ontologies and controlled vocabularies (see the sketch below):
    ● link metadata fields to common ontologies (Dublin Core, DCAT)
    ● define semantic relationships between (new) metadata fields (SKOS)
    ● select available external controlled vocabularies for specific fields
    ● provide multilingual access to controlled vocabularies
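A minimal rdflib sketch of what such links could look like. The field name, namespace and vocabulary URI are illustrative, and rdfs:subPropertyOf is just one possible way to express the mapping to Dublin Core:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF, RDFS, SKOS

g = Graph()
EX = Namespace("https://example.org/dataverse/fields/")  # illustrative namespace

field = EX["topicClassification"]  # illustrative metadata field
g.add((field, RDF.type, RDF.Property))
# Link the custom metadata field to a common ontology term (Dublin Core).
g.add((field, RDFS.subPropertyOf, DCTERMS.subject))
# Point the field at an external controlled vocabulary (illustrative URI).
g.add((field, DCTERMS.source, URIRef("https://example.org/vocab/TopicClassification")))
# Language-tagged labels provide the multilingual access to terms.
g.add((field, SKOS.prefLabel, Literal("Topic", lang="en")))
g.add((field, SKOS.prefLabel, Literal("Onderwerp", lang="nl")))

print(g.serialize(format="turtle"))
```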
26. One metadata field can be linked to many ontologies
    The language switch in Dataverse will change the language of suggested terms!
27. The FAIR Signposting Profile
    Herbert Van de Sompel, https://hvdsomp.info
    Two levels of access to Web resources:
    ● level one provides a concise, minimal set of links by value in the HTTP Link header
    ● level two delivers a complete, comprehensive set of links by reference, i.e. in a standalone document (a link set)
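Both levels can be consumed with a few lines of Python, since requests parses the HTTP Link header into response.links; the landing page URL below is hypothetical:

```python
import requests

landing_page = "https://example.org/dataset/123"  # hypothetical landing page

# Level one: typed links served by value in the HTTP Link header.
resp = requests.head(landing_page)
for rel, link in resp.links.items():
    print(rel, "->", link["url"])

# Level two: a complete link set served by reference in a standalone document.
linkset = resp.links.get("linkset")
if linkset:
    full = requests.get(linkset["url"],
                        headers={"Accept": "application/linkset+json"})
    print(full.json())
```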
28. Dataverse (meta)data in a FAIR Data Point (FDP)
    ● a RESTful web service that enables data owners to expose their datasets using rich machine-readable metadata
    ● provides standardized descriptions (RDF-based metadata) using controlled vocabularies and ontologies
    ● the FDP specification is public
    Source: FDP
    The goal is to run an FDP on the Dataverse side (DCAT, CVs) and provide metadata export in RDF!
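A sketch of how a client could read that RDF metadata from an FDP via content negotiation, assuming a hypothetical FDP URL and Turtle output:

```python
import requests
from rdflib import Graph
from rdflib.namespace import DCTERMS

FDP_URL = "https://example.org/fdp"  # hypothetical FAIR Data Point

# An FDP serves machine-readable, RDF-based metadata; ask for Turtle.
resp = requests.get(FDP_URL, headers={"Accept": "text/turtle"})
g = Graph().parse(data=resp.text, format="turtle")

# List the titles of catalogs/datasets described in the metadata.
for subj, _, title in g.triples((None, DCTERMS.title, None)):
    print(subj, title)
```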
29. F-UJI Automated FAIR Data Assessment Tool
30. Dataverse localization with Weblate
    ● a service that connects files to Weblate in order to translate them in a structured way
    ● several options for project visibility: accept translations by the crowd, or only give access to a select group of translators
    ● Weblate indicates untranslated strings, strings with failing checks, and strings that need approval
    ● when new strings are added with an upgrade of Dataverse, Weblate can indicate which strings are new and untranslated
31. GUI translation with Weblate as a service
    Source: SSHOC Weblate
32. Dataverse App Store
    Data preview: DDI Explorer, spreadsheet/CSV, PDF, text files, HTML, images, video, audio, JSON, GeoJSON/Shapefiles/maps, XML
    Interoperability: external controlled vocabularies (CESSDA CV Manager)
    Data processing: NESSTAR DDI migration tool
    Linked Data: RDF compliance including a SPARQL endpoint (FDP)
    Federated login: eduGAIN, PIONIER ID
    CLARIN Switchboard integration: Natural Language Processing tools
    Visualization tools (maps, charts, timelines)
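For the Linked Data entry, such a SPARQL endpoint could be queried along these lines; the endpoint URL is hypothetical and SPARQLWrapper is a common Python client:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical SPARQL endpoint exposed alongside a Dataverse installation.
sparql = SPARQLWrapper("https://example.org/dataverse/sparql")
sparql.setQuery("""
    SELECT ?dataset ?title WHERE {
        ?dataset <http://purl.org/dc/terms/title> ?title .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)

# Print dataset URIs and titles from the result bindings.
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["dataset"]["value"], "-", row["title"]["value"])
```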
33. Dataverse and CLARIN tools integration
34. Make Data Count
    Make Data Count is part of a broader Research Data Alliance (RDA) Data Usage Metrics Working Group, which helped to produce a specification called the COUNTER Code of Practice for Research Data.
    The following metrics can be downloaded directly from the DataCite hub for datasets hosted by Dataverse installations:
    ● Total Views for a Dataset
    ● Unique Views for a Dataset
    ● Total Downloads for a Dataset
    ● Unique Downloads for a Dataset
    ● Citations for a Dataset (via Crossref)
    The Dataverse Metrics API is a powerful source for BI tools used for Data Landscape monitoring.
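Dataverse exposes these Make Data Count metrics through its native API. A sketch, assuming a hypothetical installation and dataset DOI, and the makeDataCount endpoint family documented in the Dataverse API guide:

```python
import requests

BASE_URL = "https://demo.dataverse.org"  # hypothetical installation
DOI = "doi:10.5072/FK2/EXAMPLE"          # hypothetical dataset DOI

# Fetch each Make Data Count metric for the dataset.
for metric in ("viewsTotal", "viewsUnique",
               "downloadsTotal", "downloadsUnique", "citations"):
    resp = requests.get(
        f"{BASE_URL}/api/datasets/:persistentId/makeDataCount/{metric}",
        params={"persistentId": DOI})
    print(metric, "->", resp.json().get("data"))
```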
35. Metrics for BI and integration with Apache Superset
    Source: Apache Superset (Open Source)
36. Apache Superset visualizations
37. Apache Airflow for Dataverse pipelines
    ● intended for acyclic processes, i.e. those that process data towards a point of "completion" (see the sketch below)
    ● a DAG (Directed Acyclic Graph) is a collection of all the tasks, organized in a way that reflects their relationships and dependencies
    ● an essential component for harvesting and depositing data
    ● the Airflow dashboard gives a clear overview and status of all running processes
    On the roadmap of the ODISSEI project!
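A minimal Airflow DAG sketch of such a pipeline; the task bodies are placeholders, but the harvest-before-deposit dependency shows how a DAG expresses relationships between tasks:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def harvest():
    # Placeholder: harvest metadata/records from a source repository.
    print("harvesting...")

def deposit():
    # Placeholder: deposit the harvested records into Dataverse.
    print("depositing...")

with DAG(dag_id="dataverse_harvest_pipeline",
         start_date=datetime(2021, 1, 1),
         schedule_interval="@daily",
         catchup=False) as dag:
    harvest_task = PythonOperator(task_id="harvest", python_callable=harvest)
    deposit_task = PythonOperator(task_id="deposit", python_callable=deposit)
    # The dependency arrow makes the graph acyclic: harvest runs before deposit.
    harvest_task >> deposit_task
```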
38. Conclusion
    Thanks to its open architecture and the use of open standards, the Dataverse team has managed to attract the best people, create a strong community, and build a product completely aligned with the principles of Open Innovation. Fit for the future and community-driven, it has every chance to “cross the chasm” and become a prominent FAIR data repository on all continents. Dataverse already has a very rich ecosystem for technological innovation that will allow it to integrate tools that don't exist yet.
    “Any tool should be useful in the expected way, but a truly great tool lends itself to uses you never expected”...
39. Questions?
    Slava Tykhonov, Senior Information Scientist
    vyacheslav.tykhonov@dans.knaw.nl
