(Big) bibliographic data @ ScaDS project meeting - 2015-06-12

Presentation at Big Data Competence Centre Dresden/Leipzig (ScaDS)

  1. 1. (Big) Bibliographic Data UB Leipzig & SLUB Dresden ScaDS project meeting, 12.6.2015 Leander Seige, Felix Lohmeier, Ralf Talkenberger
  2. 2. “The library of the 21st century is a data hub.” quoted from an internal strategic paper of Leipzig University Library, 2015
  3. 3. simple bibliographic metadata ● <metadata>: title, author, isbn, publisher, year, … ● <resource>: books, serials, newspapers, articles, ...
  4. 4. <resource> book ● printed books on the library’s shelves ● bought ebooks ● licensed ebooks ● pay-per-use ebooks ● free content ● ebooks to be bought by the library (patron driven acquisition = pda) ● even printed books to be bought by the library (pda too)
  5. 5. <resource> journals ● printed journals on the library’s shelves ● many more licensed electronic journals ○ full text accessible via web interfaces ● do we have article metadata? ● yes: licensed journal articles: 10s of millions per library
  6. 6. <metadata> accessibility information ● where is a resource? (physical or on the net) ● who is allowed to access this content? (students? faculty? everyone?) ● is it available off-campus? ● did we buy it or is it just licensed? ● may the user copy or print it? ● is the library allowed to store the electronic file? ● may we grant access from wifi connections? ● ...or any combination of these...
  7. 7. <metadata> knowledge bases ● librarians built large knowledge bases to describe resources ● in German-speaking countries: GND (Gemeinsame Normdatei) of the German National Library (Deutsche Nationalbibliothek) http://www.dnb.de/EN/gnd ● international: http://viaf.org ● provide dbpedia-links to explore the linked data cloud and to enrich library data
  8. 8. <metadata> knowledge bases ● GND (and other national authority files via VIAF) ○ describe Persons, Corporate bodies, Conferences and Events, Geographic Information, Topics, Works and relationships between them ○ form a generic knowledge base, independent from any specific domain ○ provide links to other knowledge bases (dbpedia, geonames...)
  9. 9. resource discovery ● traditional “OPACs” provided access to traditional library resources like printed books; users had to use proprietary, vendor-driven portals to access electronic resources ● today, printed materials represent only a small part of library resources ● in contrast: resource discovery systems aim to integrate all resources of a library and present them in one single search interface
  10. 10. Cooperation ● UBL and SLUB joined forces in March 2015 ● Goals: a. Exchange of metadata after processing b. Develop common workflows to avoid “double work” → integrate existing tools finc & d:swarm
  11. 11. finc Community ● maintains a large search engine infrastructure ● developed and hosted at Leipzig University Library ● based on Apache Solr and VuFind ● robust metadata management system, processing millions of data records each day ● integrates more than 50 data sources https://finc.info
  12. 12. finc Community ● provides more than 15 university libraries with resource discovery systems ● offers great potential to design and implement user oriented functions on real world systems, serving thousands of library users in Saxony and beyond, every day ● employs the aggregated index at Leipzig University Library https://finc.info
  13. 13. aggregated index at Leipzig University Library ● 10% physical items ● 90% electronic content on the net
  14. 14. aggregated index at Leipzig University Library ● 12 million traditional data records (growing) ● 80 million electronic article data records (growing) ● each record contains 20 data fields ● 1.8 billion triples (if you triplify it) (without any enrichment data)
  15. 15. Data processing today ● distributed data storage ○ 2 Solr in Leipzig (~12 million + ~80 million records) ○ 2 Solr in Dresden (~2 million + ~2 million records) ● constraint: each data source is handled separately → difficult to build up relations and deep data integration
  16. 16. d:swarm ● yet another tool…? a. property graph database b. GUI for library staff
  17. 17. Tools ● finc ○ focus: data normalization ○ technology: script-based transformations (python, go, ElasticSearch) ○ status: works fine with ~100 million records (less than one day) ● d:swarm ○ focus: data integration and enrichment ○ technology: encapsulates metafacture (open source toolchain for metadata transformation), Property Graph (Neo4j) ○ status: scalability issues (~4 million records in less than one day)
  18. 18. integrating finc with d:swarm ● enhance data processing regarding ○ authority data linking (NLP) ○ fuzzy deduplication ○ classification ○ relate bibliographic data to places, topics, abstract terms ○ publish machine readable data (linked data) ● create user interfaces to enable system librarians to control metadata processing
  19. 19. Tomorrow: common workflows ● All data flows through both tools (finc + d:swarm) ● Deduplication (duplicates are easier to recognize in a graph DB) ● FRBRization (aggregate different physical and formal versions of a work) ● Knowledge graph makes enrichment (authorities, altmetrics data, usage data, …) and analytics easier
  20. 20. Scalability issues ● current implementation of property graph is too slow ● test results with 64GB RAM, SSD, 16 cores ○ 1.2 million records (flat format): 10 hours for complete workflow (ingest, transformation, export) ○ more complex formats (MARC21): up to 5x as many statements ● single Neo4j instance, storage and memory issues
  21. 21. d:swarm architecture
  22. 22. Possible solutions? ● “throw hardware at it” ● Another graphDB, parallelization? ○ ArangoDB: https://www.arangodb.com ○ Apache Giraph: http://giraph.apache.org ○ Blazegraph: http://blazegraph.com (Wikidata’s choice) ● Gradoop?!
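
The triple estimate on slide 14 can be reproduced with a short sketch. It assumes the simplest possible triplification, one (subject, predicate, object) triple per populated field; the `triplify` helper, the record identifier, and the sample values are hypothetical and only illustrate the idea, they are not part of finc or d:swarm.

```python
from typing import Dict, Iterator, Tuple

Triple = Tuple[str, str, str]


def triplify(record_id: str, record: Dict[str, str]) -> Iterator[Triple]:
    """Yield one (subject, predicate, object) triple per non-empty field."""
    for field, value in record.items():
        if value:
            yield (record_id, field, value)


# Made-up record using the field names from slide 3; the empty isbn is skipped.
record = {
    "title": "Example title",
    "author": "Example author",
    "isbn": "",
    "publisher": "Example publisher",
    "year": "2015",
}
triples = list(triplify("rec:1", record))

# Figures taken from slide 14.
TRADITIONAL_RECORDS = 12_000_000
ARTICLE_RECORDS = 80_000_000
FIELDS_PER_RECORD = 20

estimated_triples = (TRADITIONAL_RECORDS + ARTICLE_RECORDS) * FIELDS_PER_RECORD
print(len(triples), estimated_triples)
```

With 92 million records and 20 fields each, the product is 1.84 billion, which matches the rounded 1.8 billion on the slide. Enrichment data would add further triples on top of this.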
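
The test result on slide 20 can be put in relation to the size of the aggregated index on slide 14. The following back-of-the-envelope calculation assumes that processing time scales linearly with the number of records, which is an optimistic assumption given the storage and memory issues reported for the single Neo4j instance.

```python
# Figures from slide 20 (flat format test) and slide 14 (aggregated index).
RECORDS_TESTED = 1_200_000
HOURS_TESTED = 10
INDEX_RECORDS = 12_000_000 + 80_000_000

# Observed throughput of the complete workflow (ingest, transformation, export).
records_per_hour = RECORDS_TESTED / HOURS_TESTED

# Linear extrapolation to the full aggregated index (assumption, see above).
hours_for_index = INDEX_RECORDS / records_per_hour
days_for_index = hours_for_index / 24

print(f"{records_per_hour:.0f} records/hour, about {days_for_index:.1f} days for the full index")
```

Under this assumption a full run over the Leipzig index would take roughly a month, compared with less than one day for ~100 million records in finc (slide 17). More complex formats such as MARC21 would widen the gap further.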
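
Slides 18 and 19 name fuzzy deduplication as a goal without describing a method. As a minimal sketch of what such a step might look like, the following uses Python's standard `difflib` to compare normalized titles. The normalization rules, the 0.9 threshold, the greedy clustering, and the sample records are all assumptions for illustration, not the approach used in finc or d:swarm.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, strip diacritics and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def is_duplicate(a, b, threshold=0.9):
    """Same normalized author and sufficiently similar normalized title."""
    if normalize(a["author"]) != normalize(b["author"]):
        return False
    ratio = SequenceMatcher(None, normalize(a["title"]), normalize(b["title"])).ratio()
    return ratio >= threshold


def cluster(records, threshold=0.9):
    """Greedily assign each record to the first cluster whose head it matches."""
    clusters = []
    for rec in records:
        for group in clusters:
            if is_duplicate(group[0], rec, threshold):
                group.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters


# Hypothetical records from different data sources.
records = [
    {"id": "a", "author": "Müller, Anna", "title": "Data Hubs in Libraries"},
    {"id": "b", "author": "Muller, Anna", "title": "Data hubs in libraries."},
    {"id": "c", "author": "Schmidt, Jan", "title": "Graph Databases"},
]
groups = cluster(records)
print([[r["id"] for r in g] for g in groups])
```

Comparing every record against every cluster does not scale to tens of millions of records; a real pipeline would first narrow the candidates, for example by a blocking key. The sketch also needs records from all sources in one place, which is the constraint slide 15 describes for the current per-source processing.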