Identifying the research topics that best describe the scope of a scientific publication is a crucial task for editors, in particular because the quality of these annotations determines how effectively users are able to discover the right content in online libraries. For this reason, Springer Nature, the world's largest academic book publisher, has traditionally entrusted this task to its most expert editors. These editors manually analyse all new books, possibly including hundreds of chapters, and produce a list of the most relevant topics. Hence, this process has traditionally been very expensive, time-consuming, and confined to a few senior editors. For these reasons, back in 2016 we developed Smart Topic Miner (STM), an ontology-driven application that assists the Springer Nature editorial team in annotating the volumes of all books covering conference proceedings in Computer Science. Since then STM has been regularly used by editors in Germany, China, Brazil, India, and Japan, for a total of about 800 volumes per year. Over the past three years the initial prototype has iteratively evolved in response to feedback from the users and evolving requirements. In this paper we present the most recent version of the tool and describe the evolution of the system over the years, the key lessons learnt, and the impact on the Springer Nature workflow. In particular, our solution has drastically reduced the time needed to annotate proceedings and significantly improved their discoverability, resulting in 9.3 million additional downloads. We also present a user study involving 9 editors, which yielded excellent results in terms of usability, and report an evaluation of the new topic classifier used by STM, which outperforms previous versions in recall and F-measure.
ISWC 2019 - Improving Editorial Workflow and Metadata Quality at Springer Nature
1. Improving Editorial Workflow and Metadata Quality at Springer Nature
Angelo Salatino (1), Francesco Osborne (1), Aliaksandr Birukou (2), Enrico Motta (1)
(1) Knowledge Media Institute, The Open University, United Kingdom
(2) Springer Nature, Heidelberg, Germany
ISWC 2019
2. Open University and Springer Nature Collaboration
The Open University and Springer Nature have been collaborating since 2014 in
the development of an array of semantically-enhanced solutions for:
• Semi-automatic classification of proceedings
and other editorial products.
• Automatic selection of the most appropriate
books, journals, and proceedings to market at a
scientific event.
• Analysis of SN codes, with the aim of evolving
marked codes and detecting fields that deserve
further attention.
• Joint release of the Computer Science Ontology.
Osborne et al. (2017) Supporting Springer Nature Editors by means of Semantic Technologies. ISWC 2017. Vienna, Austria.
3. Generation of Metadata
It is a crucial task to enable scholars, students, companies and other stakeholders to
discover and access this knowledge.
Traditionally, editors choose a list of related
keywords and categories in relevant taxonomies
according to:
• their own experience of similar conferences;
• a visual exploration of titles and abstracts;
• a list of terms given by the curators or derived
from calls for papers.
4. Classification of Publications – A Complex Problem
Classifying publications manually presents a number of issues for
a large publisher such as Springer Nature.
• It is a complex process that requires expert editors
• It is a time-consuming process that can hardly scale
• It is easy to miss the emergence of new topics
• It is easy to assume that some traditional topics are still
popular when this is no longer the case
• The keywords used in the call for papers are often a reflection
of what a venue aspires to be, rather than the real contents of
the proceedings.
6. Smart Topic Miner 1.0 - 2016
Presented at ISWC 2016
Osborne, F., Salatino, A., Birukou, A. and Motta,
E.: Automatic Classification of Springer Nature
Proceedings with Smart Topic Miner. ISWC 2016
7. A success story
• Since 2016 STM has been regularly used by editors in Germany,
China, Brazil, India, and Japan.
• It is used to classify more than 800 conference proceedings
volumes per year, including the Lecture Notes in Computer Science
(LNCS) as well as LNBIP, CCIS, IFIP-AICT, LNICST.
• It completely changed the SN internal workflow: now the task is semi-
automatic and monitored by junior editors.
• It is constantly evolving and including new functionalities,
following the feedback from the editorial team.
11. Business Value
• STM halves the time needed for classifying proceedings from
30 to 15 minutes.
• It also allows junior editors to work on the classification of
proceedings, distributing the load and reducing costs.
• The adoption of a controlled vocabulary makes the process
more robust and facilitates the identification of related
editorial products.
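As a rough back-of-the-envelope illustration of the first point, the figures quoted in these slides (about 800 volumes per year, 30 versus 15 minutes per volume) can be combined as follows. This is only a sketch: it assumes every volume takes exactly the quoted time, and the function name is hypothetical.

```python
def yearly_hours_saved(volumes_per_year, minutes_before, minutes_after):
    """Return editor hours saved per year, assuming a constant time per volume."""
    minutes_saved = volumes_per_year * (minutes_before - minutes_after)
    return minutes_saved / 60.0

# Figures quoted in the slides: ~800 volumes/year, 30 -> 15 minutes per volume.
print(yearly_hours_saved(800, 30, 15))  # 200.0
```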
12. Retrievability
About 9M additional downloads thanks to STM.
[Figure: Average number of yearly downloads for books in SpringerLink, 2009-2018. Series: downloads (CS Proceedings); expected downloads (CS Proceedings); downloads (CS Proceedings) with STM; downloads (other books in CS); downloads (overall).]
14. Smart Topic Miner 2.0 - 2019
• New GUI.
• New Knowledge Base (CSO).
• New Topic Detection Engine
(CSO Classifier).
• Ability to compare with
previous editions.
• Integrated with SN system
and CSO Portal.
http://stm-demo.kmi.open.ac.uk
15. STM 2.0 - architecture
[Diagram: STM 2.0 architecture. Components shown: SN Editors; HTML GUI; Parser; STM Engine; Generate Visualizations. Knowledge sources shown: CSO, SNCs, Historical Data, word2vec model. Processing steps shown: i) CSO Classifier, ii) Topic Explanation, iii) Taxonomy Generation, iv) SN Tags Inference, v) Previous Classification.]
16. A new knowledge base - The Computer Science Ontology
The Computer Science Ontology (CSO) is a large-scale, automatically generated
ontology of research areas. It is the largest ontology in the field of Computer Science,
including about 14K topics and 162K semantic relationships.
Salatino et al (2019) The Computer Science Ontology: A Comprehensive Automatically-Generated Taxonomy of Research Areas. Data Intelligence.
http://cso.kmi.open.ac.uk/
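To make the idea of an ontology of research areas more concrete, here is a minimal sketch of how topics and broader/narrower relationships could be held in memory and traversed. The three triples and the helper names are illustrative assumptions, not data or code taken from CSO; only the relation name is modelled on CSO's superTopicOf.

```python
from collections import defaultdict

# Hypothetical (parent, relation, child) triples, for illustration only.
TRIPLES = [
    ("computer science", "superTopicOf", "artificial intelligence"),
    ("artificial intelligence", "superTopicOf", "machine learning"),
    ("machine learning", "superTopicOf", "neural networks"),
]

def build_parent_index(triples):
    """Map each topic to the set of its direct broader topics."""
    parents = defaultdict(set)
    for parent, relation, child in triples:
        if relation == "superTopicOf":
            parents[child].add(parent)
    return parents

def broader_topics(topic, parents):
    """Return every ancestor of `topic`, following superTopicOf links upwards."""
    seen = set()
    stack = [topic]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

parents = build_parent_index(TRIPLES)
print(sorted(broader_topics("neural networks", parents)))
# ['artificial intelligence', 'computer science', 'machine learning']
```

A traversal of this kind is one way a specific topic found in a paper could be rolled up to the broader areas it belongs to.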
17. A new topic detection engine - The CSO Classifier
The CSO Classifier is an unsupervised approach for automatically classifying documents
according to CSO.
Salatino et al. (2019) The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly Articles.
Demo: https://cso.kmi.open.ac.uk/classify/
Download: https://github.com/angelosalatino/cso-classifier
pip install cso-classifier
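The slide does not describe how the classifier works internally. Purely as an illustration of what ontology-driven, unsupervised topic detection can look like, the sketch below matches word n-grams from a text against a handful of topic labels. The label set, the function name `match_topics`, and the exact-match strategy are all assumptions made for this example; this is not the CSO Classifier's implementation, and it does not use the `cso-classifier` package.

```python
import re

# Hypothetical topic labels; the real ontology contains about 14K topics.
TOPIC_LABELS = {"semantic web", "ontology", "machine learning", "topic detection"}

def match_topics(text, labels, max_n=3):
    """Return the labels that appear verbatim as word n-grams in `text`."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    found = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in labels:
                found.add(candidate)
    return found

abstract = "We use an ontology and machine learning for topic detection in scholarly articles."
print(sorted(match_topics(abstract, TOPIC_LABELS)))
# ['machine learning', 'ontology', 'topic detection']
```

No training data is needed here because the labels come from the vocabulary itself, which is the sense in which such an approach is unsupervised.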
19. Evaluation - Performance
Classifier             Description                           Prec.   Rec.    F1
TF-IDF                 TF-IDF                                16.7%   24.0%   19.7%
TF-IDF-M               TF-IDF mapped to CSO concepts.        40.4%   24.1%   30.1%
LDA100                 LDA with 100 topics.                   5.9%   11.9%    7.9%
LDA500                 LDA with 500 topics.                   4.2%   12.5%    6.3%
LDA1000                LDA with 1000 topics.                  3.8%    5.0%    4.3%
LDA100-M               LDA with 100 topics mapped to CSO.     9.4%   19.3%   12.6%
LDA500-M               LDA with 500 topics mapped to CSO.     9.6%   21.2%   13.2%
LDA1000-M              LDA with 1000 topics mapped to CSO.   12.0%   11.5%   11.7%
W2V-W                  W2V on windows of words.              41.2%   16.7%   23.8%
STM - 2016             Classifier used by STM 1.0.           80.8%   58.2%   67.6%
STM - 2017 (CSO-SYN)   CSO Classifier - Syntactic module.    78.3%   63.8%   70.3%
CSO-SEM                CSO Classifier - Semantic module.     70.8%   72.2%   71.5%
STM - 2019 (CSO-C)     The CSO Classifier.                   73.0%   75.3%   74.1%
Computed on a gold standard (GS) of 70 publications, each annotated by 3 researchers.
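The three score columns are related by the usual definitions: precision is the share of predicted topics that are in the gold standard, recall is the share of gold-standard topics that were predicted, and F1 is their harmonic mean. The sketch below, with hypothetical topic sets and function names, computes these for a single paper and checks that the published F1 of the CSO Classifier follows from its published precision and recall. How the scores were aggregated across the 70 publications is not stated on the slide, so that part is not modelled.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 for one document."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)

# Hypothetical topics for one paper, for illustration only.
predicted = {"semantic web", "ontology", "machine learning", "databases"}
gold = {"semantic web", "ontology", "digital libraries"}
p, r, f = precision_recall_f1(predicted, gold)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.5 0.667 0.571

# Consistency check on the STM 2019 (CSO-C) row: P = 73.0%, R = 75.3%.
print(round(f1_score(73.0, 75.3), 1))  # 74.1
```

Because the published precision and recall are themselves rounded, recomputing F1 from them can differ from the table by 0.1 in some rows.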
20. Evaluation - Usability
System     SUS score   Grade   Percentile
STM 2016   76.6        B       80%
STM 2019   82.8        A       93%
[Figure: SUS score per editor (Editors 1-9), on a 0-100 scale.]
[Figure: SUS categories per editor (Editors 1-9), on a 0-5 scale: Want to use frequently, Easy to use, Easy to learn, Too complex.]
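For readers unfamiliar with the System Usability Scale (SUS), each respondent answers ten statements on a 1-5 scale; odd-numbered (positively worded) items contribute the answer minus 1, even-numbered (negatively worded) items contribute 5 minus the answer, and the sum is multiplied by 2.5 to give a score between 0 and 100. The sketch below implements this standard scoring; the function name is hypothetical and the sample answers are invented for illustration, not the editors' responses.

```python
def sus_score(answers):
    """Compute the SUS score (0-100) from ten answers on a 1-5 scale."""
    if len(answers) != 10 or any(a < 1 or a > 5 for a in answers):
        raise ValueError("SUS needs ten answers between 1 and 5")
    total = 0
    for index, answer in enumerate(answers, start=1):
        if index % 2 == 1:
            total += answer - 1   # odd items are positively worded
        else:
            total += 5 - answer   # even items are negatively worded
    return total * 2.5

# Invented answers from one hypothetical respondent.
print(sus_score([5, 2, 4, 1, 4, 2, 5, 1, 4, 2]))  # 85.0
```

A system-level score such as those in the table above is typically the mean of the per-respondent scores.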
21. Conclusion and Future Work
• “A little semantics goes a long way”
• Semantic explainability is crucial in this domain
• We are working on an application that will support authors in
annotating their own papers.
• Typing of scientific entities: approaches, tasks, domains,
resources.
• Automatic extraction of Scientific Knowledge Graph.
Smart Topic Miner (STM) is the system that we created to assist the Springer Nature editorial team in classifying scholarly publications in the field of Computer Science. It takes as input one or more books and returns a representation of their research topics, a description of each chapter, and an explanation for each inferred topic.
STM has been used by Springer Nature since January 2017 to annotate several book series in Computer Science (e.g., LNCS) for a total of about 800 volumes each year. During this period, the adoption of STM has halved the time needed for classifying proceedings and allowed a more robust and comprehensive representation of the research areas in the Springer Nature catalogue.
In the scholarly domain, ontologies are often used to facilitate the integration of large datasets of research data, the exploration of the academic landscape, information extraction from scientific articles, and so on.
In January 2019, KMi released, in conjunction with Springer Nature, the Computer Science Ontology (CSO), which is the largest taxonomy of research areas in the field. This resource was automatically generated by mining a dataset of 16M publications and using a combination of machine learning and semantic technologies to extract 14K research topics and 162K semantic relationships. CSO includes a much larger number of research topics than the alternatives (e.g., ACM Classification), enabling a very granular characterisation of the content of research papers, and it can be easily updated by running our ontology learning approach on recent corpora of publications. It has attracted the attention of several institutions and companies, such as Digital Science, Elsevier, and ACM, interested in adopting CSO for characterising their datasets of research publications.
We are currently developing a similar ontology in the field of Engineering and we plan to apply our technology to several other fields (Biomedical, Economics).
Classifying research papers according to their research topics is an important task to improve their retrievability, assist the creation of smart analytics, and support a variety of approaches for analysing and making sense of the research environment.
The CSO Classifier is an application for automatically classifying research papers according to CSO. We are currently using it to enrich the description of 150K publications in the Springer Nature online library. We have also started a collaboration with Digital Science, the creators of Dimensions, with the aim of automatically annotating their dataset of scholarly data.
The resulting characterization of research papers can be used for supporting tasks such as identifying research communities, forecasting research trends, detecting relevant reviewers, and so on.