Academic talk, on Institutional Repository Single Sources of Truth, given to The University of Zambia [1] CSC 5741 postgraduate students [2].
[1] https://www.unza.zm
[2] http://lis.unza.zm/~lightonphiri/teaching/unza/2020/csc5741
1. Institutional Repository
Single Sources of Truth
Lighton Phiri <lighton.phiri@unza.zm>
DataLab Research Group
Department of Library & Information Science
University of Zambia
http://lis.unza.zm/~lightonphiri
Mining Descriptive Metadata from
Electronic Theses and Dissertations
Digital Object Bitstreams
2. June 9, 2020
About The DataLab Research Group at The
University of Zambia
● The DataLab research group at
The University of Zambia is
composed of faculty staff and
students—undergraduate and
postgraduate—working in
three main areas
○ Data Mining
○ Digital Libraries
○ Technology-Enhanced Learning
http://datalab.unza.zm
3.
Outline
● Definitions and Key Concepts
● Contextualisation
● The Problem
● The Bigger Picture
● What We Think Can Work
● Past and Current Projects
● Data-Driven Ingestion of ETDs
● Conclusion and Future Work
6.
Digital Objects are Composed of an
Identifier, Bitstream and Metadata
● Identifier provides local and global uniqueness to objects
● Metadata provides auxiliary information about objects
● Bitstreams make up the content associated with the objects
http://www.dlib.org/dlib/July95/07arms.html
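The three-part composition above can be sketched as a small data structure. This is only an illustration: the class name `DigitalObject`, its field names and the example values are assumptions, not taken from any IR platform.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DigitalObject:
    """A digital object: identifier + metadata + bitstreams."""
    identifier: str                                            # unique locally and globally
    metadata: Dict[str, List[str]] = field(default_factory=dict)  # auxiliary description
    bitstreams: List[bytes] = field(default_factory=list)      # the content itself


# Hypothetical example values, for illustration only.
etd = DigitalObject(
    identifier="123456789/0000",
    metadata={"title": ["An Example Thesis"], "type": ["Thesis"]},
    bitstreams=[b"%PDF-1.4 example bytes"],
)
print(etd.identifier, sorted(etd.metadata), len(etd.bitstreams))
```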
7.
Ingestion of Digital Objects into IRs Involves
Numerous Moving Parts
● Digital object bitstreams
need to be verified
● Digital object metadata
needs to be prepared
○ Preparing metadata is time-consuming
and error-prone
since data needs to be
encoded using specific
metadata schemes, e.g.
Dublin Core or ETD-ms
● Multi-step IR workflows
http://open.uct.ac.za/handle/11427/29435
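To give a feel for what "encoding metadata using a specific scheme" involves, here is a minimal sketch that serialises a record as unqualified Dublin Core inside the `oai_dc` container. The helper name `to_dublin_core` and the record values are hypothetical; only the two namespace URIs are standard.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)


def to_dublin_core(fields):
    """Encode {element: [values]} as an oai_dc XML string."""
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for element, values in fields.items():
        for value in values:
            child = ET.SubElement(root, f"{{{DC}}}{element}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical record, for illustration only.
record = {
    "title": ["An Example Thesis"],
    "creator": ["Doe, J."],
    "type": ["Thesis"],
    "date": ["2020"],
}
xml_text = to_dublin_core(record)
print(xml_text)
```

Even this toy record needs the right element names and namespaces, which hints at why manual preparation is error-prone.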
8.
Supervised Machine Learning 101: Labeled
Data is Used to Train a Prediction Model
Gmail Spam Detection
● Model implemented by training estimators using
labelled data
● f(x)—Spam/Not Spam classification
○ x—Email address, subject, email content
○ Labels—Mailbox user tagging; Gmail automatic detection
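The spam example can be sketched with a tiny Naive Bayes classifier written in plain Python. This is not how Gmail works; it is a toy illustration of learning f(x) from labelled data, and the four training emails are made up.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train(examples):
    """examples: list of (text, label). Returns per-label word counts and label counts."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, label_counts


def predict(model, text):
    """f(x): return the label with the highest log-probability."""
    word_counts, label_counts = model
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        n_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total)
        for word in tokenize(text):
            # Laplace (add-one) smoothing so unseen words do not zero the score.
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up labelled data, standing in for user tagging.
labelled = [
    ("win a free prize now", "spam"),
    ("free money claim now", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("thesis draft for review", "not spam"),
]
model = train(labelled)
print(predict(model, "claim your free prize"))
print(predict(model, "agenda for thesis review"))
```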
10.
The Online Visibility of Research Output
Influences Global Institution Rankings (1/2)
http://www.webometrics.info
12.
The Online Visibility of Research Output
Influences Global Institution Rankings (2/2)
http://www.webometrics.info
17.
Google Scholar Extraction: The University of
Zambia 2018 Scholarly Output Distribution
● 523 publications against
~854 faculty staff
● ONLY 476 (55.7%) faculty
staff with online
publications
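The percentage above follows directly from the two headcounts. A quick check, using only the figures on the slide (the variable names are illustrative):

```python
faculty_staff = 854        # approximate headcount from the slide
publications = 523         # 2018 publications found via Google Scholar
staff_with_output = 476    # staff with online publications

coverage = staff_with_output / faculty_staff
per_capita = publications / faculty_staff

print(f"{coverage:.1%} of staff have online publications")
print(f"{per_capita:.2f} publications per staff member")
```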
20.
Google Scholar Extraction: The University of
Zambia 2018 Scholarly Output Impact
30% of faculty
staff research
with ZERO
impact
22.
Challenges With Ingestion of “ETDs” into
The UNZA IR
● Ingestion of ETDs is not timely
● Huge time gaps between
submission and eventual
ingestion of ETDs
● ETDs have missing and
incorrect descriptive
metadata
Phiri, L. (2018)
“Towards Increased Online Visibility of Scholarly Research Output in Zambia”.
URL: http://lis.unza.zm/archive/handle/123456789/227
24.
Towards a Zambia National Electronic
Theses and Dissertations Portal (1/2)
● Work is underway to
deploy a national ETD
portal
○ IRs implement open
standards and
protocols that
facilitate harvesting
of digital objects
● Stakeholders
required to
harmonise curation
http://lis.unza.zm/portal
25.
Towards a Zambia National Electronic
Theses and Dissertations Portal (2/2)
http://www.hea.org.zm
● Ideally the portal is aimed at providing central access to all
ETDs produced by HEIs in Zambia
○ Source of knowledge for not-for-profit organisations and policy
makers
○ Source of knowledge for taxpayers—entities that fund research
27.
A Multi-Faceted Multi-Stakeholder Approach
for Increased Visibility of ETDs (1/2)
● Experimentation with various
approaches and techniques
○ System-driven approaches
○ Data-driven approaches
○ People-driven approaches
● Involve key stakeholders in
order to mainstream the
implementation of IRs in
Zambian HEIs
○ Knowledge exchange with HEIs
in Zambia
28.
A Multi-Faceted Multi-Stakeholder Approach
for Increased Visibility of ETDs (2/2)
30.
Subject Specific Institutional Repositories as
Potential Solutions for Delayed Ingestions
● Is it feasible to use subject
institutional repositories at
The UNZA?
○ Subject IR setup in the
Department of Library and
Information Science
○ Subject IR scored a mean SUS
score of 65.5 (“Okay” to “Good”)
Mbewe, M., Moonga, H., Mwewa, M., Sikazindu, N., and Mwale, N. (2018)
“Investigating the Feasibility of Using Subject IRs: A Case Study at The UNZA”.
URL: http://lis.unza.zm/archive/handle/123456789/189
http://lis.unza.zm/archive
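For readers unfamiliar with the System Usability Scale (SUS), the standard scoring rule can be sketched as follows: odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is multiplied by 2.5 to give a score out of 100. The participant responses below are hypothetical and are not the study data.

```python
def sus_score(responses):
    """Score one 10-item SUS questionnaire (each response is 1 to 5)."""
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten responses in the range 1-5")
    total = 0
    for item, r in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (r - 1) if item % 2 == 1 else (5 - r)
    return total * 2.5


def mean_sus(all_responses):
    scores = [sus_score(r) for r in all_responses]
    return sum(scores) / len(scores)


# Hypothetical participants, for illustration only.
participants = [
    [4, 2, 4, 2, 4, 2, 4, 2, 4, 2],
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
]
print(mean_sus(participants))
```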
31.
Identification of Subject Controlled
Vocabularies Should Involve Stakeholders
● What is the effect of
integrating controlled
vocabulary sets in IRs?
○ Sandbox IR integrated with
LCSH vocabulary set
○ Mean SUS score computed
○ 68.6—With Vocabularies
○ 65.1—No Vocabularies
Chipangila, B., Liswaniso, E., Mawila, A., Mwanza, P., and Nawila, D. (2019)
“Effectiveness of Integrating Controlled Subject Vocabulary Sets in the UNZA IR”.
URL: http://lis.unza.zm/archive/handle/123456789/2236
32.
Effective Workflows for Ingestion of
Electronic Theses and Dissertations
● What workflows would
be effective at reducing
the turnaround time
between submission and
ingestion of ETDs?
○ Stakeholders involved in
submission workflow of
ETDs
Banda, A. (2019--Present)
“Investigating the Workflows Involved in the Ingestion of ETD into IRs”.
Work-in-Progress
33.
Improved Electronic Theses and Dissertation
Metadata Quality
● How can high-quality ETD
metadata be ingested into
IRs?
● How can submission
workflows be modified to
facilitate ingestion of ETD-ms
encoded metadata?
Banda, M., Chinyama, A., Maambo, H., Mulomba, H., and Mwitwa, K. (2020--Present)
“Improved Electronic Theses and Dissertation Metadata Quality”.
Work-in-Progress
http://www.ndltd.org/standards/metadata
35.
Timely and Consistent Ingestion of ETDs Can
be Facilitated by Automatic Classification
● Implementation of classification models to
automatically classify IR digital objects
using the minimum possible input from
graduate students: “The ETD Manuscript”
○ The ETD manuscript bitstream is considered
the “single source of truth”
○ Metadata prepared by staff who work with the IR
potentially has inconsistencies
Phiri, L. (in press)
“Automatic Classification of Digital Objects for Improved Metadata Quality of ETDs”.
International Journal of Metadata, Semantics and Ontologies
36.
Three Classification Models Implemented for
Classifying ETD Type, Subject and Collection
● Text features extracted from a set of core
bitstream portions—ETD Title, ETD
Abstract, ETD Title Page and ETD pages—to
classify ETD manuscripts
ETD Type
ETD Subjects
IR Collection
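One common way to turn manuscript text into classifier features is TF-IDF weighting. The sketch below is a plain-Python illustration of that general technique; it is an assumption that something similar applies here, and the two title-page strings are invented.

```python
import math
import re
from collections import Counter


def tokens(text):
    return re.findall(r"[a-z]+", text.lower())


def tfidf(documents):
    """Return one {term: weight} dict per document."""
    tokenised = [tokens(d) for d in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in tokenised for term in set(doc))
    vectors = []
    for doc in tokenised:
        counts = Counter(doc)
        size = max(len(doc), 1)  # guard against empty documents
        vectors.append({
            term: (count / size) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Invented title-page text, for illustration only.
title_pages = [
    "A dissertation submitted for the degree of Master of Science",
    "A thesis submitted for the degree of Doctor of Philosophy",
]
vectors = tfidf(title_pages)
print(sorted(t for t, w in vectors[0].items() if w > 0))
```

Terms shared by every document receive zero weight, so only the distinguishing words remain as signals.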
37.
Text Features are Extracted from Single
Source of Truth—Bitstream
● Textual content mined
from PDF manuscripts
○ Cover/title pages
○ Preliminary pages
● Textual content mined
from metadata for
training
● PDF document metadata
● Curated datasets from
external repositories
38.
Text on Title Pages Exhibits Characteristics
that Enable ETD Classification
39.
Descriptive Dublin Core Encoded Metadata
Used for Training Classifiers
40.
PDF Metadata Provides Auxiliary Metadata
Information Such as Total Pages
41.
Metadata was Harvested Using OAI-PMH,
Bitstreams were Harvested Using OAI-ORE
● OAI-PMH used to
harvest all ETD
descriptive metadata
elements
● OAI-ORE used to
harvest all ETD PDF
documents
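The metadata half of the harvesting step can be sketched as follows. The `ListRecords` verb, the `metadataPrefix` and `set` parameters, the `resumptionToken` element and the namespaces are standard OAI-PMH; the base URL, helper names and sample response are hypothetical, and the OAI-ORE bitstream step is not covered.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def parse_titles(response_xml):
    """Return (titles, resumption_token) from a ListRecords response."""
    root = ET.fromstring(response_xml.strip())
    titles = [t.text for t in root.iterfind(".//dc:title", NS)]
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return titles, token


# Hypothetical response, trimmed to the parts used here.
SAMPLE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:123456789/1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Thesis</dc:title>
          <dc:type>Thesis</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""

print(list_records_url("https://repository.example.org/oai/request"))
print(parse_titles(SAMPLE))
```

A non-empty resumption token means more records remain; a harvester would repeat the request with that token until it comes back empty.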
42.
The Models are Accurate Enough to be
Integrated with the IR
● ETD Type: 98.1%
● ETD Collection: 81.1%
● ETD Subjects: 81.7%
● The models would still
need to be
incorporated into an
application that
requires “some”
human intervention
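One possible way to build in that human intervention is to accept only high-confidence predictions automatically and queue the rest for review. The sketch below assumes each model returns a confidence value; the function name `triage`, the 0.8 threshold and the sample predictions are all hypothetical.

```python
def triage(predictions, threshold=0.8):
    """Split (item_id, label, confidence) tuples into auto-accepted and needs-review lists."""
    accepted, review = [], []
    for item_id, label, confidence in predictions:
        target = accepted if confidence >= threshold else review
        target.append((item_id, label))
    return accepted, review


# Hypothetical model outputs, for illustration only.
predictions = [
    ("etd-001", "Thesis", 0.97),
    ("etd-002", "Dissertation", 0.55),
    ("etd-003", "Thesis", 0.80),
]
accepted, review = triage(predictions)
print("auto-accepted:", accepted)
print("needs review:", review)
```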
43.
Models are Deployed as Flask-Based API
Endpoints (1/2)
https://github.com/lightonphiri/etd_autoclassifier
44.
Models are Deployed as Flask-Based API
Endpoints (2/2)
https://datalab-apis.herokuapp.com/api/collection
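To show the general shape of such an endpoint without extra dependencies, the sketch below uses a plain WSGI callable rather than Flask (Flask applications implement the same WSGI interface). Only the `/api/collection` path comes from the URL above; the JSON request and response shapes and the keyword rule standing in for the trained model are assumptions.

```python
import io
import json


def classify_collection(text):
    """Placeholder for a trained model: a hypothetical keyword rule."""
    return "Theses and Dissertations" if "thesis" in text.lower() else "Other"


def app(environ, start_response):
    """WSGI application answering POST /api/collection with a JSON prediction."""
    is_match = (environ.get("PATH_INFO") == "/api/collection"
                and environ.get("REQUEST_METHOD") == "POST")
    if not is_match:
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [json.dumps({"error": "not found"}).encode("utf-8")]
    length = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
    body = json.dumps({"collection": classify_collection(payload.get("text", ""))})
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body.encode("utf-8")]


def call(path, payload):
    """Invoke the app in-process, without starting a server."""
    raw = json.dumps(payload).encode("utf-8")
    environ = {
        "REQUEST_METHOD": "POST",
        "PATH_INFO": path,
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": io.BytesIO(raw),
    }
    status = []
    chunks = app(environ, lambda s, headers: status.append(s))
    return status[0], json.loads(b"".join(chunks))


print(call("/api/collection", {"text": "A thesis submitted to ..."}))
```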
46.
Formatting Guidelines for ETD Metadata
Harvested by National ETD Portal
http://lis.unza.zm/portal
48.
Effective Workflows for Ingestion of
Electronic Theses and Dissertations
● What is the feasibility of implementing
machine learning models for
automatically classifying IR objects?
○ This work goes beyond ETDs to include
other digital object types
● How can downstream services be
integrated with the models?
M’sendo, R. (2019--Present)
“Multi-Faceted Automatic Classification of Institutional Repository Digital Objects”.
Work-in-Progress
49.
Online Visibility of Research can Potentially
Change the Global Reputation of HEIs
http://www.webometrics.info
50.
Q & A Session
● Comments, concerns and complaints?
51.
Bibliography
[1] Phiri, L. (2018). Research Visibility in the Global South: Towards
Increased Online Visibility of Scholarly Research Output in
Zambia. IEEE International Conference in Information and
Communication Technologies.
[2] Phiri, L. (2020). A Multi-Faceted Multi-Stakeholder Approach for
Increased Visibility of ETDs in Zambia. Cadernos BAD, (1).
https://doi.org/10.1017/S0269888910000032
[3] Arms, W. Y. (1995). Key concepts in the architecture of the digital
library. D-Lib Magazine, 1(1).