Institutional Repository
Single Sources of Truth
Lighton Phiri <lighton.phiri@unza.zm>
DataLab Research Group
Department of Library & Information Science
University of Zambia
http://lis.unza.zm/~lightonphiri
Mining Descriptive Metadata from
Electronic Theses and Dissertations
Digital Object Bitstreams
June 9, 2020
About The DataLab Research Group at The
University of Zambia
● The DataLab research group at
The University of Zambia is
composed of faculty staff and
students—undergraduate and
postgraduate—working in
three main areas
○ Data Mining
○ Digital Libraries
○ Technology-Enhanced Learning
http://datalab.unza.zm
Outline
● Definitions and Key Concepts
● Contextualisation
● The Problem
● The Bigger Picture
● What We Think Can Work
● Past and Current Projects
● Data-Driven Ingestion of ETDs
● Conclusion and Future Work
Institutional Repositories are a Specialised
Type of Digital Library Management Systems
Digital Objects are Composed of an
Identifier, Bitstream and Metadata
● Identifier provides local and global uniqueness to objects
● Metadata provides auxiliary information about objects
● Bitstreams make up the content associated with the objects
http://www.dlib.org/dlib/July95/07arms.html
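The three-part structure above can be sketched as a small data type. This is a minimal illustration only: the class name, field names and example values are assumptions, not part of any IR platform.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """A digital object: identifier + metadata + bitstreams."""
    identifier: str                                               # unique handle
    metadata: dict[str, list[str]] = field(default_factory=dict)  # descriptive fields
    bitstreams: list[bytes] = field(default_factory=list)         # raw content, e.g. a PDF

    def is_complete(self) -> bool:
        """True when all three components are present."""
        return bool(self.identifier and self.metadata and self.bitstreams)


# Hypothetical example values, for illustration only.
etd = DigitalObject(
    identifier="123456789/0000",
    metadata={"title": ["An Example Thesis"], "type": ["Thesis"]},
    bitstreams=[b"%PDF-1.4 ..."],
)
print(etd.is_complete())
```

An object that has an identifier but no metadata or bitstream would report `is_complete()` as false, which mirrors the checks an ingestion workflow has to make.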
Ingestion of Digital Objects into IRs Involves
Numerous Moving Parts
● Digital object bitstreams
need to be verified
● Digital object metadata
needs to be prepared
○ Preparing metadata is
time-consuming and error-prone
since data needs to be
encoded using specific
metadata schemes, e.g.
Dublin Core or ETD-ms
● Multi-step IR workflows
http://open.uct.ac.za/handle/11427/29435
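To make the encoding step concrete, here is a rough sketch that serialises a handful of descriptive fields as an unqualified Dublin Core record using only the Python standard library. The function name and the field values are hypothetical; the two namespace URIs are the standard Dublin Core and oai_dc ones.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"


def to_dublin_core(fields: dict[str, list[str]]) -> str:
    """Serialise {element: [values]} as an oai_dc XML record."""
    ET.register_namespace("dc", DC_NS)
    ET.register_namespace("oai_dc", OAI_DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for element, values in fields.items():
        for value in values:
            # One child element per value; Dublin Core elements are repeatable.
            child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


record = to_dublin_core({
    "title": ["An Example Thesis"],   # hypothetical values
    "creator": ["Doe, J."],
    "type": ["Thesis"],
    "date": ["2020"],
})
print(record)
```

Even for four fields, the element names, namespaces and repeatable values must all be right, which is part of why manual preparation is error-prone.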
Supervised Machine Learning 101: Labelled
Data is Used to Train a Prediction Model
Gmail Spam Detection
● Model implemented by training estimators using
labelled data
● f(x)—Spam/Not Spam classification
○ x—Email address, subject, email content
○ Labels—Mailbox user tagging; Gmail automatic detection
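As a toy version of the spam example, the sketch below trains a multinomial Naive Bayes classifier in plain Python. It is not how Gmail works; the algorithm choice, the function names and the four labelled messages are assumptions made purely to show f(x) being learned from labelled data.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Fit a multinomial Naive Bayes model from (text, label) pairs."""
    word_counts = defaultdict(Counter)   # label -> word frequencies
    label_counts = Counter()             # label -> number of examples
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior for the text."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)                        # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            # Add-one smoothing avoids log(0) for unseen words.
            score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


# Toy labelled data, invented for illustration.
labelled = [
    ("win a free prize now", "spam"),
    ("free money claim now", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("thesis draft attached for review", "not spam"),
]
model = train(labelled)
print(predict(model, "claim your free prize"))
```

The same pattern applies later in the deck: replace the email text with text mined from an ETD and the spam labels with ETD types, subjects or collections.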
The Online Visibility of Research Output
Influences Global Institution Rankings (1/2)
http://www.webometrics.info
The Online Visibility of Research Output
Influences Global Institution Rankings (2/2)
http://www.webometrics.info
Google Scholar Extraction: The University of
Zambia 2018 Scholarly Output Distribution
● 523 publications against
~854 faculty staff
● ONLY 476 (55.7%) faculty
staff with online
publications
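The percentage above follows directly from the two counts on the slide. A quick check, using only the slide's figures (the variable names are illustrative):

```python
faculty_staff = 854        # approximate headcount from the slide
staff_with_output = 476    # staff with at least one online publication
publications = 523         # publications found for 2018

share_visible = staff_with_output / faculty_staff
pubs_per_staff = publications / faculty_staff

print(f"{share_visible:.1%} of staff have online publications")
print(f"{pubs_per_staff:.2f} publications per staff member")
```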
Google Scholar Extraction: The University of
Zambia 2018 Scholarly Output Impact
30% of faculty
staff research
with ZERO
impact
Challenges with Ingestion of “ETDs” into
The UNZA IR
● Ingestion of ETDs is not timely
● Huge time gaps between
submission and eventual
ingestion of ETDs
● ETDs have missing and
incorrect descriptive
metadata
Phiri, L. (2018)
“Towards Increased Online Visibility of Scholarly Research Output in Zambia”.
URL: http://lis.unza.zm/archive/handle/123456789/227
Towards a Zambia National Electronic
Theses and Dissertations Portal (1/2)
● Work is underway to
deploy a national ETD
portal
○ IRs implement open
standards and
protocols that
facilitate harvesting
of digital objects
● Stakeholders
required to
harmonise curation
http://lis.unza.zm/portal
Towards a Zambia National Electronic
Theses and Dissertations Portal (2/2)
http://www.hea.org.zm
● Ideally the portal is aimed at providing central access to all
ETDs produced by HEIs in Zambia
○ Source of knowledge for not-for-profit organisations and policy
makers
○ Source of knowledge for taxpayers—entities that fund research
A Multi-Faceted Multi-Stakeholder Approach
for Increased Visibility of ETDs (1/2)
● Experimentation with various
approaches and techniques
○ System-driven approaches
○ Data-driven approaches
○ People-driven approaches
● Involve key stakeholders in
order to mainstream the
implementation of IRs in
Zambian HEIs
○ Knowledge exchange with HEIs
in Zambia
A Multi-Faceted Multi-Stakeholder Approach
for Increased Visibility of ETDs (2/2)
Subject Specific Institutional Repositories as
Potential Solutions for Delayed Ingestions
● Is it feasible to use subject
institutional repositories at
The UNZA?
○ Subject IR setup in the
Department of Library and
Information Science
○ Subject IR achieved a mean SUS
score of 65.5 (“Okay” to “Good”)
Mbewe, M., Moonga, H., Mwewa, M., Sikazindu, N., and Mwale, N. (2018)
“Investigating the Feasibility of Using Subject IRs: A Case Study at The UNZA”.
URL: http://lis.unza.zm/archive/handle/123456789/189
http://lis.unza.zm/archive
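For readers unfamiliar with the System Usability Scale (SUS), the sketch below applies the standard scoring rule: ten items rated 1 to 5, odd items contribute (response - 1), even items contribute (5 - response), and the sum is multiplied by 2.5. The participant responses are hypothetical and are not data from the study.

```python
def sus_score(responses):
    """Score one 10-item SUS questionnaire (responses 1..5) on a 0..100 scale."""
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten responses, each between 1 and 5")
    total = 0
    for item, r in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (r - 1) if item % 2 == 1 else (5 - r)
    return total * 2.5


def mean_sus(all_responses):
    """Average the per-participant SUS scores."""
    return sum(sus_score(r) for r in all_responses) / len(all_responses)


# Hypothetical participants, for illustration only.
participants = [
    [4, 2, 4, 2, 4, 2, 4, 2, 4, 2],
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
]
print(mean_sus(participants))
```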
Identification of Subject Controlled
Vocabularies Should Involve Stakeholders
● What is the effect of
integrating controlled
vocabulary sets in IRs?
○ Sandbox IR integrated with
LCSH vocabulary set
○ Mean SUS score computed
○ 68.6 with vocabularies
○ 65.1 without vocabularies
Chipangila, B., Liswaniso, E., Mawila, A., Mwanza, P., and Nawila, D. (2019)
“Effectiveness of Integrating Controlled Subject Vocabulary Sets in the UNZA IR”.
URL: http://lis.unza.zm/archive/handle/123456789/2236
Effective Workflows for Ingestion of
Electronic Theses and Dissertations
● What workflows would
be effective at reducing
the turnaround time
between submission and
ingestion of ETDs?
○ Stakeholders involved in
submission workflow of
ETDs
Banda, A. (2019--Present)
“Investigating the Workflows Involved in the Ingestion of ETD into IRs”.
Work-in-Progress
Improved Electronic Theses and Dissertation
Metadata Quality
● How can high-quality ETD
metadata be ingested into
IRs?
● How can submission
workflows be modified to
facilitate ingestion of ETD-ms
encoded metadata?
Banda, M., Chinyama, A., Maambo, H., Mulomba, H., and Mwitwa, K. (2020--Present)
“Improved Electronic Theses and Dissertation Metadata Quality”.
Work-in-Progress
http://www.ndltd.org/standards/metadata
Timely and Consistent Ingestion of ETDs Can
be Facilitated by Automatic Classification
● Implementation of classification models to
automatically classify IR digital objects
using the minimum possible input from
graduate students: “The ETD Manuscript”
○ The ETD manuscript bitstream is considered
the “single source of truth”
○ Metadata prepared by staff who work with the IR
potentially has inconsistencies
Phiri, L. (in press)
“Automatic Classification of Digital Objects for Improved Metadata Quality of ETDs”.
International Journal of Metadata, Semantics and Ontologies
Three Classification Models Implemented for
Classifying ETD Type, Subject and Collection
● Text features extracted from a set of core
bitstream portions—ETD Title, ETD
Abstract, ETD Title Page and ETD pages—to
classify ETD manuscripts
ETD Type
ETD Subjects
IR Collection
Text Features are Extracted from Single
Source of Truth—Bitstream
● Textual content mined
from PDF manuscripts
○ Cover/title pages
○ Preliminary pages
● Textual content mined
from metadata for
training
● PDF document metadata
● Curated datasets from
external repositories
Text on Title Pages Exhibits Characteristics
that Enable ETD Classification
Descriptive Dublin Core Encoded Metadata
Used for Training Classifiers
PDF Metadata Provide Auxiliary Metadata
Information Such as Total Pages
Metadata was Harvested Using OAI-PMH,
Bitstreams were Harvested Using OAI-ORE
● OAI-PMH used to
harvest all ETD
descriptive metadata
elements
● OAI-ORE used to
harvest all ETD PDF
documents
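The metadata half of this harvesting step can be sketched with the standard library alone. The example below builds an OAI-PMH `ListRecords` request and parses Dublin Core elements out of a response. The function names and the sample response are made up for illustration; the verb, the `metadataPrefix` and `resumptionToken` parameters and the namespaces come from OAI-PMH 2.0. No network call is made, and the OAI-ORE bitstream harvesting is not covered.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, resumption_token=None):
    """Build an OAI-PMH ListRecords request for Dublin Core metadata."""
    if resumption_token:
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return f"{base_url}?{urlencode(params)}"


def parse_records(xml_text):
    """Return ([{dc element: [values]}], resumption token or None)."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(f"{OAI}record"):
        fields = {}
        for node in record.iter():
            if node.tag.startswith(DC) and node.text:
                fields.setdefault(node.tag[len(DC):], []).append(node.text.strip())
        records.append(fields)
    token = root.find(f".//{OAI}resumptionToken")
    return records, (token.text if token is not None and token.text else None)


# A trimmed, made-up response; a real one would come from the repository.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Thesis</dc:title>
          <dc:type>Thesis</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_records(SAMPLE)
print(records, token)
```

A harvester would loop, passing each returned token back into `list_records_url` until the repository stops returning one.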
The Models are Accurate Enough to be
Integrated with the IR
● ETD Type: 98.1%
● ETD Collection: 81.1%
● ETD Subjects: 81.7%
● The models would still
need to be
incorporated into an
application that
requires “some”
human intervention
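One possible shape for that human intervention is a confidence threshold: confident predictions are accepted and the rest are queued for a curator. The sketch below is an assumption about how this could work, not a description of the deployed application; the function name and the 0.8 threshold are hypothetical.

```python
def route_prediction(label, confidence, threshold=0.8):
    """Auto-accept confident predictions; queue the rest for a curator."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be a probability")
    decision = "auto-accept" if confidence >= threshold else "needs-review"
    return {"label": label, "confidence": confidence, "decision": decision}


print(route_prediction("Thesis", 0.97))
print(route_prediction("Education", 0.55))
```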
Models are Deployed as Flask-Based API
Endpoints (1/2)
https://github.com/lightonphiri/etd_autoclassifier
Models are Deployed as Flask-Based API
Endpoints (2/2)
https://datalab-apis.herokuapp.com/api/collection
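A Flask endpoint of this kind might look roughly like the sketch below. Only the `/api/collection` path is taken from the URL above. The JSON request and response fields, and the placeholder `classify_collection` function standing in for a trained model, are assumptions and may differ from the actual repository code.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def classify_collection(text):
    """Placeholder for a trained model; returns a (label, confidence) pair."""
    # Hypothetical rule standing in for a real classifier.
    if "library" in text.lower():
        return "Library and Information Science", 0.9
    return "Unclassified", 0.5


@app.route("/api/collection", methods=["POST"])
def collection():
    """Classify the submitted text and return the predicted IR collection."""
    payload = request.get_json(silent=True) or {}
    text = payload.get("text", "")
    if not text:
        return jsonify(error="missing 'text' field"), 400
    label, confidence = classify_collection(text)
    return jsonify(collection=label, confidence=confidence)
```

Saved as a module, this can be served locally with the `flask run` command, and a downstream service would POST the text extracted from a manuscript to the endpoint.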
Formatting Guidelines for ETD Metadata
Harvested by National ETD Portal
http://lis.unza.zm/portal
Automatic ETD-ms Metadata Generation
http://www.ndltd.org/standards/metadata
Effective Workflows for Ingestion of
Electronic Theses and Dissertations
● What is the feasibility of implementing
machine learning models for
automatically classifying IR objects?
○ This work goes beyond ETDs to include
other digital object types
● How can downstream services be
integrated with the models?
M’sendo, R. (2019--Present)
“Multi-Faceted Automatic Classification of Institutional Repository Digital Objects”.
Work-in-Progress
Online Visibility of Research can Potentially
Change the Global Reputation of HEIs
http://www.webometrics.info
Q & A Session
● Comments, concerns and complaints?
Bibliography
[1] Phiri, L. (2018). Research Visibility in the Global South: Towards
Increased Online Visibility of Scholarly Research Output in
Zambia. IEEE International Conference in Information and
Communication Technologies.
[2] Phiri, L. (2020). A Multi-Faceted Multi-Stakeholder Approach for
Increased Visibility of ETDs in Zambia. Cadernos BAD, (1).
https://doi.org/10.1017/S0269888910000032
[3] Arms, W. Y. (1995). Key concepts in the architecture of the digital
library. D-lib Magazine, 1(1).
lighton.phiri@unza.zm
http://datalab.unza.zm
http://lis.unza.zm/~lightonphiri

Mais conteúdo relacionado

Mais procurados

A forecasting of stock trading price using time series information based on b...
A forecasting of stock trading price using time series information based on b...A forecasting of stock trading price using time series information based on b...
A forecasting of stock trading price using time series information based on b...
IJECEIAES
 

Mais procurados (11)

Big data trends in 2020
Big data trends in 2020Big data trends in 2020
Big data trends in 2020
 
New Data for Innovation Policy
New Data for Innovation PolicyNew Data for Innovation Policy
New Data for Innovation Policy
 
CODATA: Open Data, FAIR Data and Open Science/Simon Hodson
CODATA: Open Data, FAIR Data and Open Science/Simon HodsonCODATA: Open Data, FAIR Data and Open Science/Simon Hodson
CODATA: Open Data, FAIR Data and Open Science/Simon Hodson
 
A forecasting of stock trading price using time series information based on b...
A forecasting of stock trading price using time series information based on b...A forecasting of stock trading price using time series information based on b...
A forecasting of stock trading price using time series information based on b...
 
Diffusion of Big Data and Analytics in Developing Countries
Diffusion of Big Data and Analytics in Developing CountriesDiffusion of Big Data and Analytics in Developing Countries
Diffusion of Big Data and Analytics in Developing Countries
 
Digital notebooks - a Jisc perspective
Digital notebooks - a Jisc perspectiveDigital notebooks - a Jisc perspective
Digital notebooks - a Jisc perspective
 
The Evidence Hub: Harnessing the Collective Intelligence of Communities to Bu...
The Evidence Hub: Harnessing the Collective Intelligence of Communities to Bu...The Evidence Hub: Harnessing the Collective Intelligence of Communities to Bu...
The Evidence Hub: Harnessing the Collective Intelligence of Communities to Bu...
 
Griffiths lace workshop-eden-2016
Griffiths lace workshop-eden-2016Griffiths lace workshop-eden-2016
Griffiths lace workshop-eden-2016
 
Deploying Open Learning Analytics at a National Scale
Deploying Open Learning Analytics at a National ScaleDeploying Open Learning Analytics at a National Scale
Deploying Open Learning Analytics at a National Scale
 
African Open Science Platform
African Open Science PlatformAfrican Open Science Platform
African Open Science Platform
 
Data education and skills initiatives
Data education and skills initiativesData education and skills initiatives
Data education and skills initiatives
 

Semelhante a Institutional Repository Single Sources of Truth

A presentation on Applications of ICT in Research.pptx
A presentation on Applications of ICT in Research.pptxA presentation on Applications of ICT in Research.pptx
A presentation on Applications of ICT in Research.pptx
ROHITSHARMA779690
 
International Journal of Data Mining & Knowledge Management Process ( IJDKP )
International Journal of Data Mining & Knowledge Management Process ( IJDKP )International Journal of Data Mining & Knowledge Management Process ( IJDKP )
International Journal of Data Mining & Knowledge Management Process ( IJDKP )
IJDKP
 

Semelhante a Institutional Repository Single Sources of Truth (20)

Discovering Insight from Scholarly Research Output in Higher Educational Inst...
Discovering Insight from Scholarly Research Output in Higher Educational Inst...Discovering Insight from Scholarly Research Output in Higher Educational Inst...
Discovering Insight from Scholarly Research Output in Higher Educational Inst...
 
Quantum Mechanics meet Information Search and Retrieval – The QUARTZ Project
Quantum Mechanics meet Information Search and Retrieval – The QUARTZ ProjectQuantum Mechanics meet Information Search and Retrieval – The QUARTZ Project
Quantum Mechanics meet Information Search and Retrieval – The QUARTZ Project
 
Rdaeu russia_fg_1_july2014_final
Rdaeu  russia_fg_1_july2014_finalRdaeu  russia_fg_1_july2014_final
Rdaeu russia_fg_1_july2014_final
 
SR-R-nKAnwar_PPM_Penulisan_ProposalLPDP.pdf
SR-R-nKAnwar_PPM_Penulisan_ProposalLPDP.pdfSR-R-nKAnwar_PPM_Penulisan_ProposalLPDP.pdf
SR-R-nKAnwar_PPM_Penulisan_ProposalLPDP.pdf
 
ppt1.pptx
ppt1.pptxppt1.pptx
ppt1.pptx
 
IFLA ARL Hot Topics 2020: Libraries as Catalysts - Inspire, Engage, Enable, C...
IFLA ARL Hot Topics 2020: Libraries as Catalysts - Inspire, Engage, Enable, C...IFLA ARL Hot Topics 2020: Libraries as Catalysts - Inspire, Engage, Enable, C...
IFLA ARL Hot Topics 2020: Libraries as Catalysts - Inspire, Engage, Enable, C...
 
Case study on gina(gobal innovation network and analysis)
Case study on gina(gobal innovation network and analysis)Case study on gina(gobal innovation network and analysis)
Case study on gina(gobal innovation network and analysis)
 
Big data
Big dataBig data
Big data
 
International Journal of Data Mining & Knowledge Management Process ( IJDKP )
International Journal of Data Mining & Knowledge Management Process ( IJDKP )International Journal of Data Mining & Knowledge Management Process ( IJDKP )
International Journal of Data Mining & Knowledge Management Process ( IJDKP )
 
Academic Innovation Data Showcase 2-14-19
Academic Innovation Data Showcase 2-14-19Academic Innovation Data Showcase 2-14-19
Academic Innovation Data Showcase 2-14-19
 
A presentation on Applications of ICT in Research.pptx
A presentation on Applications of ICT in Research.pptxA presentation on Applications of ICT in Research.pptx
A presentation on Applications of ICT in Research.pptx
 
International Journal of Data Mining & Knowledge Management Process ( IJDKP )
International Journal of Data Mining & Knowledge Management Process ( IJDKP )International Journal of Data Mining & Knowledge Management Process ( IJDKP )
International Journal of Data Mining & Knowledge Management Process ( IJDKP )
 
International Journal of Data Mining & Knowledge Management Process ( IJDKP )
International Journal of Data Mining & Knowledge Management Process ( IJDKP )International Journal of Data Mining & Knowledge Management Process ( IJDKP )
International Journal of Data Mining & Knowledge Management Process ( IJDKP )
 
International Journal of Data Mining & Knowledge Management Process ( IJDKP )
International Journal of Data Mining & Knowledge Management Process ( IJDKP )International Journal of Data Mining & Knowledge Management Process ( IJDKP )
International Journal of Data Mining & Knowledge Management Process ( IJDKP )
 
International Journal of Data Mining & Knowledge Management Process ( IJDKP )
International Journal of Data Mining & Knowledge Management Process ( IJDKP )International Journal of Data Mining & Knowledge Management Process ( IJDKP )
International Journal of Data Mining & Knowledge Management Process ( IJDKP )
 
International Journal of Data Mining & Knowledge Management Process ( IJDKP )
International Journal of Data Mining & Knowledge Management Process ( IJDKP )International Journal of Data Mining & Knowledge Management Process ( IJDKP )
International Journal of Data Mining & Knowledge Management Process ( IJDKP )
 
Proposal for the Theme on Big Data.pdf
Proposal for the Theme on Big Data.pdfProposal for the Theme on Big Data.pdf
Proposal for the Theme on Big Data.pdf
 
International Journal of Data Mining & Knowledge Management Process ( IJDKP )
International Journal of Data Mining & Knowledge Management Process ( IJDKP )International Journal of Data Mining & Knowledge Management Process ( IJDKP )
International Journal of Data Mining & Knowledge Management Process ( IJDKP )
 
International Journal of Data Mining & Knowledge Management Process ( IJDKP )
International Journal of Data Mining & Knowledge Management Process ( IJDKP )International Journal of Data Mining & Knowledge Management Process ( IJDKP )
International Journal of Data Mining & Knowledge Management Process ( IJDKP )
 
Open Data and Big Data Capacity Building Initiative
Open Data and Big Data Capacity Building InitiativeOpen Data and Big Data Capacity Building Initiative
Open Data and Big Data Capacity Building Initiative
 

Mais de Lighton Phiri

Mais de Lighton Phiri (20)

Enterprise Medical Imaging for Streamlined Radiological Diagnosis in Zambian...
Enterprise Medical Imaging for Streamlined Radiological Diagnosis  in Zambian...Enterprise Medical Imaging for Streamlined Radiological Diagnosis  in Zambian...
Enterprise Medical Imaging for Streamlined Radiological Diagnosis in Zambian...
 
User Centred Design and Implementation of Useful Picture Archiving and Commun...
User Centred Design and Implementation of Useful Picture Archiving and Commun...User Centred Design and Implementation of Useful Picture Archiving and Commun...
User Centred Design and Implementation of Useful Picture Archiving and Commun...
 
Enterprise Medical Imaging for Improved Radiological Workflows in Zambian Pub...
Enterprise Medical Imaging for Improved Radiological Workflows in Zambian Pub...Enterprise Medical Imaging for Improved Radiological Workflows in Zambian Pub...
Enterprise Medical Imaging for Improved Radiological Workflows in Zambian Pub...
 
Empirical Evaluation of ETD-ms Compliance for ETDs Harvested by the NDLTD Uni...
Empirical Evaluation of ETD-ms Compliance for ETDs Harvested by the NDLTD Uni...Empirical Evaluation of ETD-ms Compliance for ETDs Harvested by the NDLTD Uni...
Empirical Evaluation of ETD-ms Compliance for ETDs Harvested by the NDLTD Uni...
 
Enterprise Medical Imaging in Public Health Facilities in Zambia: Towards a U...
Enterprise Medical Imaging in Public Health Facilities in Zambia: Towards a U...Enterprise Medical Imaging in Public Health Facilities in Zambia: Towards a U...
Enterprise Medical Imaging in Public Health Facilities in Zambia: Towards a U...
 
Enterprise Medical Imaging in the Global South: Challenges and Opportunities
Enterprise Medical Imaging in the Global South: Challenges and OpportunitiesEnterprise Medical Imaging in the Global South: Challenges and Opportunities
Enterprise Medical Imaging in the Global South: Challenges and Opportunities
 
Factors Influencing Co-Creation of Open Education Resources Using Learning Ob...
Factors Influencing Co-Creation of Open Education Resources Using Learning Ob...Factors Influencing Co-Creation of Open Education Resources Using Learning Ob...
Factors Influencing Co-Creation of Open Education Resources Using Learning Ob...
 
DRGS OJS Training: Electronic Publishing Using Open Journal Systems
DRGS OJS Training: Electronic Publishing Using Open Journal SystemsDRGS OJS Training: Electronic Publishing Using Open Journal Systems
DRGS OJS Training: Electronic Publishing Using Open Journal Systems
 
OJS Training: Users and User Roles
OJS Training: Users and User RolesOJS Training: Users and User Roles
OJS Training: Users and User Roles
 
OJS Training: Journal Settings and Configuration
OJS Training: Journal Settings and ConfigurationOJS Training: Journal Settings and Configuration
OJS Training: Journal Settings and Configuration
 
OJS Training: Managing The Submission Process
OJS Training: Managing The Submission ProcessOJS Training: Managing The Submission Process
OJS Training: Managing The Submission Process
 
OJS Training: Creating and Managing Journal Issues
OJS Training: Creating and Managing Journal IssuesOJS Training: Creating and Managing Journal Issues
OJS Training: Creating and Managing Journal Issues
 
Improved Discoverability of Digital Objects in Institutional Repositories Usi...
Improved Discoverability of Digital Objects in Institutional Repositories Usi...Improved Discoverability of Digital Objects in Institutional Repositories Usi...
Improved Discoverability of Digital Objects in Institutional Repositories Usi...
 
Using Machine Learning Techniques for Solving Locally Relevant Problems
Using Machine Learning Techniques for Solving Locally Relevant ProblemsUsing Machine Learning Techniques for Solving Locally Relevant Problems
Using Machine Learning Techniques for Solving Locally Relevant Problems
 
Effective Ingestion of Digital Objects in Institutional Repositories Using Su...
Effective Ingestion of Digital Objects in Institutional Repositories Using Su...Effective Ingestion of Digital Objects in Institutional Repositories Using Su...
Effective Ingestion of Digital Objects in Institutional Repositories Using Su...
 
Improved Scholarly Communication Using Machine Learning
Improved Scholarly Communication Using Machine LearningImproved Scholarly Communication Using Machine Learning
Improved Scholarly Communication Using Machine Learning
 
Open Access Electronic Publishing for Increased Online Visibility: Tooling Ch...
Open Access Electronic Publishing for Increased Online Visibility: Tooling Ch...Open Access Electronic Publishing for Increased Online Visibility: Tooling Ch...
Open Access Electronic Publishing for Increased Online Visibility: Tooling Ch...
 
A Multi-Faceted Multi-Stakeholder Approach for Increased Visibility of ETDs i...
A Multi-Faceted Multi-Stakeholder Approach for Increased Visibility of ETDs i...A Multi-Faceted Multi-Stakeholder Approach for Increased Visibility of ETDs i...
A Multi-Faceted Multi-Stakeholder Approach for Increased Visibility of ETDs i...
 
A Multi-Faceted Multi-Stakeholder Approach for Increased Visibility of ETDs i...
A Multi-Faceted Multi-Stakeholder Approach for Increased Visibility of ETDs i...A Multi-Faceted Multi-Stakeholder Approach for Increased Visibility of ETDs i...
A Multi-Faceted Multi-Stakeholder Approach for Increased Visibility of ETDs i...
 
Post PhD Transition Experience: Successes and Challenges
Post PhD Transition Experience: Successes and ChallengesPost PhD Transition Experience: Successes and Challenges
Post PhD Transition Experience: Successes and Challenges
 

Último

1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
QucHHunhnh
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
SoniaTolstoy
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 

Último (20)

Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 

Institutional Repository Single Sources of Truth

  • 1. Institutional Repository Single Sources of Truth Lighton Phiri <lighton.phiri@unza.zm> DataLab Research Group Department of Library & Information Science University of Zambia http://lis.unza.zm/~lightonphiri Mining Descriptive Metadata from Electronic Theses and Dissertations Digital Object Bitstreams
  • 2. 2June 9, 2020 About The DataLab Research Group at The University of Zambia ● The DataLab research group at The University of Zambia is composed of faculty staff and students—undergraduate and postgraduate—working in three main areas ○ Data Mining ○ Digital Libraries ○ Technology-Enhanced Learning http://datalab.unza.zm
  • 3. Outline ● Definitions and Key Concepts ● Contextualisation ● The Problem ● The Bigger Picture ● What We Think Can Work ● Past and Current Projects ● Data-Driven Ingestion of ETDs ● Conclusion and Future Work
  • 4. Outline ● Definitions and Key Concepts ● Contextualisation ● The Problem ● The Bigger Picture ● What We Think Can Work ● Past and Current Projects ● Data-Driven Ingestion of ETDs ● Conclusion and Future Work
  • 5. Institutional Repositories are a Specialised Type of Digital Library Management System
  • 6. Digital Objects are Composed of an Identifier, Bitstream and Metadata ● Identifier provides local and global uniqueness to objects ● Metadata provides auxiliary information about objects ● Bitstreams make up the content associated with the objects http://www.dlib.org/dlib/July95/07arms.html
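The three-part structure described on the slide above can be sketched as a small data structure. This is only an illustrative model, not how any particular repository platform stores objects; the class name, field names, the handle value and the `dc.*` element keys are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DigitalObject:
    """Illustrative digital object: identifier + bitstreams + metadata."""

    identifier: str  # provides uniqueness, e.g. a handle
    bitstreams: List[bytes] = field(default_factory=list)  # the content itself
    metadata: Dict[str, List[str]] = field(default_factory=dict)  # auxiliary info

    def add_metadata(self, element: str, value: str) -> None:
        # Metadata elements are repeatable, so each key maps to a list.
        self.metadata.setdefault(element, []).append(value)


# Hypothetical values, for illustration only.
etd = DigitalObject(identifier="123456789/0000")
etd.bitstreams.append(b"%PDF-1.4 ...")
etd.add_metadata("dc.title", "An Example Thesis Title")
etd.add_metadata("dc.type", "Thesis")
```

Keeping each metadata element as a list mirrors the fact that schemes such as Dublin Core allow elements to repeat.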
  • 7. Ingestion of Digital Objects into IRs Involves Numerous Moving Parts ● Digital object bitstreams need to be verified ● Digital object metadata needs to be prepared ○ Preparing metadata is time-consuming and error-prone since data needs to be encoded using specific metadata schemes, e.g. Dublin Core or ETD-ms ● Multi-step IR workflows http://open.uct.ac.za/handle/11427/29435
  • 8. Supervised Machine Learning 101: Labelled Data is Used to Train a Prediction Model. Gmail Spam Detection ● Model implemented by training estimators using labelled data ● f(x): Spam/Not Spam classification ○ x: email address, subject, email content ○ Labels: mailbox user tagging; Gmail automatic detection
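The idea of learning f(x) from labelled examples can be sketched with a tiny multinomial Naive Bayes classifier written in plain Python. This is a toy under stated assumptions: the four training messages are made up, whitespace tokenisation stands in for real feature extraction, and nothing here reflects how Gmail or the models in this talk are actually implemented.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Simplistic feature extraction: lowercase, split on whitespace.
    return text.lower().split()


def train(examples):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up labelled data: x is the message text, the label is the target.
labelled = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("thesis draft attached for review", "not spam"),
]
model = train(labelled)
prediction = predict(model, "claim your free prize")
```

The same train-then-predict pattern applies when x is text mined from an ETD manuscript and the label is, for example, the ETD type.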
  • 9. Outline ● Definitions and Key Concepts ● Contextualisation ● The Problem ● The Bigger Picture ● What We Think Can Work ● Past and Current Projects ● Data-Driven Ingestion of ETDs ● Conclusion and Future Work
  • 10. The Online Visibility of Research Output Influences Global Institution Rankings (1/2) http://www.webometrics.info
  • 11. The Online Visibility of Research Output Influences Global Institution Rankings (1/2) http://www.webometrics.info
  • 12. The Online Visibility of Research Output Influences Global Institution Rankings (2/2) http://www.webometrics.info
  • 13. The Online Visibility of Research Output Influences Global Institution Rankings (2/2) http://www.webometrics.info
  • 14. The Online Visibility of Research Output Influences Global Institution Rankings (2/2) http://www.webometrics.info
  • 15. Google Scholar Extraction: The University of Zambia 2018 Scholarly Output Distribution
  • 16. Google Scholar Extraction: The University of Zambia 2018 Scholarly Output Distribution
  • 17. Google Scholar Extraction: The University of Zambia 2018 Scholarly Output Distribution ● 523 publications against ~854 faculty staff ● ONLY 476 (55.7%) faculty staff with online publications
  • 18. Google Scholar Extraction: The University of Zambia 2018 Scholarly Output Impact
  • 19. Google Scholar Extraction: The University of Zambia 2018 Scholarly Output Impact
  • 20. Google Scholar Extraction: The University of Zambia 2018 Scholarly Output Impact ● 30% of faculty staff research with ZERO impact
  • 21. Outline ● Definitions and Key Concepts ● Contextualisation ● The Problem ● The Bigger Picture ● What We Think Can Work ● Past and Current Projects ● Data-Driven Ingestion of ETDs ● Conclusion and Future Work
  • 22. Challenges With Ingestion of “ETDs” into The UNZA IR ● Ingestion of ETDs is not timely ● Huge time gaps between submission and eventual ingestion of ETDs ● ETDs have missing and incorrect descriptive metadata Phiri, L. (2018) “Towards Increased Online Visibility of Scholarly Research Output in Zambia”. URL: http://lis.unza.zm/archive/handle/123456789/227
  • 23. Outline ● Definitions and Key Concepts ● Contextualisation ● The Problem ● The Bigger Picture ● What We Think Can Work ● Past and Current Projects ● Data-Driven Ingestion of ETDs ● Conclusion and Future Work
  • 24. Towards a Zambia National Electronic Theses and Dissertations Portal (1/2) ● Work is underway to deploy a national ETD portal ○ IRs implement open standards and protocols that facilitate harvesting of digital objects ● Stakeholders are required to harmonise curation http://lis.unza.zm/portal
  • 25. Towards a Zambia National Electronic Theses and Dissertations Portal (2/2) http://www.hea.org.zm ● Ideally the portal is aimed at providing central access to all ETDs produced by HEIs in Zambia ○ Source of knowledge for not-for-profit organisations and policy makers ○ Source of knowledge for taxpayers, the entities that fund research
  • 26. Outline ● Definitions and Key Concepts ● Contextualisation ● The Problem ● The Bigger Picture ● What We Think Can Work ● Past and Current Projects ● Data-Driven Ingestion of ETDs ● Conclusion and Future Work
  • 27. A Multi-Faceted Multi-Stakeholder Approach for Increased Visibility of ETDs (1/2) ● Experimentation with various approaches and techniques ○ System-driven approaches ○ Data-driven approaches ○ People-driven approaches ● Involve key stakeholders in order to mainstream the implementation of IRs in Zambian HEIs ○ Knowledge exchange with HEIs in Zambia
  • 28. A Multi-Faceted Multi-Stakeholder Approach for Increased Visibility of ETDs (2/2)
  • 29. Outline ● Definitions and Key Concepts ● Contextualisation ● The Problem ● The Bigger Picture ● What We Think Can Work ● Past and Current Projects ● Data-Driven Ingestion of ETDs ● Conclusion and Future Work
  • 30. Subject Specific Institutional Repositories as Potential Solutions for Delayed Ingestions ● Is it feasible to use subject institutional repositories at The UNZA? ○ Subject IR set up in the Department of Library and Information Science ○ Subject IR achieved a mean SUS score of 65.5 (“Okay” to “Good”) Mbewe, M., Moonga, H., Mwewa, M., Sikazindu, N., and Mwale, N. (2018) “Investigating the Feasibility of Using Subject IRs: A Case Study at The UNZA”. URL: http://lis.unza.zm/archive/handle/123456789/189 http://lis.unza.zm/archive
  • 31. Identification of Subject Controlled Vocabularies Should Involve Stakeholders ● What is the effect of integrating controlled vocabulary sets in IRs? ○ Sandbox IR integrated with LCSH vocabulary set ○ Mean SUS score computed ○ 68.6 with vocabularies ○ 65.1 with no vocabularies Chipangila, B., Liswaniso, E., Mawila, A., Mwanza, P., and Nawila, D. (2019) “Effectiveness of Integrating Controlled Subject Vocabulary Sets in the UNZA IR”. URL: http://lis.unza.zm/archive/handle/123456789/2236
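The mean SUS scores on the two slides above can be made concrete with a short calculation. The sketch assumes that SUS refers to the standard ten-item System Usability Scale with 1 to 5 responses, where odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is multiplied by 2.5. The two questionnaires below are made-up data, not responses from either study.

```python
def sus_score(responses):
    """Score one ten-item SUS questionnaire (each response is 1..5)."""
    if len(responses) != 10:
        raise ValueError("SUS expects exactly 10 responses")
    total = 0
    for item, response in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (response - 1) if item % 2 == 1 else (5 - response)
    return total * 2.5  # scale the 0..40 sum to 0..100


def mean_sus(all_responses):
    """Average the per-participant SUS scores."""
    scores = [sus_score(r) for r in all_responses]
    return sum(scores) / len(scores)


# Two hypothetical participants.
participants = [
    [4, 2, 4, 2, 4, 2, 4, 2, 4, 2],
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
]
average = mean_sus(participants)  # (75.0 + 50.0) / 2 = 62.5
```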
  • 32. Effective Workflows for Ingestion of Electronic Theses and Dissertations ● What workflows would be effective at reducing the turnaround time between submission and ingestion of ETDs? ○ Stakeholders involved in submission workflow of ETDs Banda, A. (2019-Present) “Investigating the Workflows Involved in the Ingestion of ETD into IRs”. Work-in-Progress
  • 33. Improved Electronic Theses and Dissertation Metadata Quality ● How can high-quality ETD metadata be ingested into IRs? ● How can submission workflows be modified to facilitate ingestion of ETD-ms encoded metadata? Banda, M., Chinyama, A., Maambo, H., Mulomba, H., and Mwitwa, K. (2020-Present) “Improved Electronic Theses and Dissertation Metadata Quality”. Work-in-Progress http://www.ndltd.org/standards/metadata
  • 34. Outline ● Definitions and Key Concepts ● Contextualisation ● The Problem ● The Bigger Picture ● What We Think Can Work ● Past and Current Projects ● Data-Driven Ingestion of ETDs ● Conclusion and Future Work
  • 35. Timely and Consistent Ingestion of ETDs Can be Facilitated by Automatic Classification ● Implementation of classification models to automatically classify IR digital objects using the minimum possible input from graduate students: “The ETD Manuscript” ○ The ETD manuscript bitstream is considered the “single source of truth” ○ Metadata prepared by staff who work with the IR potentially has inconsistencies Phiri, L. (in press) “Automatic Classification of Digital Objects for Improved Metadata Quality of ETDs”. International Journal of Metadata, Semantics and Ontologies
  • 36. Three Classification Models Implemented for Classifying ETD Type, Subject and Collection ● Text features extracted from a set of core bitstream portions (ETD Title, ETD Abstract, ETD Title Page and ETD pages) to classify ETD manuscripts ● Models: ETD Type, ETD Subjects, IR Collection
  • 37. Text Features are Extracted from the Single Source of Truth: the Bitstream ● Textual content mined from PDF manuscripts ○ Cover/title pages ○ Preliminary pages ● Textual content mined from metadata for training ● PDF document metadata ● Curated datasets from external repositories
  • 38. Text on Title Pages Exhibits Characteristics that Enable ETD Classification ● Textual content mined from PDF manuscripts ○ Cover/title pages ○ Preliminary pages ● Textual content mined from metadata for training ● PDF document metadata ● Curated datasets from external repositories
  • 39. Descriptive Dublin Core Encoded Metadata Used for Training Classifiers ● Textual content mined from PDF manuscripts ○ Cover/title pages ○ Preliminary pages ● Textual content mined from metadata for training ● PDF document metadata ● Curated datasets from external repositories
  • 40. PDF Metadata Provides Auxiliary Metadata Information Such as Total Pages ● Textual content mined from PDF manuscripts ○ Cover/title pages ○ Preliminary pages ● Textual content mined from metadata for training ● PDF document metadata ● Curated datasets from external repositories
  • 41. Metadata was Harvested Using OAI-PMH, Bitstreams were Harvested Using OAI-ORE ● OAI-PMH used to harvest all ETD descriptive metadata elements ● OAI-ORE used to harvest all ETD PDF documents
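The OAI-PMH harvesting step on the slide above can be sketched with the Python standard library. The sketch builds a `ListRecords` request URL and parses Dublin Core elements out of a response. The base URL, the sample record and the helper names are hypothetical; this is not the harvester used in the work, and it leaves out resumption tokens, error responses and the OAI-ORE bitstream harvest.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Extract the identifier and Dublin Core elements of each record."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        dc = {}
        for el in record.iterfind(".//oai_dc:dc/*", NS):
            name = el.tag.split("}", 1)[1]  # drop the "{namespace}" prefix
            dc.setdefault(name, []).append((el.text or "").strip())
        records.append({"identifier": identifier, "metadata": dc})
    return records


# Hypothetical endpoint and response, for illustration only.
url = list_records_url("https://repository.example.org/oai/request")
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:123456789/1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Thesis Title</dc:title>
          <dc:type>Thesis</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""
harvested = parse_records(SAMPLE)
```

In a real harvest the XML would come from fetching `url`, and the parsed `title` and `type` values could then serve as training text and labels.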
  • 42. The Models are Reasonably Accurate for Integration with the IR ● ETD Type: 98.1% ● ETD Collection: 81.1% ● ETD Subjects: 81.7% ● The models would still need to be incorporated into an application that requires “some” human intervention
  • 43. Models are Deployed as Flask-Based API Endpoints (1/2) https://github.com/lightonphiri/etd_autoclassifier
  • 44. Models are Deployed as Flask-Based API Endpoints (2/2) https://datalab-apis.herokuapp.com/api/collection
  • 45. Outline ● Definitions and Key Concepts ● Contextualisation ● The Problem ● The Bigger Picture ● What We Think Can Work ● Past and Current Projects ● Data-Driven Ingestion of ETDs ● Conclusion and Future Work
  • 46. Formatting Guidelines for ETD Metadata Harvested by National ETD Portal http://lis.unza.zm/portal
  • 47. Automatic ETD-ms Metadata Generation http://www.ndltd.org/standards/metadata
  • 48. Effective Workflows for Ingestion of Electronic Theses and Dissertations ● What is the feasibility of implementing machine learning models for automatically classifying IR objects? ○ This work goes beyond ETDs to include other digital object types ● How can downstream services be integrated with the models? M’sendo, R. (2019-Present) “Multi-Faceted Automatic Classification of Institutional Repository Digital Objects”. Work-in-Progress
  • 49. Online Visibility of Research can Potentially Change the Global Reputation of HEIs http://www.webometrics.info
  • 50. Q & A Session ● Comments, concerns and complaints?
  • 51. Bibliography [1] Phiri, L. (2018). Research Visibility in the Global South: Towards Increased Online Visibility of Scholarly Research Output in Zambia. IEEE International Conference in Information and Communication Technologies. [2] Phiri, L. (2020). A Multi-Faceted Multi-Stakeholder Approach for Increased Visibility of ETDs in Zambia. Cadernos BAD, (1). https://doi.org/10.1017/S0269888910000032 [3] Arms, W. Y. (1995). Key concepts in the architecture of the digital library. D-Lib Magazine, 1(1).