SlideShare uma empresa Scribd logo
1 de 28
Baixar para ler offline
Non-parametric Subject Prediction
Shenghui Wang 1 Rob Koopman 1 Gwenn Englebienne 2
1OCLC Research
2University of Twente
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 1 / 24
What is subject indexing?
Subject indexing is the act of describing or classifying a document by
index terms or other symbols in order to indicate what the document
is about, to summarize its content or to increase its findability.
Index terms or classes are normally from established knowledge
organization systems and classification systems that easily contain
tens or hundreds of thousands of terms or codes
Manual subject indexing is time consuming and not scalable anymore
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 2 / 24
Automatic subject indexing
The goal of automatic subject indexing is to assign a document a
small subset of relevant subject labels (subjects), taken from tens or
hundreds of thousands target labels.
Existing approaches categorised at ISKO Encyclopedia of Knowledge
Organization:
Text categorisation
document clustering
document classification
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 3 / 24
From a machine learning perspective
Different from binary or multi-class classification problems, it is a
multi-label classification problem where the target labels are neither
independent nor mutually exclusive
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 4 / 24
From a machine learning perspective
Different from binary or multi-class classification problems, it is a
multi-label classification problem where the target labels are neither
independent nor mutually exclusive
It is a form of eXtreme Multi-label Text Classification (XMTC)
problem, where the prediction space normally consists of hundreds of
thousands to millions of labels.
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 4 / 24
From a machine learning perspective
Different from binary or multi-class classification problems, it is a
multi-label classification problem where the target labels are neither
independent nor mutually exclusive
It is a form of eXtreme Multi-label Text Classification (XMTC)
problem, where the prediction space normally consists of hundreds of
thousands to millions of labels.
Challenges: data sparsity and scalability
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 4 / 24
Existing XMTC approaches
1-vs-All classifiers (e.g. Slice, Parabel, PD-Sparse)
Tree-based ensemble methods(e.g. SwiftXML, FastXML)
Target-embedding methods(e.g. DEFRAG, SLEEC)
Deep learning (e.g. fastText Learn Tree, XML-CNN)
http://manikvarma.org/downloads/XC/XMLRepository.html
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 5 / 24
Our two-step approach: first embed, then predict
Step 1 Embedding subjects and documents in the same semantic
space
Step 2 Leveraging similarities for subject prediction
Naive similarity based method
Nonparametric (neighbourhood based) method
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 6 / 24
Word and document embedding
Words and documents are represented as vectors of numbers that
represent their meaning.
Semantically similar words and documents are mapped to nearby
points
A desirable property: computable similarity
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 7 / 24
Embedding approaches
Global co-occurrence count-based methods (e.g. LSA)
Based on statistics of how often some word co-occurs with its
neighbour words in a large text corpus
Dimensionality reduction methods
Local context predictive methods (e.g. word2vec)
Learning to predict a word from its neighbours or vice versa
More complex and powerful deep learning models for embedding words,
sentences and documents
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 8 / 24
Weakness of deep learning
Pre-trained word embeddings may not capture domain-specific
semantics
Medical information retrieval, special collection exploration, etc.
Computationally expensive to train from scratch
Often requiring GPUs
Complex optimisation, difficult to find the optimal hyperparameter
settings
Standard benchmarks and evaluation methods often do not answer
practical needs
So what is a good cost function?
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 9 / 24
Ariadne semantic embedding
Something fast and discrimitive?
1
Koopman, R., Wang, S., Englebienne, G.: Fast and Discriminative Semantic
Embedding. In Proceedings of the 13th International Conference on Computational
Semantics - Long Papers. pp. 235-246. ACL, Gothenburg, Sweden (May 2019)
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 10 / 24
Ariadne semantic embedding
Fast random projection to compute term embeddings
Orthogonal projection to increase discrimination
Term weight assignment for better document embeddings
1
1
Koopman, R., Wang, S., Englebienne, G.: Fast and Discriminative Semantic
Embedding. In Proceedings of the 13th International Conference on Computational
Semantics - Long Papers. pp. 235-246. ACL, Gothenburg, Sweden (May 2019)
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 10 / 24
Fast Random Projection
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 11 / 24
Discriminative
We do not use heuristics to filter out stop words
Task-specific, dataset-specific, . . .
Most terms end up being similar to average language vector va
va corresponds closely to stop words
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 12 / 24
Discriminative
We do not use heuristics to filter out stop words
Task-specific, dataset-specific, . . .
Most terms end up being similar to average language vector va
va corresponds closely to stop words
Our solution:
Cancel out the average language vector from the embeddings
(Orthogonal Projection)
Use the original similarity to va inform the importance of terms (Term
weight assignment)
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 12 / 24
Orthogonal Projection
Vectors are projected on the orthogonal hyperplane to an average
language vector va: v∗
t = vt − (vt · va)va
Removing the stop-wordiness improves the discriminating power
Projected vectors are normalized
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 13 / 24
Document embedding: term weights
The more similar to the average vector, the less weight it gets
wt = 1 − cos(vt, va)
No need to remove stop words first
Crucial to get distinctive document embeddings (weighted average)
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 14 / 24
Automatic Subject prediction
Naive method based on subject-document similarity
A document is likely to be indexed with subjects that are most related
to it (i.e. with highest cosine similarities)
Non-Parametric (neighbourhood based) method
A document is likely to be indexed similarly to the documents that are
similar to it
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 15 / 24
Dataset
Metadata of one million Medline articles with abstract
The training set contains 147,837 unique MESH subjects
Each article is indexed with, on average, 16 MeSH subjects
222,135 subjects are used to index less than 10 article
95,218 subjects are used to index only one single article
7 subjects are each used to index more than 100K articles
Humans indexes nearly 58% of the whole corpus
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 16 / 24
Evaluation
We evaluated our methods with fastText
The same server with 2 Intel Xeon Silver 4109T 8-core processors and
384GB memory
The minimum document frequency is 10
The dimensionality of embeddings is 256
10,000 articles for testing
Measure individual precision/recall of top N predictions then take the
average2
P@n = 1
n l∈rn(ˆy) yl
R@n = 1
y 0 l∈rn(ˆy) yl
2
More metrics can be found in the paper
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 17 / 24
Results
0 20 40 60 80 100
Top n
0.0
0.2
0.4
0.6
0.8
1.0
Average P@n
fastText
Ariadne
Ariadne + NPSP
0 20 40 60 80 100
Top n
0.0
0.2
0.4
0.6
0.8
1.0
Average R@n
fastText
Ariadne
Ariadne + NPSP
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 18 / 24
A closer look
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 19 / 24
Compare Ariadne, Ariadne+NPSP with fastText
Actual MeSH subjects
(alphabetical order)
dt Ariadne dt Ariadne + NPSP dt fastText dt
Acrylic Resins 920 Lenses, Intraocular 434 Humans 579975 Humans 579975
Aged 118655 Lens Implantation,
Intraocular
317 Male 336647 Female 328885
Aged, 80 and over 40642 Phacoemulsification 225 Female 328885 Middle Aged 168714
Capsulorhexis 24 Visual Acuity 2026 Aged 118655 Male 336647
Female 328885 Lens Capsule, Crystalline/
Surgery
79 Lenses, Intraocular 434 Risk Factors 34538
Humans 579975 Lens Capsule, Crystalline/
Pathology
65 Lens Implantation,
intraocular
317 Adult 194200
Laser therapy* 847 Visual Acuity/Physiology 949 Visual Acuity 2026 Aged 118655
Lens Capsule, Crystalline/
Pathology
65 Lens Implantation,
Intraocular/Methods
90 Phacoemulsification 225 Retrospective Studies 32642
Lens Capsule, Crystalline/
Surgery*
79 Cataract Extraction/Methods 136 Middle Aged 168714 Follow Up Studies 27911
Lens Implantation, Intraocular 317 Phacoemulsification/Methods 106 Prospective Studies 25714 Aged 80 and over 40642
Male 336647 Retrospective Studies 32642 Lens Capsule, Crystalline/
Surgery
79 Incidence 11468
Middle Aged 168714 Aged 118655 Follow Up Studies 27911 Adolescent 75361
Phacoemulsification* 225 Prospective Studies 25714 Acrylic Resins 920 Young Adult 27991
Polymethyl Methacrylate 392 Aged 80 and Over 40642 Lens Capsule, Crystalline/
Pathology
65 Prospective Studies 25714
Postoperative Complications/
Pathology
351 Middle Aged 168714 Postoperative Complications 3961 Time Factors 50339
Postoperative Complications/
Surgery*
823 Male 336647 Aged 80 and Over 40642 Logistic Models 7258
Probability 2914 Cataract Extraction 350 Cataract Etiology 218 Case Control Studies 13306
Retrospective Studies 32642 Lenses, Intraocular/ Adverse
Effects
62 Adult 194200 Cohort Studies 12275
Risk Factors 34538 Adolescent 75361 Retrospective Studies 32642 Treatment Outcome 40496
Sex Factors 10203 Female 328885 Visual Acuity/Physiology 949 Child 53738
Silicone Elastomers 294 Intraocular Pressure 712 Treatment Outcome 40496 Survival Rate 7975
Survival Analysis 7046 Capsulorhexis 24 Cataract Extraction 350 Prognosis 17160
Visual Acuity 2026 Pseudophakia Physiopathology 56 Lens Implantation,
Intraocular/ Adverse Effects
45 Risk Assessment 9271
Adult 194200 Silicone Elastomers 294 Reoperation 3355
Risk Factors 34538 Prosthesis Design 2096 Proportional Hazards Models 3403
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 20 / 24
Predictions vs subjects’ document frequencies
0 5 10 15 20
document frequency (log2)
10000
5000
0
5000
10000
15000
#predictions
8491464
2583
4155
6021
7528
8144
6742
44352224
1066
407
159
69
25
6 4 2
3
1
Top 10
Targets
fastText
Ariadne
Ariadne + NPSP
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 21 / 24
Summary and future work
Automatic subject prediction is a form of eXtreme Mult-label Text
Classification problem
We proposed an embedding-based approach for subject prediction
We need more appropriate evaluation metrics
Domain experts need to get involved
Explore deep learning methods for XMTC
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 22 / 24
Thank you
Non-parametric Subject Prediction
Shenghui Wang 1 Rob Koopman 1 Gwenn Englebienne 2
1OCLC Research
2University of Twente
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 23 / 24
STS scores and train times of different methods
Method Dev Test Dataset used Train time
Doc2Vec (PV-DBOW) 72.2 64.9 AP-NEWS Unknown
31.0 27.0 S.E. Wikipedia 23m
53.7 49.8 MEDLINE 2h3m
FastText 65.3 53.6 Wikipedia Unknown
48.6 38.9 S.E. Wikipedia 51m
52.8 41.1 MEDLINE 3h4m
Sent2Vec 78.7 75.5 Twitter Unknown
68.8 62.2 S.E. Wikipedia 18m
65.4 55.0 MEDLINE 3h10m
Our method 75.1 64.6 S.E. Wikipedia 17s
73.6 58.9 MEDLINE 43s
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 24 / 24

Mais conteúdo relacionado

Mais procurados

kantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptkantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptbutest
 
Ontology engineering: Ontology alignment
Ontology engineering: Ontology alignmentOntology engineering: Ontology alignment
Ontology engineering: Ontology alignmentGuus Schreiber
 
taghelper-final.doc
taghelper-final.doctaghelper-final.doc
taghelper-final.docbutest
 
Novelty detection via topic modeling in research articles
Novelty detection via topic modeling in research articlesNovelty detection via topic modeling in research articles
Novelty detection via topic modeling in research articlescsandit
 
AINL 2016: Castro, Lopez, Cavalcante, Couto
AINL 2016: Castro, Lopez, Cavalcante, CoutoAINL 2016: Castro, Lopez, Cavalcante, Couto
AINL 2016: Castro, Lopez, Cavalcante, CoutoLidia Pivovarova
 
9789513932527[1]
9789513932527[1]9789513932527[1]
9789513932527[1]Eki Laitila
 
Prediction of Answer Keywords using Char-RNN
Prediction of Answer Keywords using Char-RNNPrediction of Answer Keywords using Char-RNN
Prediction of Answer Keywords using Char-RNNIJECEIAES
 
A rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationA rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationNinad Samel
 
What to read next? Challenges and Preliminary Results in Selecting Represen...
What to read next? Challenges and  Preliminary Results in Selecting  Represen...What to read next? Challenges and  Preliminary Results in Selecting  Represen...
What to read next? Challenges and Preliminary Results in Selecting Represen...MOVING Project
 
Experimenting with eXtreme Design (EKAW2010)
Experimenting with eXtreme Design (EKAW2010)Experimenting with eXtreme Design (EKAW2010)
Experimenting with eXtreme Design (EKAW2010)evabl444
 
A survey on phrase structure learning methods for text classification
A survey on phrase structure learning methods for text classificationA survey on phrase structure learning methods for text classification
A survey on phrase structure learning methods for text classificationijnlc
 
Taking into account communities of practice’s specific vocabularies in inform...
Taking into account communities of practice’s specific vocabularies in inform...Taking into account communities of practice’s specific vocabularies in inform...
Taking into account communities of practice’s specific vocabularies in inform...inscit2006
 
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from TextCooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from TextUniversity of Bari (Italy)
 
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...Isabelle Augenstein
 
A Survey on Unsupervised Graph-based Word Sense Disambiguation
A Survey on Unsupervised Graph-based Word Sense DisambiguationA Survey on Unsupervised Graph-based Word Sense Disambiguation
A Survey on Unsupervised Graph-based Word Sense DisambiguationElena-Oana Tabaranu
 
20051128.doc
20051128.doc20051128.doc
20051128.docbutest
 
Helping Students to Learn Matehmatics Beyond LMS
Helping Students to Learn Matehmatics Beyond LMSHelping Students to Learn Matehmatics Beyond LMS
Helping Students to Learn Matehmatics Beyond LMSMartin Homik
 

Mais procurados (19)

D1802023136
D1802023136D1802023136
D1802023136
 
kantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptkantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.ppt
 
Ontology engineering: Ontology alignment
Ontology engineering: Ontology alignmentOntology engineering: Ontology alignment
Ontology engineering: Ontology alignment
 
taghelper-final.doc
taghelper-final.doctaghelper-final.doc
taghelper-final.doc
 
[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya
[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya
[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya
 
Novelty detection via topic modeling in research articles
Novelty detection via topic modeling in research articlesNovelty detection via topic modeling in research articles
Novelty detection via topic modeling in research articles
 
AINL 2016: Castro, Lopez, Cavalcante, Couto
AINL 2016: Castro, Lopez, Cavalcante, CoutoAINL 2016: Castro, Lopez, Cavalcante, Couto
AINL 2016: Castro, Lopez, Cavalcante, Couto
 
9789513932527[1]
9789513932527[1]9789513932527[1]
9789513932527[1]
 
Prediction of Answer Keywords using Char-RNN
Prediction of Answer Keywords using Char-RNNPrediction of Answer Keywords using Char-RNN
Prediction of Answer Keywords using Char-RNN
 
A rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationA rough set based hybrid method to text categorization
A rough set based hybrid method to text categorization
 
What to read next? Challenges and Preliminary Results in Selecting Represen...
What to read next? Challenges and  Preliminary Results in Selecting  Represen...What to read next? Challenges and  Preliminary Results in Selecting  Represen...
What to read next? Challenges and Preliminary Results in Selecting Represen...
 
Experimenting with eXtreme Design (EKAW2010)
Experimenting with eXtreme Design (EKAW2010)Experimenting with eXtreme Design (EKAW2010)
Experimenting with eXtreme Design (EKAW2010)
 
A survey on phrase structure learning methods for text classification
A survey on phrase structure learning methods for text classificationA survey on phrase structure learning methods for text classification
A survey on phrase structure learning methods for text classification
 
Taking into account communities of practice’s specific vocabularies in inform...
Taking into account communities of practice’s specific vocabularies in inform...Taking into account communities of practice’s specific vocabularies in inform...
Taking into account communities of practice’s specific vocabularies in inform...
 
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from TextCooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
 
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
 
A Survey on Unsupervised Graph-based Word Sense Disambiguation
A Survey on Unsupervised Graph-based Word Sense DisambiguationA Survey on Unsupervised Graph-based Word Sense Disambiguation
A Survey on Unsupervised Graph-based Word Sense Disambiguation
 
20051128.doc
20051128.doc20051128.doc
20051128.doc
 
Helping Students to Learn Matehmatics Beyond LMS
Helping Students to Learn Matehmatics Beyond LMSHelping Students to Learn Matehmatics Beyond LMS
Helping Students to Learn Matehmatics Beyond LMS
 

Semelhante a Non-parametric Subject Prediction Using Semantic Embeddings

A stochastic algorithm for solving the posterior inference problem in topic m...
A stochastic algorithm for solving the posterior inference problem in topic m...A stochastic algorithm for solving the posterior inference problem in topic m...
A stochastic algorithm for solving the posterior inference problem in topic m...TELKOMNIKA JOURNAL
 
G04124041046
G04124041046G04124041046
G04124041046IOSR-JEN
 
Project Presentation
Project PresentationProject Presentation
Project Presentationbutest
 
Doc format.
Doc format.Doc format.
Doc format.butest
 
AN AUTOMATED MULTIPLE-CHOICE QUESTION GENERATION USING NATURAL LANGUAGE PROCE...
AN AUTOMATED MULTIPLE-CHOICE QUESTION GENERATION USING NATURAL LANGUAGE PROCE...AN AUTOMATED MULTIPLE-CHOICE QUESTION GENERATION USING NATURAL LANGUAGE PROCE...
AN AUTOMATED MULTIPLE-CHOICE QUESTION GENERATION USING NATURAL LANGUAGE PROCE...kevig
 
Natural Language Generation / Stanford cs224n 2019w lecture 15 Review
Natural Language Generation / Stanford cs224n 2019w lecture 15 ReviewNatural Language Generation / Stanford cs224n 2019w lecture 15 Review
Natural Language Generation / Stanford cs224n 2019w lecture 15 Reviewchangedaeoh
 
Information extraction for Free Text
Information extraction for Free TextInformation extraction for Free Text
Information extraction for Free Textbutest
 
A Review Of Text Mining Techniques And Applications
A Review Of Text Mining Techniques And ApplicationsA Review Of Text Mining Techniques And Applications
A Review Of Text Mining Techniques And ApplicationsLisa Graves
 
NLP_Project_Paper_up276_vec241
NLP_Project_Paper_up276_vec241NLP_Project_Paper_up276_vec241
NLP_Project_Paper_up276_vec241Urjit Patel
 
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...kevig
 
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...ijnlc
 
NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsDimitris Kontokostas
 
A systematic study of text mining techniques
A systematic study of text mining techniquesA systematic study of text mining techniques
A systematic study of text mining techniquesijnlc
 
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONIJDKP
 
MULTI-LEARNING SPECIAL SESSION / EDUCON 2018 / EMADRID TEAM
MULTI-LEARNING SPECIAL SESSION / EDUCON 2018 / EMADRID TEAMMULTI-LEARNING SPECIAL SESSION / EDUCON 2018 / EMADRID TEAM
MULTI-LEARNING SPECIAL SESSION / EDUCON 2018 / EMADRID TEAMeMadrid network
 
Survey on contrastive self supervised l earning
Survey on contrastive self supervised l earningSurvey on contrastive self supervised l earning
Survey on contrastive self supervised l earningAnirudh Ganguly
 
Text Localization in Scientific Figures using Fully Convolutional Neural Netw...
Text Localization in Scientific Figures using Fully Convolutional Neural Netw...Text Localization in Scientific Figures using Fully Convolutional Neural Netw...
Text Localization in Scientific Figures using Fully Convolutional Neural Netw...Ansgar Scherp
 
NLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic ClassificationNLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic ClassificationEugene Nho
 
introduction to machine learning and nlp
introduction to machine learning and nlpintroduction to machine learning and nlp
introduction to machine learning and nlpMahmoud Farag
 

Semelhante a Non-parametric Subject Prediction Using Semantic Embeddings (20)

A stochastic algorithm for solving the posterior inference problem in topic m...
A stochastic algorithm for solving the posterior inference problem in topic m...A stochastic algorithm for solving the posterior inference problem in topic m...
A stochastic algorithm for solving the posterior inference problem in topic m...
 
G04124041046
G04124041046G04124041046
G04124041046
 
Project Presentation
Project PresentationProject Presentation
Project Presentation
 
Doc format.
Doc format.Doc format.
Doc format.
 
AN AUTOMATED MULTIPLE-CHOICE QUESTION GENERATION USING NATURAL LANGUAGE PROCE...
AN AUTOMATED MULTIPLE-CHOICE QUESTION GENERATION USING NATURAL LANGUAGE PROCE...AN AUTOMATED MULTIPLE-CHOICE QUESTION GENERATION USING NATURAL LANGUAGE PROCE...
AN AUTOMATED MULTIPLE-CHOICE QUESTION GENERATION USING NATURAL LANGUAGE PROCE...
 
Natural Language Generation / Stanford cs224n 2019w lecture 15 Review
Natural Language Generation / Stanford cs224n 2019w lecture 15 ReviewNatural Language Generation / Stanford cs224n 2019w lecture 15 Review
Natural Language Generation / Stanford cs224n 2019w lecture 15 Review
 
Information extraction for Free Text
Information extraction for Free TextInformation extraction for Free Text
Information extraction for Free Text
 
A Review Of Text Mining Techniques And Applications
A Review Of Text Mining Techniques And ApplicationsA Review Of Text Mining Techniques And Applications
A Review Of Text Mining Techniques And Applications
 
NLP_Project_Paper_up276_vec241
NLP_Project_Paper_up276_vec241NLP_Project_Paper_up276_vec241
NLP_Project_Paper_up276_vec241
 
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...
 
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...
 
NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology Constraints
 
A systematic study of text mining techniques
A systematic study of text mining techniquesA systematic study of text mining techniques
A systematic study of text mining techniques
 
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
 
MULTI-LEARNING SPECIAL SESSION / EDUCON 2018 / EMADRID TEAM
MULTI-LEARNING SPECIAL SESSION / EDUCON 2018 / EMADRID TEAMMULTI-LEARNING SPECIAL SESSION / EDUCON 2018 / EMADRID TEAM
MULTI-LEARNING SPECIAL SESSION / EDUCON 2018 / EMADRID TEAM
 
Survey on contrastive self supervised l earning
Survey on contrastive self supervised l earningSurvey on contrastive self supervised l earning
Survey on contrastive self supervised l earning
 
Text Localization in Scientific Figures using Fully Convolutional Neural Netw...
Text Localization in Scientific Figures using Fully Convolutional Neural Netw...Text Localization in Scientific Figures using Fully Convolutional Neural Netw...
Text Localization in Scientific Figures using Fully Convolutional Neural Netw...
 
Seminar dm
Seminar dmSeminar dm
Seminar dm
 
NLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic ClassificationNLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic Classification
 
introduction to machine learning and nlp
introduction to machine learning and nlpintroduction to machine learning and nlp
introduction to machine learning and nlp
 

Mais de Shenghui Wang

Our journey with semantic embedding
Our journey with semantic embeddingOur journey with semantic embedding
Our journey with semantic embeddingShenghui Wang
 
Linking entities via semantic indexing
Linking entities via semantic indexingLinking entities via semantic indexing
Linking entities via semantic indexingShenghui Wang
 
Semantic indexing for KOS
Semantic indexing for KOSSemantic indexing for KOS
Semantic indexing for KOSShenghui Wang
 
Contextualization of topics - browsing through terms, authors, journals and c...
Contextualization of topics - browsing through terms, authors, journals and c...Contextualization of topics - browsing through terms, authors, journals and c...
Contextualization of topics - browsing through terms, authors, journals and c...Shenghui Wang
 
Exploring a world of networked information built from free-text metadata
Exploring a world of networked information built from free-text metadataExploring a world of networked information built from free-text metadata
Exploring a world of networked information built from free-text metadataShenghui Wang
 
Ariadne's Thread -- Exploring a world of networked information built from fre...
Ariadne's Thread -- Exploring a world of networked information built from fre...Ariadne's Thread -- Exploring a world of networked information built from fre...
Ariadne's Thread -- Exploring a world of networked information built from fre...Shenghui Wang
 
Learning Concept Mappings from Instance Similarity
Learning Concept Mappings from Instance SimilarityLearning Concept Mappings from Instance Similarity
Learning Concept Mappings from Instance SimilarityShenghui Wang
 
Measuring the dynamic bi-directional influence between content and social ne...
Measuring the dynamic bi-directional influence between content and social ne...Measuring the dynamic bi-directional influence between content and social ne...
Measuring the dynamic bi-directional influence between content and social ne...Shenghui Wang
 
Similarity Features, and their Role in Concept Alignment Learning
Similarity Features, and their Role in Concept Alignment Learning Similarity Features, and their Role in Concept Alignment Learning
Similarity Features, and their Role in Concept Alignment Learning Shenghui Wang
 
What is concept dirft and how to measure it?
What is concept dirft and how to measure it?What is concept dirft and how to measure it?
What is concept dirft and how to measure it?Shenghui Wang
 
Study concept drift in political ontologies
Study concept drift in political ontologiesStudy concept drift in political ontologies
Study concept drift in political ontologiesShenghui Wang
 

Mais de Shenghui Wang (13)

Our journey with semantic embedding
Our journey with semantic embeddingOur journey with semantic embedding
Our journey with semantic embedding
 
Linking entities via semantic indexing
Linking entities via semantic indexingLinking entities via semantic indexing
Linking entities via semantic indexing
 
Semantic indexing for KOS
Semantic indexing for KOSSemantic indexing for KOS
Semantic indexing for KOS
 
Contextualization of topics - browsing through terms, authors, journals and c...
Contextualization of topics - browsing through terms, authors, journals and c...Contextualization of topics - browsing through terms, authors, journals and c...
Contextualization of topics - browsing through terms, authors, journals and c...
 
Exploring a world of networked information built from free-text metadata
Exploring a world of networked information built from free-text metadataExploring a world of networked information built from free-text metadata
Exploring a world of networked information built from free-text metadata
 
Ariadne's Thread -- Exploring a world of networked information built from fre...
Ariadne's Thread -- Exploring a world of networked information built from fre...Ariadne's Thread -- Exploring a world of networked information built from fre...
Ariadne's Thread -- Exploring a world of networked information built from fre...
 
Learning Concept Mappings from Instance Similarity
Learning Concept Mappings from Instance SimilarityLearning Concept Mappings from Instance Similarity
Learning Concept Mappings from Instance Similarity
 
Measuring the dynamic bi-directional influence between content and social ne...
Measuring the dynamic bi-directional influence between content and social ne...Measuring the dynamic bi-directional influence between content and social ne...
Measuring the dynamic bi-directional influence between content and social ne...
 
Similarity Features, and their Role in Concept Alignment Learning
Similarity Features, and their Role in Concept Alignment Learning Similarity Features, and their Role in Concept Alignment Learning
Similarity Features, and their Role in Concept Alignment Learning
 
What is concept dirft and how to measure it?
What is concept dirft and how to measure it?What is concept dirft and how to measure it?
What is concept dirft and how to measure it?
 
ICA Slides
ICA SlidesICA Slides
ICA Slides
 
ECCS 2010
ECCS 2010ECCS 2010
ECCS 2010
 
Study concept drift in political ontologies
Study concept drift in political ontologiesStudy concept drift in political ontologies
Study concept drift in political ontologies
 

Último

Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfSumit Kumar yadav
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticssakshisoni2385
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptxRajatChauhan518211
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsSumit Kumar yadav
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 

Último (20)

Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 

Non-parametric Subject Prediction Using Semantic Embeddings

  • 1. Non-parametric Subject Prediction Shenghui Wang 1 Rob Koopman 1 Gwenn Englebienne 2 1OCLC Research 2University of Twente Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 1 / 24
  • 2. What is subject indexing? Subject indexing is the act of describing or classifying a document by index terms or other symbols in order to indicate what the document is about, to summarize its content or to increase its findability. Index terms or classes are normally from established knowledge organization systems and classification systems that easily contain tens or hundreds of thousands of terms or codes Manual subject indexing is time consuming and not scalable anymore Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 2 / 24
  • 3. Automatic subject indexing The goal of automatic subject indexing is to assign a document a small subset of relevant subject labels (subjects), taken from tens or hundreds of thousands target labels. Existing approaches categorised at ISKO Encyclopedia of Knowledge Organization: Text categorisation document clustering document classification Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 3 / 24
  • 4. From a machine learning perspective Different from binary or multi-class classification problems, it is a multi-label classification problem where the target labels are neither independent nor mutually exclusive Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 4 / 24
  • 5. From a machine learning perspective Different from binary or multi-class classification problems, it is a multi-label classification problem where the target labels are neither independent nor mutually exclusive It is a form of eXtreme Multi-label Text Classification (XMTC) problem, where the prediction space normally consists of hundreds of thousands to millions of labels. Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 4 / 24
  • 6. From a machine learning perspective Different from binary or multi-class classification problems, it is a multi-label classification problem where the target labels are neither independent nor mutually exclusive It is a form of eXtreme Multi-label Text Classification (XMTC) problem, where the prediction space normally consists of hundreds of thousands to millions of labels. Challenges: data sparsity and scalability Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 4 / 24
  • 7. Existing XMTC approaches 1-vs-All classifiers (e.g. Slice, Parabel, PD-Sparse) Tree-based ensemble methods(e.g. SwiftXML, FastXML) Target-embedding methods(e.g. DEFRAG, SLEEC) Deep learning (e.g. fastText Learn Tree, XML-CNN) http://manikvarma.org/downloads/XC/XMLRepository.html Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 5 / 24
  • 8. Our two-step approach: first embed, then predict Step 1 Embedding subjects and documents in the same semantic space Step 2 Leveraging similarities for subject prediction Naive similarity based method Nonparametric (neighbourhood based) method Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 6 / 24
  • 9. Word and document embedding Words and documents are represented as vectors of numbers that represent their meaning. Semantically similar words and documents are mapped to nearby points A desirable property: computable similarity Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 7 / 24
  • 10. Embedding approaches Global co-occurrence count-based methods (e.g. LSA) Based on statistics of how often some word co-occurs with its neighbour words in a large text corpus Dimensionality reduction methods Local context predictive methods (e.g. word2vec) Learning to predict a word from its neighbours or vice versa More complex and powerful deep learning models for embedding words, sentences and documents Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 8 / 24
  • 11. Weakness of deep learning Pre-trained word embeddings may not capture domain-specific semantics Medical information retrieval, special collection exploration, etc. Computationally expensive to train from scratch Often requiring GPUs Complex optimisation, difficult to find the optimal hyperparameter settings Standard benchmarks and evaluation methods often do not answer practical needs So what is a good cost function? Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 9 / 24
  • 12. Ariadne semantic embedding Something fast and discrimitive? 1 Koopman, R., Wang, S., Englebienne, G.: Fast and Discriminative Semantic Embedding. In Proceedings of the 13th International Conference on Computational Semantics - Long Papers. pp. 235-246. ACL, Gothenburg, Sweden (May 2019) Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 10 / 24
  • 13. Ariadne semantic embedding Fast random projection to compute term embeddings Orthogonal projection to increase discrimination Term weight assignment for better document embeddings 1 1 Koopman, R., Wang, S., Englebienne, G.: Fast and Discriminative Semantic Embedding. In Proceedings of the 13th International Conference on Computational Semantics - Long Papers. pp. 235-246. ACL, Gothenburg, Sweden (May 2019) Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 10 / 24
  • 14. Fast Random Projection Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 11 / 24
  • 15. Discriminative We do not use heuristics to filter out stop words Task-specific, dataset-specific, . . . Most terms end up being similar to average language vector va va corresponds closely to stop words Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 12 / 24
  • 16. Discriminative We do not use heuristics to filter out stop words Task-specific, dataset-specific, . . . Most terms end up being similar to average language vector va va corresponds closely to stop words Our solution: Cancel out the average language vector from the embeddings (Orthogonal Projection) Use the original similarity to va inform the importance of terms (Term weight assignment) Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 12 / 24
  • 17. Orthogonal Projection Vectors are projected on the orthogonal hyperplane to an average language vector va: v∗ t = vt − (vt · va)va Removing the stop-wordiness improves the discriminating power Projected vectors are normalized Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 13 / 24
  • 18. Document embedding: term weights The more similar to the average vector, the less weight it gets wt = 1 − cos(vt, va) No need to remove stop words first Crucial to get distinctive document embeddings (weighted average) Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 14 / 24
  • 19. Automatic Subject prediction Naive method based on subject-document similarity A document is likely to be indexed with subjects that are most related to it (i.e. with highest cosine similarities) Non-Parametric (neighbourhood based) method A document is likely to be indexed similarly to the documents that are similar to it Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 15 / 24
  • 20. Dataset Metadata of one million Medline articles with abstract The training set contains 147,837 unique MESH subjects Each article is indexed with, on average, 16 MeSH subjects 222,135 subjects are used to index less than 10 article 95,218 subjects are used to index only one single article 7 subjects are each used to index more than 100K articles Humans indexes nearly 58% of the whole corpus Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 16 / 24
  • 21. Evaluation We evaluated our methods with fastText The same server with 2 Intel Xeon Silver 4109T 8-core processors and 384GB memory The minimum document frequency is 10 The dimensionality of embeddings is 256 10,000 articles for testing Measure individual precision/recall of top N predictions then take the average2 P@n = 1 n l∈rn(ˆy) yl R@n = 1 y 0 l∈rn(ˆy) yl 2 More metrics can be found in the paper Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 17 / 24
  • 22. Results 0 20 40 60 80 100 Top n 0.0 0.2 0.4 0.6 0.8 1.0 Average P@n fastText Ariadne Ariadne + NPSP 0 20 40 60 80 100 Top n 0.0 0.2 0.4 0.6 0.8 1.0 Average R@n fastText Ariadne Ariadne + NPSP Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 18 / 24
  • 23. A closer look Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 19 / 24
  • 24. Compare Ariadne, Ariadne+NPSP with fastText Actual MeSH subjects (alphabetical order) dt Ariadne dt Ariadne + NPSP dt fastText dt Acrylic Resins 920 Lenses, Intraocular 434 Humans 579975 Humans 579975 Aged 118655 Lens Implantation, Intraocular 317 Male 336647 Female 328885 Aged, 80 and over 40642 Phacoemulsification 225 Female 328885 Middle Aged 168714 Capsulorhexis 24 Visual Acuity 2026 Aged 118655 Male 336647 Female 328885 Lens Capsule, Crystalline/ Surgery 79 Lenses, Intraocular 434 Risk Factors 34538 Humans 579975 Lens Capsule, Crystalline/ Pathology 65 Lens Implantation, intraocular 317 Adult 194200 Laser therapy* 847 Visual Acuity/Physiology 949 Visual Acuity 2026 Aged 118655 Lens Capsule, Crystalline/ Pathology 65 Lens Implantation, Intraocular/Methods 90 Phacoemulsification 225 Retrospective Studies 32642 Lens Capsule, Crystalline/ Surgery* 79 Cataract Extraction/Methods 136 Middle Aged 168714 Follow Up Studies 27911 Lens Implantation, Intraocular 317 Phacoemulsification/Methods 106 Prospective Studies 25714 Aged 80 and over 40642 Male 336647 Retrospective Studies 32642 Lens Capsule, Crystalline/ Surgery 79 Incidence 11468 Middle Aged 168714 Aged 118655 Follow Up Studies 27911 Adolescent 75361 Phacoemulsification* 225 Prospective Studies 25714 Acrylic Resins 920 Young Adult 27991 Polymethyl Methacrylate 392 Aged 80 and Over 40642 Lens Capsule, Crystalline/ Pathology 65 Prospective Studies 25714 Postoperative Complications/ Pathology 351 Middle Aged 168714 Postoperative Complications 3961 Time Factors 50339 Postoperative Complications/ Surgery* 823 Male 336647 Aged 80 and Over 40642 Logistic Models 7258 Probability 2914 Cataract Extraction 350 Cataract Etiology 218 Case Control Studies 13306 Retrospective Studies 32642 Lenses, Intraocular/ Adverse Effects 62 Adult 194200 Cohort Studies 12275 Risk Factors 34538 Adolescent 75361 Retrospective Studies 32642 Treatment Outcome 40496 Sex Factors 10203 Female 328885 Visual Acuity/Physiology 949 Child 53738 Silicone Elastomers 294 Intraocular Pressure 712 Treatment Outcome 40496 Survival Rate 7975 Survival Analysis 7046 Capsulorhexis 24 Cataract Extraction 350 Prognosis 17160 Visual Acuity 2026 Pseudophakia Physiopathology 56 Lens Implantation, Intraocular/ Adverse Effects 45 Risk Assessment 9271 Adult 194200 Silicone Elastomers 294 Reoperation 3355 Risk Factors 34538 Prosthesis Design 2096 Proportional Hazards Models 3403 Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 20 / 24
  • 25. Predictions vs subjects’ document frequencies 0 5 10 15 20 document frequency (log2) 10000 5000 0 5000 10000 15000 #predictions 8491464 2583 4155 6021 7528 8144 6742 44352224 1066 407 159 69 25 6 4 2 3 1 Top 10 Targets fastText Ariadne Ariadne + NPSP Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 21 / 24
  • 26. Summary and future work Automatic subject prediction is a form of eXtreme Mult-label Text Classification problem We proposed an embedding-based approach for subject prediction We need more appropriate evaluation metrics Domain experts need to get involved Explore deep learning methods for XMTC Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 22 / 24
  • 27. Thank you Non-parametric Subject Prediction Shenghui Wang 1 Rob Koopman 1 Gwenn Englebienne 2 1OCLC Research 2University of Twente Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 23 / 24
  • 28. STS scores and train times of different methods Method Dev Test Dataset used Train time Doc2Vec (PV-DBOW) 72.2 64.9 AP-NEWS Unknown 31.0 27.0 S.E. Wikipedia 23m 53.7 49.8 MEDLINE 2h3m FastText 65.3 53.6 Wikipedia Unknown 48.6 38.9 S.E. Wikipedia 51m 52.8 41.1 MEDLINE 3h4m Sent2Vec 78.7 75.5 Twitter Unknown 68.8 62.2 S.E. Wikipedia 18m 65.4 55.0 MEDLINE 3h10m Our method 75.1 64.6 S.E. Wikipedia 17s 73.6 58.9 MEDLINE 43s Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 24 / 24