In the early 1990s, the term 'semantic' appeared in the context of text retrieval tools. However, the idea of semantics was there from the very beginning of Information Retrieval as a research field (i.e. the computer-assisted identification of relevant documents), as the articles of Vannevar Bush (As We May Think) and Luhn (The Automatic Creation of Literature Abstracts) in the 1940s and '50s show.
So where are we now in terms of semantics? The `latent semantic indexing` of the 1990s faded away, and the first decade of the millennium enthusiastically studied semantic web technologies. Now, in the second decade, `deep learning` is the new star. In this talk I will give a high-level overview of what has been done already, particularly in the context of the patent domain, what the main techniques are, and in which directions the scientific community is looking today. Ultimately, there will be no single answer to the question 'What is semantic search?'. Instead, my aim is to empower the audience to ask the right questions the next time somebody mentions the term.
II-SDV 2017: Semantic Search Jargon - A short Guide
1. Semantic Search Jargon - a short guide
Mihai Lupu
TU Wien / RSA Data Science
mihai.lupu@researchstudio.at
2. "Semantic"
▪ adjective
  - dictionary.com: of, relating to, or arising from the different meanings of words or other symbols
  - Merriam-Webster: of or relating to the meanings of words and phrases
  - Cambridge: connected with the meanings of words
  - Oxford: connected with the meaning of words and sentences
7. The geometric metaphor of meaning
"Meanings are locations in a semantic space, and semantic similarity is proximity between the locations"
(Sahlgren, 2006)
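The metaphor can be made concrete with a toy example. The sketch below is only an illustration: the three-dimensional "semantic space" and its coordinates are invented, and cosine similarity is used as one common way to measure proximity between locations.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical locations of three terms; the coordinates are made up.
space = {
    "crystal": [0.9, 0.1, 0.0],
    "crystals": [0.8, 0.2, 0.1],
    "bolster": [0.0, 0.2, 0.9],
}

def nearest(term, space):
    """Rank every other term by its proximity to `term`, closest first."""
    ranked = [
        (cosine_similarity(space[term], vec), other)
        for other, vec in space.items()
        if other != term
    ]
    return sorted(ranked, reverse=True)

for score, other in nearest("crystal", space):
    print(f"{score:.4f}:{other}")
```

With these invented coordinates, "crystals" ranks far closer to "crystal" than "bolster" does, which is all the metaphor claims.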
9. and others
pure counting
term frequency
position in sentence
SMART
IDF
cosine similarity
and many more
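As an illustration of the counting-based techniques listed above, here is a minimal sketch of term frequency combined with IDF. It assumes raw counts for tf and log(N / df) for idf, which is only one of many weighting variants; the three documents are invented.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Invented toy documents.
docs = [
    "wear plate in a side frame",
    "a wedge in a side frame",
    "a liquid crystal display",
]
weights = tf_idf(docs)
for term, weight in sorted(weights[0].items(), key=lambda item: -item[1]):
    print(f"{weight:.3f} {term}")
```

A term that occurs in every document (here "a") gets weight zero, while a term unique to one document (here "wear") gets the highest weight: pure counting already separates informative words from uninformative ones.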
[Figure: timeline "from counting to predicting", axis ticks 195 to 202 (apparently decades, 1950s to 2020s). Labels on the timeline: Latent Semantic Analysis; Random Indexing; WWW appears; Semantic Web appears; Deep Learning (Speech, Vision, NLP, IR); The Golden Age of Artificial Intelligence; Expert Systems; Knowledge bases (e.g. Cyc); Inference on billions of tuples, on trillions; Probabilistic models for IR; Language Models]
10. where are we now?
▪ Inference directly from text
▪ [Bowman et al. 2016]

Sentence pairs:
- A man rides a bike on a snow covered road / A man is outside
- 2 female babies eating chips / Two female babies are enjoying chips
- A man in an apron shopping at a market / A man in an apron is preparing dinner

Model                                                       % Accuracy
Feature-based classifier                                    78.2
Previous SOTA sentence encoder [Mou et al. 2016]            82.1
LSTM RNN sequence model                                     80.6
Tree LSTM                                                   80.9
SPINN                                                       83.2
SOTA (sentence pair alignment model) [Parikh et al. 2016]   86.8
Particular success cases:
Negation:
- The rhythmic gymnast completes her floor exercise at the competition
- The gymnast cannot finish her exercise
Long examples (>20 words):
- A man wearing glasses and a ragged costume is playing a Jaguar electric
guitar and singing with the accompaniment of a drummer
- A man with glasses and a disheveled outfit is playing a guitar and singing
along with a drummer.
12. Where are we for patents?
▪ Latent Semantic Indexing
  - Some commercial systems claim to use it
    ▪ "Latent semantic analysis uses sophisticated statistical analysis of language to search on concepts, not just words, to help you find those documents - even if they don't contain any of the words you used in your search"
  - Minimal improvements found in experiments
    ▪ [Moldovan:2005]
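The "documents that don't contain any of the words you used" claim can be illustrated with a toy example. The sketch below is not any commercial system's implementation: it assumes a raw count term-document matrix, a rank-2 truncated SVD, and projection with the left singular vectors only (other variants also scale by the singular values). The vocabulary and documents are invented.

```python
import numpy as np

terms = ["car", "automobile", "engine", "crystal", "liquid"]
docs = [
    ["car", "engine"],
    ["automobile", "engine"],
    ["liquid", "crystal"],
    ["crystal", "liquid"],
]

# Term-document count matrix (rows: terms, columns: documents).
A = np.array([[doc.count(t) for doc in docs] for t in terms], dtype=float)

# Truncated SVD: keep only the k strongest latent dimensions.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]

def to_latent(term_vector):
    """Project a vector from term space into the k-dimensional latent space."""
    return U_k.T @ term_vector

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

# Query consisting of the single word "car".
query = np.array([1.0 if t == "car" else 0.0 for t in terms])

raw_scores = [cosine(query, A[:, j]) for j in range(len(docs))]
lsi_scores = [cosine(to_latent(query), to_latent(A[:, j])) for j in range(len(docs))]

print("raw term space:", np.round(raw_scores, 3))
print("latent space:  ", np.round(lsi_scores, 3))
```

In the raw term space the second document ("automobile engine") scores zero for the query "car"; in the latent space it scores as high as the first, because both load on the same latent dimension through the shared word "engine". This shows the mechanism only and says nothing about retrieval quality on real patent collections.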
13. Random Indexing
▪ Initial experiments using the Semantic Vectors package
  - Unsatisfactory results for document similarity
  - Noticeably good results for term similarity
[Figure panels: Term vectors; Document vectors]
[Lupu et al.:2013]
Nearest terms per query term (similarity:term):

1.0:coatings
0.9999339:rubs
0.9999338:coating
0.9999328:acrylics
0.9999271:vinyls
0.9999268:cratering
0.9999251:distinctness
0.9999246:blistering
0.9999235:pompano
0.9999234:cyanamid

1.0:crystal
0.9999378:cyrstal
0.9999305:crytal
0.9999022:nicol // a type of prism
0.9999014:jjap
0.9999006:nicols
0.9998996:nematic // a type of liquid crystal
0.9998943:uniaxial // minerals that form crystals used in optics
0.9998894:cb15 // a particular liquid crystal
0.9998887:anisotropy

1.0:crystals
0.9998632:supersaturation
0.9998519:crystallizing
0.9998281:supersaturated
0.9998213:crys
0.9998193:purer
0.9998166:soda
0.9998120:crystallize
0.9998105:crystallizers
0.9998081:tals
[Lupu et al.:2013]
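For readers unfamiliar with the technique, the following is a minimal sketch of Random Indexing for term similarity. It is not the Semantic Vectors package or its API; the dimensionality, the number of non-zero entries, the window size, and the toy sentences are all assumptions chosen for illustration.

```python
import numpy as np

def random_indexing(sentences, dim=512, nonzero=8, window=2, seed=0):
    """Return {term: context vector} built by Random Indexing."""
    rng = np.random.default_rng(seed)
    vocab = sorted({tok for sent in sentences for tok in sent})
    # 1. Each term gets a sparse ternary index vector: a few +1/-1 entries, the rest 0.
    index = {}
    for term in vocab:
        vec = np.zeros(dim)
        positions = rng.choice(dim, size=nonzero, replace=False)
        vec[positions] = rng.choice([-1.0, 1.0], size=nonzero)
        index[term] = vec
    # 2. A term's context vector is the sum of the index vectors of its neighbours.
    context = {term: np.zeros(dim) for term in vocab}
    for sent in sentences:
        for i, term in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    context[term] += index[sent[j]]
    return context

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

# Invented toy corpus: "coating" and "coatings" occur in the same contexts.
sentences = [
    ["acrylic", "coating", "resists", "blistering"],
    ["acrylic", "coatings", "resists", "blistering"],
    ["liquid", "crystal", "shows", "anisotropy"],
]
vectors = random_indexing(sentences)
sim_same = cosine(vectors["coating"], vectors["coatings"])
sim_other = cosine(vectors["coating"], vectors["crystal"])
print(f"coating ~ coatings: {sim_same:.4f}")
print(f"coating ~ crystal:  {sim_other:.4f}")
```

Terms that share contexts end up with nearly identical context vectors, which is consistent with the spelling variants and inflections dominating the neighbour lists above. No matrix decomposition is needed, which is the main practical appeal of the method.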
19. documents are too large
Particular success cases:
Negation:
- The rhythmic gymnast completes her floor exercise at the competition
- The gymnast cannot finish her exercise
Long examples (>20 words):
- A man wearing glasses and a ragged costume is playing a Jaguar electric
guitar and singing with the accompaniment of a drummer
- A man with glasses and a disheveled outfit is playing a guitar and singing
along with a drummer.
20. words are too simple
"In a railroad car truck, a windowed side frame, a bolster extending through the window, a wedge pocket in said bolster having an upwardly and outwardly inclined floor in opposition to a vertical wear surface on the side frame, a stabilizing wedge in the pocket having a vertical friction surface in contact with the wear surface on the side frame and an inclined wedging surface in opposition to the floor of the pocket, a removable wear plate inset in a recess in said inclined floor, said recess having a horizontal lower edge, said wear plate having an inclined lower edge formed and adapted to engage and be supported on said horizontal lower edge of said recess, said wear plate being held in said recess by a weldment located between the upper edge of said recess and the lower edge of said wear plate, and, a spring biasing the wedge upwardly against the removable wear plate to cam the wedge laterally against the wear surface on the side frame."
How much of the patent corpus is covered by the CELEX lexical database?
[Verberne et al., 2010]

          Patent data   COBUILD corpus
Tokens    96%           92%
Types     55%           (?)
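The gap between token coverage and type coverage is easy to reproduce on a small scale. The sketch below is an illustration only: the lexicon is a hypothetical stand-in for CELEX and the token sequence is invented.

```python
from collections import Counter

def lexicon_coverage(tokens, lexicon):
    """Return (token coverage, type coverage) of `tokens` by `lexicon`."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    counts = Counter(tokens)
    # Token coverage counts every occurrence; type coverage counts each distinct word once.
    covered_tokens = sum(n for word, n in counts.items() if word in lexicon)
    covered_types = sum(1 for word in counts if word in lexicon)
    return covered_tokens / len(tokens), covered_types / len(counts)

# Hypothetical general-language lexicon and an invented patent-like token sequence.
lexicon = {"a", "the", "in", "frame", "side", "wedge"}
tokens = "a wedge in the side frame a weldment in the side frame a cb15 bolster".split()

token_cov, type_cov = lexicon_coverage(tokens, lexicon)
print(f"tokens covered: {token_cov:.0%}")
print(f"types covered:  {type_cov:.0%}")
```

Frequent general-language words dominate the token count, so token coverage looks high, while the rare domain-specific words that the lexicon misses make up a large share of the distinct types.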
22. words are too simple
Query Generation [Andersson:2016]
  - Baseline, NLP-based (words, phrases) and statistical (unigrams, bigrams)
  - Claims section or entire document
  - Termhood
    ▪ Experiment to learn termhood, two sample sets:
      - 637 with C-value and 4,400 without C-value
    ▪ Upper boundary (manual list) versus machine learning
    ▪ Skip-gram versus exact phrase
    ▪ Technical terms versus non-technical terms
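For readers who have not met the C-value, the sketch below implements its commonly cited formulation for multi-word term candidates: log2 of the candidate length times its frequency, reduced by the average frequency of the longer candidates that contain it. Whether the experiments above used exactly this variant is an assumption, and the candidate frequencies are invented.

```python
import math

def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous sub-sequence of the strictly longer `longer`."""
    n = len(shorter)
    return len(longer) > n and any(
        longer[i:i + n] == shorter for i in range(len(longer) - n + 1)
    )

def c_values(freq):
    """Map each candidate (a tuple of words) to its C-value.

    Single-word candidates score 0 because log2(1) = 0; the measure
    is meant for multi-word terms.
    """
    scores = {}
    for cand, f in freq.items():
        nesting = [freq[other] for other in freq if contains(other, cand)]
        if nesting:
            # Discount occurrences that are explained by longer candidate terms.
            f = f - sum(nesting) / len(nesting)
        scores[cand] = math.log2(len(cand)) * f
    return scores

# Invented candidate frequencies.
freq = {
    ("wear", "plate"): 10,
    ("removable", "wear", "plate"): 4,
    ("side", "frame"): 6,
}
scores = c_values(freq)
for cand, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{score:.2f} {' '.join(cand)}")
```

Here "wear plate" is discounted for the four occurrences that belong to "removable wear plate", so the longer phrase ends up ranked slightly higher despite its lower raw frequency.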
24. Artificial Intelligence - Will it ever come?
a machine will pass the Turing test by 2029
(Kurzweil 1999, pp. 189-235.)
* The Turing Test does not
specify the use of patents
in the conversation
26. Glossary
▪ CBOW: Continuous Bag-of-Words
▪ DBPedia: automatically extracted knowledge resource from Wikipedia
▪ dimensionality reduction: any procedure that takes as input a vector of size N and outputs a vector of size M<N
▪ feed-forward: a particular type of neural network, which does not contain cycles between its neurons
▪ hypernym: a term denoting a broader category than another
▪ hyponym: a term denoting a narrower category than another
▪ LOD: Linked Open Data
▪ LSA: Latent Semantic Analysis
▪ LSI: Latent Semantic Indexing
▪ LSTM: Long Short-Term Memory
▪ matrix decomposition: a mathematical procedure to represent a matrix as the product of two or more matrices
▪ matrix factorization: matrix decomposition
▪ neural networks: an algorithmic model (loosely) simulating brain structures
▪ ontology: (here) a knowledge representation resource
▪ OWL: Web Ontology Language
▪ PCA: Principal Component Analysis
▪ PMI: Pointwise Mutual Information
▪ RDF: Resource Description Framework
▪ recurrent nn: a particular type of neural network, which contains cycles between its neurons
▪ RI: Random Indexing
▪ skip-grams: method to predict a context from a word
▪ SVD: Singular Value Decomposition
▪ WordNet: a large lexical database of English