An Introduction to NLP4L - Natural Language Processing Tool for Apache Lucene: Presented by Tomoko Uchida, Rondhuit Co. Ltd.

October 13-16, 2016 • Austin, TX

An Introduction to NLP4L:  
Natural Language Processing Tool for Apache Lucene
Tomoko Uchida
Consultant, Rondhuit Co. Ltd.

3
Who am I
• Tomoko Uchida (@moco_beta)
• Luke (Lucene Toolbox) collaborator (2015 ~)
• https://github.com/DmitryKey/luke
• The best-known tool for debugging and learning about Lucene/Solr and Elasticsearch indexes :-)

4
Agenda
• What’s NLP4L?
• How NLP improves search experience
• Calculate probabilities using ShingleFilter
• Transliteration (Application for HMM)
• NLP4L Framework (coming soon)

5
Agenda
• What’s NLP4L?

6
What’s NLP4L?
• GOAL
• Improve Lucene users’ search experience
• FEATURES
• Use of Lucene index as a Corpus Database
• Lucene API Front-end written in Scala
• NLP4L provides
• Preprocessors for existing ML tools
• ML algorithms and applications (e.g. Transliteration)

9
Agenda
• How NLP improves search experience

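10
Evaluation Measures
[Figure: Venn diagram of the target and result sets: tp = in both, fp = in result only, fn = in target only, tn = in neither]
precision = tp / (tp + fp)
recall = tp / (tp + fn)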
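A minimal sketch of the two measures, assuming the target (relevant) documents and the returned result are both available as sets of document IDs. The object and method names are illustrative and not part of NLP4L.

```scala
object EvaluationMeasures {
  /** Returns (precision, recall) of a result set against a target (relevant) set. */
  def precisionRecall[A](target: Set[A], result: Set[A]): (Double, Double) = {
    val tp = target.intersect(result).size.toDouble // relevant and returned
    val fp = result.diff(target).size.toDouble      // returned but not relevant
    val fn = target.diff(result).size.toDouble      // relevant but missed
    // Guard against empty sets so we never divide by zero
    val precision = if (tp + fp == 0) 0.0 else tp / (tp + fp)
    val recall = if (tp + fn == 0) 0.0 else tp / (tp + fn)
    (precision, recall)
  }
}
```

For example, with target {1, 2, 3, 4} and result {3, 4, 5} we get tp = 2, fp = 1, fn = 2, so precision is 2/3 and recall is 0.5.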

14
Recall, Precision
[Figure: target and result sets with regions tp, fp, fn, tn]

16
Solution
• To improve recall: n-gram, synonym dictionary, etc. (e.g. Transliteration)
• To improve precision: facet (filter query) (e.g. Named Entity Extraction)
• To improve both recall and precision: Ranking Tuning

20
Gradual precision improvement
[Figure: target and result sets for q=watch]

21
[Figure: result set narrowed by the filter Gender=Men's]

22
[Figure: result set narrowed further by the filters Gender=Men's and Price=100-150]

23
Structured Documents
ID | product | price | gender
1 | CURREN New Men's Date Stainless Steel Military Sport Quartz Wrist Watch | 8.92 | Men's
2 | Suiksilver The Gamer Watch | 87.99 | Men's
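Once documents have fields such as gender and price, a filter can narrow the result set step by step. The sketch below mimics this on an in-memory list and also builds the corresponding Solr parameters (q plus one fq per filter). The field names gender and price are taken from the table header and are assumptions about the actual schema; the object itself is purely illustrative.

```scala
object FilterQueryDemo {
  final case class Product(id: Int, name: String, price: Double, gender: String)

  // The two sample rows from the slide
  val products: List[Product] = List(
    Product(1, "CURREN New Men's Date Stainless Steel Military Sport Quartz Wrist Watch", 8.92, "Men's"),
    Product(2, "Suiksilver The Gamer Watch", 87.99, "Men's")
  )

  // Each filter only removes documents; it does not change the ranking of the rest
  def filterBy(docs: List[Product], gender: String, minPrice: Double, maxPrice: Double): List[Product] =
    docs.filter(p => p.gender == gender && p.price >= minPrice && p.price <= maxPrice)

  // Equivalent Solr request parameters (not URL-encoded, field names assumed)
  def solrParams(q: String, gender: String, minPrice: Int, maxPrice: Int): String =
    s"""q=$q&fq=gender:"$gender"&fq=price:[$minPrice TO $maxPrice]"""
}
```

Calling FilterQueryDemo.solrParams("watch", "Men's", 100, 150) yields q=watch&fq=gender:"Men's"&fq=price:[100 TO 150].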

24
Unstructured Documents
ID | article
1 | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels.
2 | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants.

25
Make them Structured
ID | article | person | org | loc
1 | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels. | David Cameron | EU | Brussels
2 | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants. | (none) | EU | UK, Britain
NEE[1] extracts interesting words.
[1] Named Entity Extraction
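To make the idea concrete, here is a toy sketch that turns an article into person/org/loc fields using a hand-written dictionary lookup. This is only a stand-in: a real NEE component relies on a trained model rather than substring matching, and the gazetteer below is hypothetical.

```scala
object ToyEntityExtractor {
  // Hypothetical gazetteer: surface form -> entity type
  val gazetteer: Map[String, String] = Map(
    "David Cameron" -> "person",
    "EU" -> "org",
    "Brussels" -> "loc",
    "UK" -> "loc",
    "Britain" -> "loc"
  )

  /** Groups every gazetteer entry found in the article by its entity type. */
  def extract(article: String): Map[String, List[String]] =
    gazetteer.toList
      .filter { case (name, _) => article.contains(name) } // naive substring match
      .groupBy { case (_, kind) => kind }
      .map { case (kind, hits) => kind -> hits.map(_._1).sorted }
}
```

The resulting map can then be written to separate fields (person, org, loc) of the document so that facets and filter queries become available on text that was unstructured before.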

26
Agenda
• Calculate probabilities using ShingleFilter

27
Language Model
• A language model (LM) represents the fluency of language

29
Language Model
• The N-gram model is the most widely used LM

30
Language Model
• Calculation example for 2-gram:
P(apple | an) ≈ totalTermFreq("word2g", "an apple") / totalTermFreq("word", "an")
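The ratio of the two totalTermFreq values is the maximum-likelihood estimate of a 2-gram probability. The sketch below reproduces it without Lucene: plain in-memory maps stand in for the term frequencies of a "word" field and a "word2g" (2-gram) field, and the three sentences are the training corpus used in the Part-of-Speech Tagging slide, lower-cased and with tags removed. All names are illustrative.

```scala
object BigramModel {
  val corpus: List[String] = List(
    "alice ate an apple .",
    "mike likes an orange .",
    "an apple is red ."
  )
  private val sentences: List[List[String]] = corpus.map(_.split(" ").toList)

  // Stand-in for totalTermFreq on the "word" field
  val unigramFreq: Map[String, Int] =
    sentences.flatten.groupBy(identity).map { case (w, ws) => w -> ws.size }

  // Stand-in for totalTermFreq on the "word2g" field (adjacent word pairs)
  val bigramFreq: Map[String, Int] =
    sentences.flatMap(_.sliding(2).map(_.mkString(" ")))
      .groupBy(identity).map { case (g, gs) => g -> gs.size }

  /** Estimate of P(word | prev) = count(prev word) / count(prev). */
  def prob(prev: String, word: String): Double = {
    val denom = unigramFreq.getOrElse(prev, 0)
    if (denom == 0) 0.0 else bigramFreq.getOrElse(s"$prev $word", 0).toDouble / denom
  }
}
```

In this corpus "an" occurs 3 times and "an apple" twice, so BigramModel.prob("an", "apple") is 2/3. With a real index, the 2-gram field would be produced at analysis time by Lucene's ShingleFilter and the counts read from the index.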

31
Part-of-Speech Tagging
Our corpus for training:
Alice/NNP ate/VB an/AT apple/NNP ./.
Mike/NNP likes/VB an/AT orange/NNP ./.
An/AT apple/NNP is/VB red/JJ ./.
Tags:
NNP: Proper noun, singular
VB: Verb
AT: Article
JJ: Adjective
.: Period

33
Hidden Markov Model
Series of Words

34
Hidden Markov Model
Series of Part-of-Speech

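37
HMM state diagram
Initial probabilities: NNP 0.667, AT 0.333, VB 0.0, JJ 0.0, . 0.0
Transition probabilities: NNP → VB 0.6, NNP → . 0.4, VB → AT 0.667, VB → JJ 0.333, AT → NNP 1.0, JJ → . 1.0
Emission probabilities:
NNP: alice 0.2, apple 0.4, mike 0.2, orange 0.2
VB: ate 0.333, is 0.333, likes 0.333
AT: an 1.0
JJ: red 1.0
.: . 1.0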
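The probabilities in the state diagram can be reproduced by simple counting over the three tagged training sentences. The sketch below does this with plain Scala collections; it is not the NLP4L implementation (which stores the model in a Lucene index), and the names are illustrative.

```scala
object HmmEstimation {
  // Tagged training corpus from the slides: (word, part-of-speech) pairs, lower-cased
  val corpus: List[List[(String, String)]] = List(
    List("alice" -> "NNP", "ate" -> "VB", "an" -> "AT", "apple" -> "NNP", "." -> "."),
    List("mike" -> "NNP", "likes" -> "VB", "an" -> "AT", "orange" -> "NNP", "." -> "."),
    List("an" -> "AT", "apple" -> "NNP", "is" -> "VB", "red" -> "JJ", "." -> ".")
  )

  // Turns a list of observed events into relative frequencies
  private def relFreq[A](events: List[A]): Map[A, Double] =
    events.groupBy(identity).map { case (e, es) => e -> es.size.toDouble / events.size }

  // P(tag of the first word in a sentence)
  val initial: Map[String, Double] = relFreq(corpus.map(_.head._2))

  // P(next tag | current tag), counted over adjacent tag pairs
  val transition: Map[String, Map[String, Double]] =
    corpus.flatMap(s => s.map(_._2).sliding(2).map(p => (p(0), p(1))))
      .groupBy(_._1).map { case (from, ps) => from -> relFreq(ps.map(_._2)) }

  // P(word | tag)
  val emission: Map[String, Map[String, Double]] =
    corpus.flatten.groupBy(_._2).map { case (tag, ps) => tag -> relFreq(ps.map(_._1)) }
}
```

For instance, NNP is followed by VB three times and by "." twice, which gives the 0.6 and 0.4 in the diagram, and "apple" accounts for 2 of the 5 NNP tokens, which gives 0.4.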

38
Agenda
• Transliteration (Application for HMM)

39
Transliteration
Transliteration is the process of transcribing letters or words from one alphabet into another to facilitate comprehension and pronunciation for non-native speakers.

40
Transliteration
Examples of transliteration from English to Japanese:
computer コンピューター
server サーバー
internet インターネット
mouse マウス
information インフォメーション

41
It helps improve recall
• You search for the English word "mouse"

42
It helps improve recall
• But you get マウス (= mouse) highlighted in Japanese

43
Training data in NLP4L
train_data/alpha_katakana.txt
academy,アカデミー
accent,アクセント
access,アクセス
accident,アクシデント
acrobat,アクロバット
action,アクション
adapter,アダプター
africa,アフリカ
airbus,エアバス
alaska,アラスカ
alcohol,アルコール
allergy,アレルギー

44
Training data in NLP4L
train_data/alpha_katakana_aligned.txt (character-aligned version of train_data/alpha_katakana.txt)
アaカcaデdeミーmy
アaクcセceンnトt
アaクcセceスss
アaクcシciデdeンnトt
アaクcロroバッbaトt
アaクcショtioンn
アaダdaプpターter
アaフfリriカca
エaアirバbuスs
アaラlaスsカka
アaルlコーcohoルl
アaレlleルrギーgy

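45
Demo: Transliteration
nlp4l> :load examples/trans_katakana_alpha.scala
import scala.io.Source
import scala.util.matching.Regex
// `index` is not shown on the slide; it is assumed to be defined earlier in the script
val indexer = new HmmModelIndexer(index)
val file = Source.fromFile("train_data/alpha_katakana_aligned.txt", "UTF-8")
// one aligned unit: a run of Katakana, the Latin letters it maps to, then the rest of the line
val pattern: Regex = """([\u30A0-\u30FF]+)([a-zA-Z]+)(.*)""".r
// split an aligned line into a list of (Katakana, alphabet) pairs
def align(result: List[(String, String)], str: String): List[(String, String)] = {
  str match {
    case pattern(a, b, c) =>
      align(result :+ ((a, b)), c)
    case _ =>
      result
  }
}
// create hmm model index
file.getLines().foreach { (line: String) =>
  val doc = align(List.empty[(String, String)], line)
  indexer.addDocument(doc)
}
file.close()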
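To see what the alignment step produces without an NLP4L installation, the following standalone sketch applies the same regular expression to one line of the aligned training data. The object name and the argument order are illustrative; only the pattern is taken from the demo script.

```scala
import scala.annotation.tailrec
import scala.util.matching.Regex

object AlignDemo {
  // A run of Katakana, then the Latin letters aligned with it, then the remainder
  private val pattern: Regex = """([\u30A0-\u30FF]+)([a-zA-Z]+)(.*)""".r

  /** Splits an aligned line such as "アaカca" into (Katakana, alphabet) pairs. */
  @tailrec
  def align(str: String, acc: List[(String, String)] = Nil): List[(String, String)] =
    str match {
      case pattern(kana, alpha, rest) => align(rest, acc :+ ((kana, alpha)))
      case _ => acc // nothing left that matches: done
    }
}
```

AlignDemo.align("アaカcaデdeミーmy") returns List((ア,a), (カ,ca), (デ,de), (ミー,my)). Each pair becomes one (hidden state, observation) step in the HMM training document.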

46
Demo: Transliteration
Input | Prediction | Right Answer
アルゴリズム | algorism | algorithm
プログラム | program | (OK)
ケミカル | chaemmical | chemical
ダイニング | dining | (OK)
コミッター | committer | (OK)
エントリー | entree | entry

47
Gathering loan words
① Crawl
② Gather Katakana-Alphabet string pairs (e.g. アルゴリズム, algorithm)
③ Transliterate the Katakana string (アルゴリズム → algorism)
④ Calculate the edit distance
⑤ Store the pair of strings if the edit distance is small enough
⑥ synonyms.txt
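The edit-distance check can be sketched as follows, assuming the comparison is between the transliteration output and the English word found next to the Katakana string, and assuming Levenshtein distance as the measure. The threshold of 2 is an arbitrary illustration, not a value from the slides.

```scala
object LoanWordFilter {
  /** Levenshtein edit distance, keeping only two rows of the DP table. */
  def editDistance(a: String, b: String): Int = {
    var prev: Array[Int] = Array.tabulate(b.length + 1)(identity)
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i // deleting the first i characters of a
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(math.min(prev(j) + 1, curr(j - 1) + 1), prev(j - 1) + cost)
      }
      prev = curr
    }
    prev(b.length)
  }

  /** Keeps a pair when the predicted spelling is close enough to the English word. */
  def accept(predicted: String, english: String, maxDistance: Int = 2): Boolean =
    editDistance(predicted, english) <= maxDistance
}
```

The predicted "algorism" is two edits away from "algorithm" (substitute s with t, insert h), so the pair (アルゴリズム, algorithm) would be stored under this threshold.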

48
Agenda
• NLP4L Framework (coming soon)

49
NLP4L Framework
• A pluggable framework that improves the search experience (mainly for Lucene-based search systems).
• Reference implementations of plug-ins and corpora are provided.
• Uses NLP/ML technologies to output models, dictionaries and indexes.
• Since NLP/ML are not perfect, an interface that lets users examine the output dictionaries themselves is provided as well.

53
[Figure: NLP4L Framework architecture]
• Data Source: Corpus (text data, Lucene index), Query Log, Access Log
• NLP4L: TermExtractor, Transliteration, NEE, Classification, Document Vectors, Language Detection, Learning to Rank, Personalized Search
• Dictionaries (with maintenance): Suggestion (auto complete), Did you mean?, synonyms.txt, userdic.txt, keyword attachment
• Other outputs: Model files, Tagged Corpus, Document Vectors
• Connected systems: Solr, ES, Mahout, Spark

57
Example: Keyword Attachment
UI prototype for NLP4L Framework (Lucia): https://github.com/NLP4L/lucia
• Information about the associated Solr collection (core)
• NLP/ML task (processor) chain described in HOCON (Human-Optimized Config Object Notation)

58
Keywords extracted from all documents
(e.g. named entities found by OpenNLP)

59
• Information about the associated Solr document (unique key, etc.)
• Keywords extracted from this document
• Solr field name for each keyword

60
Check the keywords and remove
wrong / inappropriate entries

61
Synch (attach) all keywords to the Solr documents
(using the partial update command)
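A partial (atomic) update in Solr sends only the changed field together with the unique key, using a modifier such as "add", "set" or "remove". The sketch below builds such a JSON body by hand. The field names (id, person) are assumptions for illustration, and this is not the code the UI prototype uses.

```scala
object KeywordSync {
  // Minimal JSON string quoting (backslash and double quote only)
  private def quote(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  /** JSON body that adds keywords to one field of one document via an atomic update. */
  def atomicAdd(idField: String, id: String, field: String, keywords: Seq[String]): String = {
    val values = keywords.map(quote).mkString(",")
    s"""[{${quote(idField)}:${quote(id)},${quote(field)}:{"add":[$values]}}]"""
  }
}
```

KeywordSync.atomicAdd("id", "1", "person", Seq("David Cameron")) produces [{"id":"1","person":{"add":["David Cameron"]}}], which would be posted to the collection's update handler. Removing a keyword later would use the "remove" modifier in the same way.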

62
Solr document (before keywords are attached)

63
Solr document (after keywords are attached)

64
If you delete keywords that have already been attached to Solr documents, they will also be removed from the Solr index the next time the “synch” action is executed.

65
Keyword Attachment Application
[Figure: a keyword attached to a Lucene doc increases its boost]
• “Keyword attachment” is a general format that enables the following functions:
• Learning to Rank
• Personalized Search
• Named Entity Extraction
• Document Classification

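66
Before Learning to Rank
[Figure: target and result sets, showing documents at ranks 1, 2, 3, … and 50, 100, 500, …]

67
After Learning to Rank
[Figure: target and result sets, showing documents at ranks 1, 2, 3, … and 50, 100, 500, …]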

68
Learning to Rank
• The program learns, from the access log and other sources, that the score of document d for a query q should be larger than the normal score(q,d)
[Figure: queries q, q, … attached to Lucene doc d]
https://en.wikipedia.org/wiki/Learning_to_rank

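69
Personalized Search
[Figure: target and result sets for q=apple (computer …), showing documents at ranks 1, 2, 3, … and 50, 100, 500, …]

70
Personalized Search
[Figure: target and result sets for q=apple (fruit …), showing documents at ranks 50, 100, 500, … and 1, 2, 3, …]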

71
Personalized Search
[Figure: Lucene doc d1 with attached keywords q1u1, q2u2; Lucene doc d2 with attached keywords q2u1, q1u2]
• The program learns, from the access log and other sources, that the score of document d for a query q by user u should be larger than the normal score(q,d)
• Since Lucene does not let you specify score(q,d,u), you have to specify score(qu,d) instead.
• Limit the data to high-order queries, or divide fields by user, since the number of q-u combinations can be enormous.
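The score(qu,d) idea can be sketched by joining the query and the user into a single keyword that is attached to the document and matched at search time. Everything below is an assumption made for illustration: the token format, the multiplicative boost and the object name are hypothetical, and the attached keywords simply mirror the d1/d2 figure.

```scala
object PersonalizedBoost {
  // Hypothetical token format: query and user id joined into one keyword
  def token(query: String, user: String): String = s"${query}__$user"

  // Keywords attached to each document (learned offline, e.g. from access logs)
  val attached: Map[String, Set[String]] = Map(
    "d1" -> Set(token("q1", "u1"), token("q2", "u2")),
    "d2" -> Set(token("q2", "u1"), token("q1", "u2"))
  )

  /** Raises the base score when the (query, user) keyword is attached to the document. */
  def score(baseScore: Double, doc: String, query: String, user: String, boost: Double = 2.0): Double =
    if (attached.getOrElse(doc, Set.empty[String]).contains(token(query, user))) baseScore * boost
    else baseScore
}
```

With these keywords, query q1 boosts d1 for user u1 but d2 for user u2, so the same query ranks documents differently per user without a three-argument scoring function.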

72
Example: Generating Synonyms (loanwords)
Execute a job that generates pairs of Katakana and corresponding English words from a corpus

73
Adjust the auto-generated pairs (candidate synonyms) via the web UI

74
Exported pairs can be used in SynonymFilter
synonyms_loadwords_ja.txt
acacia,アカシア
acatenango,アカテナンゴ
access,アクセス
active,アクティブ
activision,アクティビジョン
acton,アクトン
actor,アクター
……

75
Join and Code with Us!
https://github.com/NLP4L
Contact us at koji at apache dot org for the details.
