An Introduction to
NLP4L
Natural Language Processing tool for
Apache Lucene
Koji Sekiguchi @kojisays
Founder & CEO, RONDHUIT
My contributions
• CharFilter framework & MappingCharFilter
• FastVectorHighlighter
Agenda
• What's NLP4L?
• How NLP improves search experience
• Calculate probabilities using ShingleFilter
• Transliteration (Application for HMM)
• NLP4L Framework (coming soon)
What's NLP4L?
• GOAL
  • Improve Lucene users' search experience
• FEATURES
  • Use of a Lucene index as a corpus database
  • Lucene API front-end written in Scala
  • NLP4L provides
    • Preprocessors for existing ML tools
    • ML algorithms and applications (e.g. Transliteration)
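Evaluation Measures
[Figure: two overlapping sets, "target" and "result". tp = in both, fp = in result only, fn = in target only, tn = in neither]
precision = tp / (tp + fp)
recall = tp / (tp + fn)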
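A minimal sketch of the two formulas, assuming the target (relevant documents) and the result (returned documents) are available as sets of document IDs. The function name and the IDs are hypothetical and only illustrate the arithmetic.

```python
def precision_recall(target, result):
    """Return (precision, recall) for relevant docs (target) and returned docs (result)."""
    target, result = set(target), set(result)
    tp = len(target & result)   # relevant and returned
    fp = len(result - target)   # returned but not relevant
    fn = len(target - result)   # relevant but not returned
    precision = tp / (tp + fp) if result else 0.0
    recall = tp / (tp + fn) if target else 0.0
    return precision, recall

# Hypothetical document IDs, for illustration only.
target = {1, 2, 3, 4}
result = {3, 4, 5, 6, 7, 8}
precision, recall = precision_recall(target, result)
```

With these made-up sets, tp = 2, fp = 4 and fn = 2, so precision is 2/6 and recall is 2/4.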
Recall, Precision
[Figure: the target/result diagram and the two formulas above, repeated on two slides; the symbols that followed "Recall" and "Precision" in the slide titles did not survive extraction]
Solution
• n-gram, synonym dictionary, etc. (e.g. Transliteration): recall
• facet (filter query) (e.g. Named Entity Extraction): precision
• Ranking Tuning: recall, precision
Gradual precision improvement
• q=watch
• filter by Gender=Men's
• filter by Price=100-150
[Figure on each of the three slides: the "target" and "result" sets]
Structured Documents
ID | product | price | gender
1 | CURREN New Men's Date Stainless Steel Military Sport Quartz Wrist Watch | 8.92 | Men's
2 | Suiksilver The Gamer Watch | 87.99 | Men's
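A small sketch of why filters on structured fields raise precision step by step. The catalog, the target set and the helper names are all hypothetical; a real system would use Lucene/Solr filter queries rather than Python list filtering.

```python
# Hypothetical catalog; every row here is made up for illustration.
products = [
    {"id": 1, "name": "sport watch", "price": 120.0, "gender": "Men's"},
    {"id": 2, "name": "dress watch", "price": 140.0, "gender": "Men's"},
    {"id": 3, "name": "gamer watch", "price": 87.99, "gender": "Men's"},
    {"id": 4, "name": "slim watch", "price": 110.0, "gender": "Women's"},
    {"id": 5, "name": "watch strap", "price": 15.0, "gender": "Women's"},
]
target = {1, 2}  # what this hypothetical user actually wants


def precision(result_ids):
    result_ids = set(result_ids)
    return len(result_ids & target) / len(result_ids) if result_ids else 0.0


def search(query, filters=()):
    """Keyword match on the name, then apply each filter predicate in turn."""
    hits = [p for p in products if query in p["name"]]
    for keep in filters:
        hits = [p for p in hits if keep(p)]
    return [p["id"] for p in hits]


def is_mens(p):
    return p["gender"] == "Men's"


def in_price_range(p):
    return 100 <= p["price"] <= 150


step1 = search("watch")
step2 = search("watch", [is_mens])
step3 = search("watch", [is_mens, in_price_range])
precisions = [precision(s) for s in (step1, step2, step3)]
```

Each added filter shrinks the result set while keeping the wanted documents, so precision rises from step to step in this toy data.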
Unstructured Documents
ID | article
1 | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels.
2 | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants.
Make them Structured
ID | article | person | org | loc
1 | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels. | David Cameron | EU | Brussels
2 | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants. | | EU | UK, Britain
NEE[1] extracts interesting words.
[1] Named Entity Extraction
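To illustrate the idea of turning an article into structured fields, here is a deliberately simple sketch that uses a fixed word list (a gazetteer) instead of a trained model. The word list, the field assignment of "UK" and "Britain" to loc, and the function name are assumptions for illustration; this is not how NLP4L itself extracts entities.

```python
import re

# Hypothetical hand-made gazetteer; a real NEE system would usually learn
# from a tagged corpus instead of relying on a fixed word list.
GAZETTEER = {
    "person": ["David Cameron"],
    "org": ["EU"],
    "loc": ["Brussels", "UK", "Britain"],
}


def extract_entities(text):
    """Return {field: [names found in text]} using whole-word matching."""
    fields = {}
    for field, names in GAZETTEER.items():
        fields[field] = [
            name for name in names
            if re.search(r"\b" + re.escape(name) + r"\b", text)
        ]
    return fields


article = ("He wants to renegotiate the terms of the UK's membership ahead of a "
           "referendum by the end of 2017. He has said he will campaign for "
           "Britain to remain in the EU if he gets the reforms he wants.")

# The extracted names become extra fields next to the original text.
doc = {"id": 2, "article": article, **extract_entities(article)}
```

Once the names are stored as separate fields, they can be used for facets and filter queries like any other structured field.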
Manual Tagging using brat
Language Model
• LM represents the fluency of language
• N-gram model is the most widely used LM
• Calculation example for 2-gram
P(apple | an) = totalTermFreq("word2g", "an apple") / totalTermFreq("word", "an")
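The 2-gram estimate divides the number of times the word pair occurs by the number of times its first word occurs. The sketch below shows that arithmetic on a three-sentence toy corpus; the two `Counter` objects merely stand in for the term frequencies that the slides read from a Lucene index, and the function name is hypothetical.

```python
from collections import Counter

# Toy corpus, lowercased, with punctuation dropped for simplicity.
sentences = [
    "alice ate an apple",
    "mike likes an orange",
    "an apple is red",
]

unigrams = Counter()
bigrams = Counter()
for sentence in sentences:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))


def bigram_probability(w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1) = count(w1 w2) / count(w1)."""
    if unigrams[w1] == 0:
        return 0.0
    return bigrams[(w1, w2)] / unigrams[w1]


p = bigram_probability("an", "apple")
```

Here "an" occurs three times and "an apple" twice, so the estimate is 2/3. In an index-backed setup the pair counts would come from a field of word 2-grams, which is what a shingle-producing analyzer gives you.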
Part-of-Speech Tagging
Our Corpus for training
Alice/NNP ate/VB an/AT apple/NNP ./.
Mike/NNP likes/VB an/AT orange/NNP ./.
An/AT apple/NNP is/VB red/JJ ./.
Tags:
NNP Proper noun, singular
VB Verb
AT Article
JJ Adjective
. period
Hidden Markov Model
[Figure built up over five slides; surviving labels: "Series of Words", "Series of Part-of-Speech"]
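HMM state diagram
Initial probabilities
NNP 0.667 | AT 0.333 | VB 0.0 | JJ 0.0 | . 0.0
Transition probabilities
NNP → VB 0.6 | NNP → . 0.4
VB → AT 0.667 | VB → JJ 0.333
AT → NNP 1.0
JJ → . 1.0
Emission probabilities
NNP: alice 0.2 | apple 0.4 | mike 0.2 | orange 0.2
VB: ate 0.333 | is 0.333 | likes 0.333
AT: an 1.0
JJ: red 1.0
.: . 1.0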
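The probabilities in the state diagram can be reproduced by plain counting over the three tagged training sentences from the Part-of-Speech Tagging slide. The sketch below is one way to do that; the variable and function names are hypothetical, and no smoothing is applied.

```python
from collections import Counter, defaultdict

corpus = [
    "Alice/NNP ate/VB an/AT apple/NNP ./.",
    "Mike/NNP likes/VB an/AT orange/NNP ./.",
    "An/AT apple/NNP is/VB red/JJ ./.",
]


def parse(line):
    """Split 'word/TAG' tokens into (lowercased word, tag) pairs."""
    pairs = []
    for token in line.split():
        word, tag = token.rsplit("/", 1)
        pairs.append((word.lower(), tag))
    return pairs


initial = Counter()                 # tag of the first word of each sentence
transition = defaultdict(Counter)   # previous tag -> next tag counts
emission = defaultdict(Counter)     # tag -> word counts
for line in corpus:
    pairs = parse(line)
    initial[pairs[0][1]] += 1
    for word, tag in pairs:
        emission[tag][word] += 1
    for (_, prev_tag), (_, next_tag) in zip(pairs, pairs[1:]):
        transition[prev_tag][next_tag] += 1


def normalize(counter):
    """Turn raw counts into relative frequencies."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}


initial_p = normalize(initial)
transition_p = {tag: normalize(c) for tag, c in transition.items()}
emission_p = {tag: normalize(c) for tag, c in emission.items()}
```

For example, NNP is followed by VB three times and by "." twice, which gives the 0.6 and 0.4 shown in the diagram. Tags that never start a sentence simply do not appear in `initial_p`, which corresponds to the 0.0 entries.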
Transliteration
Transliteration is a process of transcribing letters or words from one
alphabet to another one to facilitate comprehension and pronunciation
for non-native speakers.
Examples of transliteration from English to Japanese:
computer → コンピューター
server → サーバー
internet → インターネット
mouse → マウス
information → インフォメーション
It helps improve recall
• You search for the English word "mouse"
• But you get マウス (= mouse), highlighted in Japanese
Training data in NLP4L
train_data/alpha_katakana.txt
academy,アカデミー
accent,アクセント
access,アクセス
accident,アクシデント
acrobat,アクロバット
action,アクション
adapter,アダプター
africa,アフリカ
airbus,エアバス
alaska,アラスカ
alcohol,アルコール
allergy,アレルギー
train_data/alpha_katakana_aligned.txt
アaカcaデdeミーmy
アaクcセceンnトt
アaクcセceスss
アaクcシciデdeンnトt
アaクcロroバッbaトt
アaクcショtioンn
アaダdaプpターter
アaフfリriカca
エaアirバbuスs
アaラlaスsカka
アaルlコーcohoルl
アaレlleルrギーgy
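Judging only from the sample lines shown, each aligned entry interleaves a run of katakana with the run of Latin letters it corresponds to. The sketch below splits such a line into chunk pairs under that assumption; the regular expression and function names are hypothetical and are not taken from NLP4L.

```python
import re

# One or more katakana characters (U+30A0..U+30FF, which includes the long
# vowel mark) followed by one or more ASCII letters.
CHUNK = re.compile(r"([\u30A0-\u30FF]+)([A-Za-z]+)")


def parse_aligned(line):
    """Split an aligned line into (katakana, alphabet) chunk pairs."""
    return CHUNK.findall(line.strip())


def unaligned(line):
    """Rebuild the 'alphabet,katakana' form from an aligned line."""
    chunks = parse_aligned(line)
    katakana = "".join(k for k, _ in chunks)
    alphabet = "".join(a for _, a in chunks)
    return f"{alphabet},{katakana}"


chunks = parse_aligned("アaカcaデdeミーmy")
```

Chunk pairs like these are the kind of unit an HMM can treat as hidden state and observation when it is trained for transliteration.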
Demo: Transliteration
nlp4l> :load examples/trans_katakana_alpha.scala
Input | Prediction | Right Answer
アルゴリズム | algorism | algorithm
プログラム | program | (OK)
ケミカル | chaemmical | chemical
ダイニング | dining | (OK)
コミッター | committer | (OK)
エントリー | entree | entry
Gathering loan words (diagram with steps ① to ⑥)
• crawl
• gather Katakana-Alphabet string pairs, e.g. アルゴリズム, algorithm
• Transliteration: アルゴリズム → algorism
• calculate edit distance
• store the pair of strings in synonyms.txt if the edit distance is small enough
Got 1,800+ records of synonym knowledge from jawiki
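The filtering step can be sketched as follows, assuming the edit distance is the usual Levenshtein distance between the transliteration model's prediction and the alphabet string found by crawling. The threshold value, the function names and the comma-separated output line are assumptions for illustration; the slides do not state them.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


MAX_DISTANCE = 2  # hypothetical threshold, not taken from the slides


def synonym_line(katakana, alphabet, predicted):
    """Return a 'katakana,alphabet' line if the prediction is close enough, else None."""
    if edit_distance(predicted, alphabet) <= MAX_DISTANCE:
        return f"{katakana},{alphabet}"
    return None


line = synonym_line("アルゴリズム", "algorithm", "algorism")
```

The prediction "algorism" is two edits away from "algorithm", so this pair would be kept under the assumed threshold, while an unrelated crawled pair would be dropped.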
NLP4L Framework
• A pluggable framework that improves the search experience (mainly for Lucene-based search systems).
• Reference implementations of plug-ins and corpora are provided.
• Uses NLP/ML technologies to output models, dictionaries and indexes.
• Since NLP/ML are not perfect, an interface that lets users examine the output dictionaries themselves is provided as well.
[Architecture diagram; surviving labels follow]
Solr, ES, Mahout, Spark
Data Source
・Corpus (Text data, Lucene index)
・Query Log
・Access Log
Dictionaries
・Suggestion (auto complete)
・Did you mean?
・synonyms.txt
・userdic.txt
・keyword attachment
maintenance
Model files
Tagged Corpus
Document Vectors
・TermExtractor
・Transliteration
・NEE
・Classification
・Document Vectors
・Language Detection
・Learning to Rank
・Personalized Search
Keyword Attachment
• Keyword attachment is a general format that enables the following functions.
  • Learning to Rank
  • Personalized Search
  • Named Entity Extraction
  • Document Classification
[Diagram: Lucene doc → Lucene doc with an attached keyword (↑ Increase boost)]
Before Learning to Rank / After Learning to Rank
[Two diagrams: the "target" and "result" sets, with documents labeled by rank: 1, 2, 3, … and 50, 100, 500, …]
Learning to Rank
• The program learns, from the access log and other sources, that the score of document d for a query q should be larger than the normal score(q,d)
[Diagram: Lucene doc d with attached keywords q, q, …]
https://en.wikipedia.org/wiki/Learning_to_rank
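A toy sketch of the idea of attaching queries to a document so that its score for those queries rises. It uses a simple click-count heuristic in place of a real learning-to-rank model; the log format, the threshold, the boost formula and all names are hypothetical.

```python
from collections import Counter


def learn_attachments(click_log, min_clicks=2):
    """Map doc id -> Counter of queries whose results led to clicks on that doc.

    click_log is a list of (query, clicked_doc_id) pairs; min_clicks is a
    hypothetical noise threshold.
    """
    counts = Counter(click_log)
    attached = {}
    for (query, doc_id), n in counts.items():
        if n >= min_clicks:
            attached.setdefault(doc_id, Counter())[query] = n
    return attached


def boosted_score(base_score, query, doc_id, attached, boost=0.5):
    """Raise the normal score(q, d) when q is attached to d."""
    clicks = attached.get(doc_id, Counter())[query]
    return base_score * (1.0 + boost * clicks)


# Hypothetical access log.
log = [("watch", "d7"), ("watch", "d7"), ("watch", "d7"), ("watch", "d2")]
attached = learn_attachments(log)
score_d7 = boosted_score(1.0, "watch", "d7", attached)
score_d2 = boosted_score(1.0, "watch", "d2", attached)
```

In this made-up log, d7 was clicked three times for "watch" and gets the query attached, so its score rises; d2 was clicked once, stays below the threshold and keeps its normal score.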
Personalized Search
[Diagram 1: q=apple, "computer …"; target and result sets with rank labels 1, 2, 3, … and 50, 100, 500, …]
[Diagram 2: q=apple, "fruit …"; the same sets with the two groups of rank labels in the opposite order]
Personalized Search
• The program learns, from the access log and other sources, that the score of document d for a query q by user u should be larger than the normal score(q,d)
• Lucene does not let you specify score(q,d,u), so you have to specify score(qu,d) instead.
• The number of q-u combinations can be enormous, so limit the data to high-order queries or divide fields per user.
[Diagram: Lucene doc d1 with attached q1u1, q2u2; Lucene doc d2 with attached q2u1, q1u2]
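The score(qu,d) workaround can be sketched by fusing the query and the user into a single term and attaching that term to the document. Everything below is a hypothetical illustration: the separator, the boost factor, and the reading of "high-order queries" as "the most frequent queries" are assumptions.

```python
from collections import Counter


def qu_term(query, user):
    """Fuse query and user into one term so a two-argument score(qu, d) sees both."""
    return f"{query}\t{user}"  # hypothetical separator; any unambiguous one works


def build_attachments(access_log, top_queries=100):
    """Map doc id -> set of qu terms, keeping only the most frequent queries.

    access_log is a list of (query, user, clicked_doc_id) triples.
    """
    frequency = Counter(query for query, _, _ in access_log)
    kept = {query for query, _ in frequency.most_common(top_queries)}
    attached = {}
    for query, user, doc_id in access_log:
        if query in kept:
            attached.setdefault(doc_id, set()).add(qu_term(query, user))
    return attached


def personalized_score(base_score, query, user, doc_id, attached, boost=2.0):
    """Boost the normal score when this user's query is attached to the doc."""
    if qu_term(query, user) in attached.get(doc_id, set()):
        return base_score * boost
    return base_score


# Hypothetical log mirroring the diagram: d1 <- q1u1, q2u2 and d2 <- q2u1, q1u2.
log = [("q1", "u1", "d1"), ("q2", "u2", "d1"),
       ("q2", "u1", "d2"), ("q1", "u2", "d2")]
attached = build_attachments(log)
```

With this log, the same query q1 boosts d1 for user u1 but d2 for user u2, which is the behavior the diagram describes.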
Join and Code with Us!
Contact us at
koji at apache dot org
for the details.
Demo or
Q & A
Thank you!
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 

Último (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 

An Introduction to NLP4L

  • 1. An Introduction to NLP4L: Natural Language Processing tool for Apache Lucene. Koji Sekiguchi (@kojisays), Founder & CEO, RONDHUIT
  • 2. My contributions
      • CharFilter framework & MappingCharFilter
      • FastVectorHighlighter
  • 3. Agenda
      • What's NLP4L?
      • How NLP improves search experience
      • Calculate probabilities using ShingleFilter
      • Transliteration (Application for HMM)
      • NLP4L Framework (coming soon)
  • 6. What's NLP4L?
      • GOAL
          • Improve Lucene users' search experience
      • FEATURES
          • Use of Lucene index as a Corpus Database
          • Lucene API Front-end written in Scala
          • NLP4L provides
              • Preprocessors for existing ML tools
              • Provision of ML algorithms and Applications (e.g. Transliteration)
  • 10. Evaluation Measures
      [Figure: diagram of the "target" and "result" sets with regions tp, fp, fn, tn]
      precision = tp / (tp + fp)
      recall = tp / (tp + fn)
  • 11. Recall, Precision [Figure: same diagram and formulas as slide 10]
  • 12. Recall, Precision [Figure: same diagram and formulas as slide 10]
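The two measures can be sketched in a few lines of Python. This is only an illustration of the formulas on slide 10: the document IDs are made up, and the function name `precision_recall` is hypothetical rather than part of NLP4L or Lucene.

```python
def precision_recall(target, result):
    """Return (precision, recall) for a set of relevant docs and a set of returned docs."""
    tp = len(target & result)   # relevant documents that were returned
    fp = len(result - target)   # returned documents that are not relevant
    fn = len(target - result)   # relevant documents that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical document IDs, for illustration only.
target = {1, 2, 3, 4}
result = {3, 4, 5, 6, 7, 8}
precision, recall = precision_recall(target, result)

# Narrowing the result set (as a facet filter would) removes false positives.
filtered = result - {6, 7, 8}
filtered_precision, filtered_recall = precision_recall(target, filtered)

print(f"before filter: precision={precision:.3f}, recall={recall:.3f}")
print(f"after filter:  precision={filtered_precision:.3f}, recall={filtered_recall:.3f}")
```

In this toy case the filter raises precision while recall stays the same, because only non-relevant documents were removed from the result set.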
  • 13-16. Solution
      • n-gram, synonym dictionary, etc. (e.g. Transliteration): recall
      • facet (filter query) (e.g. Named Entity Extraction): precision
      • Ranking Tuning: recall, precision
  • 19. [Figure: "target" and "result" sets] Filter by Gender=Men's, then filter by Price=100-150: gradual precision improvement
  • 20. Structured Documents
      ID | product | price | gender
      1 | CURREN New Men's Date Stainless Steel Military Sport Quartz Wrist Watch | 8.92 | Men's
      2 | Suiksilver The Gamer Watch | 87.99 | Men's
  • 21. Unstructured Documents
      ID | article
      1 | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels.
      2 | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants.
  • 22. Make them Structured
      ID | article | person | org | loc
      1 | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels. | David Cameron | EU | Brussels
      2 | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants. | | EU | UK, Britain
      NEE[1] extracts interesting words.
      [1] Named Entity Extraction
  • 25. Language Model
      • An LM (language model) represents the fluency of language
      • The N-gram model is the most widely used LM
      • Calculation example for 2-gram:
          P(apple | an) ≈ totalTermFreq("word2g", "an apple") / totalTermFreq("word", "an")
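The 2-gram calculation can be sketched in plain Python. This is not the NLP4L or Lucene API: two `Counter` objects stand in for the total term frequencies of the "word" field and the "word2g" (2-word shingle) field, the sentences are the three from the training corpus on slide 26, and lowercasing them is an assumption made here so that "An" and "an" count as the same term.

```python
from collections import Counter

# The three corpus sentences, lowercased and whitespace-tokenized (an assumption).
corpus = [
    "alice ate an apple .",
    "mike likes an orange .",
    "an apple is red .",
]

unigram_freq = Counter()  # stands in for totalTermFreq on the "word" field
bigram_freq = Counter()   # stands in for totalTermFreq on the "word2g" field
for sentence in corpus:
    tokens = sentence.split()
    unigram_freq.update(tokens)
    # Adjacent token pairs joined by a space, like 2-word shingles.
    bigram_freq.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))


def bigram_probability(prev_word, word):
    """Maximum-likelihood estimate of P(word | prev_word)."""
    if unigram_freq[prev_word] == 0:
        return 0.0
    return bigram_freq[f"{prev_word} {word}"] / unigram_freq[prev_word]


p_apple = bigram_probability("an", "apple")
p_orange = bigram_probability("an", "orange")
print(round(p_apple, 3), round(p_orange, 3))
```

Here "an" occurs three times and "an apple" twice, so the estimate for "apple" after "an" is 2/3.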
  • 26. Part-of-Speech Tagging: our corpus for training
      Alice/NNP ate/VB an/AT apple/NNP ./.
      Mike/NNP likes/VB an/AT orange/NNP ./.
      An/AT apple/NNP is/VB red/JJ ./.
      Tags: NNP = proper noun, singular; VB = verb; AT = article; JJ = adjective; . = period
  • 29. Hidden Markov Model: Series of Part-of-Speech
  • 32. HMM state diagram
      Initial probabilities: NNP 0.667, VB 0.0, . 0.0, JJ 0.0, AT 0.333
      Transition probabilities: AT → NNP 1.0; JJ → . 1.0; NNP → . 0.4; NNP → VB 0.6; VB → AT 0.667; VB → JJ 0.333
      Emission probabilities: NNP: alice 0.2, apple 0.4, mike 0.2, orange 0.2; VB: ate 0.333, is 0.333, likes 0.333; AT: an 1.0; JJ: red 1.0; .: . 1.0
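The probabilities in the state diagram can be recovered by counting over the three tagged training sentences. The following is a minimal sketch in plain Python, not NLP4L's own HMM code; the variable names are hypothetical, and words are lowercased to match the diagram's labels.

```python
from collections import Counter, defaultdict

tagged_corpus = [
    "Alice/NNP ate/VB an/AT apple/NNP ./.",
    "Mike/NNP likes/VB an/AT orange/NNP ./.",
    "An/AT apple/NNP is/VB red/JJ ./.",
]

initial = Counter()               # tag of the first token in each sentence
transition = defaultdict(Counter) # previous tag -> next tag counts
emission = defaultdict(Counter)   # tag -> word counts

for sentence in tagged_corpus:
    # Split each token on its last "/" so that "./." becomes (".", ".").
    pairs = [token.rsplit("/", 1) for token in sentence.split()]
    tags = [tag for _, tag in pairs]
    initial[tags[0]] += 1
    for word, tag in pairs:
        emission[tag][word.lower()] += 1
    for prev_tag, next_tag in zip(tags, tags[1:]):
        transition[prev_tag][next_tag] += 1


def normalize(counts):
    """Turn raw counts into relative frequencies rounded to 3 places."""
    total = sum(counts.values())
    return {key: round(value / total, 3) for key, value in counts.items()}


initial_prob = normalize(initial)
transition_prob = {tag: normalize(c) for tag, c in transition.items()}
emission_prob = {tag: normalize(c) for tag, c in emission.items()}

print(initial_prob)
print(transition_prob["NNP"], transition_prob["VB"])
print(emission_prob["NNP"])
```

Tags that never start a sentence (VB, JJ, the period) simply do not appear in `initial_prob`, which corresponds to the 0.0 entries in the diagram.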
  • 34. Transliteration
      Transliteration is the process of transcribing letters or words from one alphabet into another to facilitate comprehension and pronunciation for non-native speakers.
      Examples of transliteration from English to Japanese:
          computer → コンピューター
          server → サーバー
          internet → インターネット
          mouse → マウス
          information → インフォメーション
  • 35. It helps improve recall: you search for the English word "mouse" ...
  • 36. It helps improve recall: ... but you get マウス (=mouse) highlighted in Japanese
  • 37. Training data in NLP4L
      train_data/alpha_katakana.txt:
          academy,アカデミー
          accent,アクセント
          access,アクセス
          accident,アクシデント
          acrobat,アクロバット
          action,アクション
          adapter,アダプター
          africa,アフリカ
          airbus,エアバス
          alaska,アラスカ
          alcohol,アルコール
          allergy,アレルギー
      train_data/alpha_katakana_aligned.txt:
          アaカcaデdeミーmy
          アaクcセceンnトt
          アaクcセceスss
          アaクcシciデdeンnトt
          アaクcロroバッbaトt
          アaクcショtioンn
          アaダdaプpターter
          アaフfリriカca
          エaアirバbuスs
          アaラlaスsカka
          アaルlコーcohoルl
          アaレlleルrギーgy
  • 38. Demo: Transliteration
      nlp4l> :load examples/trans_katakana_alpha.scala
      Input | Prediction | Right Answer
      アルゴリズム | algorism | algorithm
      プログラム | program | (OK)
      ケミカル | chaemmical | chemical
      ダイニング | dining | (OK)
      コミッター | committer | (OK)
      エントリー | entree | entry
  • 39. Gathering loan words (steps ① to ⑥ in the original diagram)
      crawl → gather Katakana-Alphabet string pairs (e.g. アルゴリズム, algorithm) → transliterate the Katakana string (アルゴリズム → algorism) → calculate the edit distance → store the pair of strings in synonyms.txt if the edit distance is small enough
  • 40. Gathering loan words (same flow as slide 39): got 1,800+ records of synonym knowledge from jawiki
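The "calculate edit distance, then store the pair if it is small enough" step can be sketched as follows. This is an illustration in plain Python rather than NLP4L code: the Levenshtein implementation is the textbook dynamic-programming one, while the function `is_synonym_candidate`, the threshold `max_distance=2`, and the comma-separated output line are assumptions, since the slides do not state the actual threshold or file format.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def is_synonym_candidate(predicted, crawled, max_distance=2):
    """Keep a crawled pair when the transliteration is close to the crawled word.

    max_distance is a hypothetical threshold chosen for this sketch.
    """
    return edit_distance(predicted, crawled) <= max_distance


# (katakana, transliteration prediction, crawled alphabet string)
crawled_pairs = [
    ("アルゴリズム", "algorism", "algorithm"),  # prediction is close: keep
    ("アルゴリズム", "algorism", "alcohol"),    # made-up bad pairing: drop
]
synonym_lines = [
    f"{katakana},{crawled}"
    for katakana, predicted, crawled in crawled_pairs
    if is_synonym_candidate(predicted, crawled)
]
print(synonym_lines)
```

The design point is that the transliteration model only has to be approximately right: a near-miss such as "algorism" is still close enough to confirm that a crawled pair really is a loan word and its source word.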
  • 42. NLP4L Framework
      • A pluggable framework that improves the search experience (mainly for Lucene-based search systems).
      • Reference implementations of plug-ins and corpora are provided.
      • Uses NLP/ML technologies to output models, dictionaries and indexes.
      • Since NLP/ML are not perfect, an interface that lets users examine the output dictionaries themselves is provided as well.
  • 46. [Figure: NLP4L Framework architecture diagram]
      • Data Source: Corpus (Text data, Lucene index), Query Log, Access Log
      • TermExtractor, Transliteration, NEE, Classification, Document Vectors, Language Detection, Learning to Rank, Personalized Search
      • Dictionaries (with maintenance): Suggestion (auto complete), Did you mean?, synonyms.txt, userdic.txt, keyword attachment
      • Model files, Tagged Corpus, Document Vectors
      • Solr, ES, Mahout, Spark
  • 47. Keyword Attachment
      • Keyword attachment is a general format that enables the following functions:
          • Learning to Rank
          • Personalized Search
          • Named Entity Extraction
          • Document Classification
      • [Figure: a keyword is attached to a Lucene doc to increase its boost]
  • 48. Before Learning to Rank [Figure: "target" and "result" sets with result ranks 1, 2, 3, …, 50, 100, 500, …]
  • 49. After Learning to Rank [Figure: "target" and "result" sets with result ranks 1, 2, 3, …, 50, 100, 500, …]
  • 50. Learning to Rank
      • The program learns, from the access log and other sources, that the score of document d for a query q should be larger than the normal score(q,d).
      • [Figure: Lucene doc d with attached queries q, q, …]
      • https://en.wikipedia.org/wiki/Learning_to_rank
  • 51. Personalized Search [Figure: "target" and "result" sets with result ranks 1, 2, 3, …, 50, 100, 500, …; q=apple, computer …]
  • 52. Personalized Search [Figure: "target" and "result" sets with result ranks 50, 100, 500, …, 1, 2, 3, …; q=apple, fruit …]
  • 53. Personalized Search
      • The program learns, from the access log and other sources, that the score of document d for a query q by user u should be larger than the normal score(q,d).
      • Since Lucene does not let you specify score(q,d,u), you have to specify score(qu,d) instead.
      • Because the number of q-u combinations can be enormous, limit the data to high-order queries or divide fields by user.
      • [Figure: Lucene doc d1 with q1u1, q2u2 attached; Lucene doc d2 with q2u1, q1u2 attached]
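The score(qu,d) idea on slide 53 can be sketched with a composite keyword that folds the user into the query. Everything below is a hypothetical illustration in plain Python, not the NLP4L Framework (which the slides describe as coming soon) and not a Lucene API: the `attachment_key` format, the in-memory `attachments` map, and the multiplicative `boost` are all assumptions.

```python
def attachment_key(query, user):
    """Fold the user into the keyword so score(q, d, u) can be expressed as score(qu, d)."""
    return f"{query}|{user}"


# Keywords attached to each document, mirroring the figure on slide 53:
# d1 carries q1u1 and q2u2, d2 carries q2u1 and q1u2.
attachments = {
    "d1": {attachment_key("q1", "u1"), attachment_key("q2", "u2")},
    "d2": {attachment_key("q2", "u1"), attachment_key("q1", "u2")},
}


def personalized_score(base_score, doc_id, query, user, boost=2.0):
    """Raise the normal score when the query-user keyword is attached to the document.

    The multiplicative boost of 2.0 is an arbitrary value for this sketch.
    """
    if attachment_key(query, user) in attachments.get(doc_id, set()):
        return base_score * boost
    return base_score


# The same query q1 ranks d1 higher for user u1 and d2 higher for user u2.
scores_u1 = {d: personalized_score(1.0, d, "q1", "u1") for d in attachments}
scores_u2 = {d: personalized_score(1.0, d, "q1", "u2") for d in attachments}
print(scores_u1, scores_u2)
```

Because every query-user pair becomes its own keyword, the number of attached keywords grows with the product of queries and users, which is why the slide suggests limiting the data or splitting fields per user.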
  • 54. Join and Code with Us! Contact us at koji at apache dot org for the details.
  • 55. Demo or Q & A. Thank you!