Presented at Berlin Buzzwords 2019
https://berlinbuzzwords.de/19/session/building-enterprise-natural-language-search-engine-elasticsearch-and-facebooks-drqa
Building an enterprise Natural Language Search Engine with ElasticSearch and Facebook’s DrQA
1. Building an enterprise Natural Language Search Engine with ElasticSearch and Facebook’s DrQA
Louis Baligand, Debmalya Biswas
Berlin Buzzwords, 17 June 2019
Enterprise Architecture
2. PMI INFORMATION SERVICES 2016
About
https://github.com/philipmorrisintl
Debmalya Biswas
Louis Baligand
3. Forrester defines cognitive search and knowledge discovery solutions as
“a new generation of enterprise search solutions that employ AI
technologies such as natural language processing and machine learning
to ingest, understand, organize, and query digital content from multiple
data sources.”
4. “The average interaction worker spends
[...] nearly 20 percent (of the workweek)
looking for internal information.”
- MGI Report, 2012.
Half (54%) of global information workers said, "My work
gets interrupted because I can't find or get access to
information I need to complete my tasks" a few times a
month or more often. -Forrester Data Global Business
Technographics Devices And Security Workforce Survey, 2016.
5. Enterprise Search vs. Web Search
Enterprise Search                      | Web Search
Multiple content types                 | Single source (web pages)
Limited tagging/metadata management    | Large investments in SEO (*)
Role-based content trimming            | No visibility restrictions (public pages)
Small amount of content                | Enormous amount of content
No team in charge of Search Experience | Search experience as core business
Employees are the end-users            | WWW users
(*): Search Engine Optimization
7. Chatbots and Natural Language Search
Natural Language Search (Neural Networks)
Works on documents.
Users can ask any question from the documents.
Both the documents and questions are passed through the same Neural Network, producing the matching answer.
[Diagram: Document database -> Neural Network (Offline); <Q> -> Neural Network (Real-time) -> <A>]
Intent based Chatbots (Statistical Methods)
Requires Q&A knowledge.
Able to scale with respect to question variants by applying statistical clustering methods (e.g. tf-idf, Bag-of-Words) to cluster question variants into ‘intents’.
“How do I replace the heating component of my iQoS?” = “Tell me how to change the heating component of my iQoS”
<Q> how replace heating component iQoS
<Q> how change heating component iQoS
Same Intent: #repairIQOS
(Rules based) FAQs
Works only for specific hardcoded questions.
The only way to scale with respect to question variants is to extend the knowledge base by manually adding variants of a question.
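The intent-based approach above can be sketched in a few lines. This is a minimal illustration, not the method used in the talk: it swaps clustering for a simpler nearest-neighbour match, weighting bag-of-words vectors with tf-idf and picking the closest known question by cosine similarity. The stop-word list and the second intent (#chargeIQOS) are made up for the example.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, just large enough for this example.
STOP_WORDS = {"do", "i", "the", "of", "my", "tell", "me", "to"}

# Known question variants and their intents. "#chargeIQOS" is made up.
KNOWN_QUESTIONS = [
    ("How do I replace the heating component of my iQoS?", "#repairIQOS"),
    ("How do I charge my iQoS?", "#chargeIQOS"),
]


def tokenize(text):
    """Lower-case, keep alphanumeric runs, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def build_idf(token_lists):
    """Smoothed inverse document frequency over the known questions."""
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(tokens, idf):
    """Bag-of-words counts weighted by idf; terms never seen before are ignored."""
    return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def match_intent(question, known=KNOWN_QUESTIONS):
    """Return (similarity, intent) of the closest known question."""
    token_lists = [tokenize(q) for q, _ in known]
    idf = build_idf(token_lists)
    query = tfidf_vector(tokenize(question), idf)
    return max(
        (cosine(query, tfidf_vector(tokens, idf)), intent)
        for tokens, (_, intent) in zip(token_lists, known)
    )
```

Calling match_intent("Tell me how to change the heating component of my iQoS") maps the rephrased question to #repairIQOS, because the shared terms (heating, component) outweigh the one word that differs.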
8. Chatbots and Natural Language Search (2)
3-tier strategy:
• A Chatbot with its pre-defined Q&A set remains the entry point; think of it as the 1st line of defense.
• If the bot encounters a user query which cannot be mapped to one of its pre-configured intents, it performs an NLS over its KB. This is the 2nd line of defense.
• If the user is not satisfied even with the search results, plan for a final handover to a live agent.
Ref: “Chatbots & Natural Language Search: 2 sides of the same coin?” (link)
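The 3-tier strategy can be sketched as a small routing function. Everything here is an assumption made for illustration: the Reply type, the 0.7 confidence threshold, the callback signatures, and the choice to hand over immediately when the search returns nothing.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Reply:
    tier: str      # "chatbot", "search" or "live_agent"
    payload: Any   # canned answer, list of search hits, or None


def answer(query, match_intent, search, threshold=0.7):
    """Route a query through tier 1 (intents) and tier 2 (natural language search).

    match_intent(query) -> (score, canned_answer or None)   # hypothetical callback
    search(query)       -> list of hits                     # hypothetical callback
    """
    score, canned_answer = match_intent(query)
    if canned_answer is not None and score >= threshold:
        return Reply("chatbot", canned_answer)
    hits = search(query)
    if hits:
        return Reply("search", hits)
    # Assumption: with no hits at all, go straight to the live agent.
    return Reply("live_agent", None)


def hand_over_if_unsatisfied(reply, user_satisfied):
    """Tier 3: escalate to a live agent when search results did not help."""
    if reply.tier == "search" and not user_satisfied:
        return Reply("live_agent", None)
    return reply
```

Keeping the intent matcher and the search backend as plain callbacks means either tier can be replaced without touching the routing logic.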
9. Positioning vs e-commerce search
• End-user searching for products (not answers)
• Filter-oriented
• Ratings, reviews
10. Philip Morris’ Use case: Operator Trainings
• Hundreds to thousands of operators
• Long manuals with specific terminology
• One minute of machine downtime leaves 20,000 cigarettes unmade
• Typical full-text search (Boolean search, no relevancy score)
• Document Management System, manually classified
• On-boarding difficulty
11. Example of fine-grained results
Q. How many knives are there on the drums?
12. Question Answering?
• SQuAD dataset: a reference in Question Answering
• 100,000+ Q&A pairs on Wikipedia articles
• State of the art is beating human performance
13. DrQA Overview
• Facebook AI Research, ACL 2017, “Reading Wikipedia to Answer Open-Domain Questions”.
• Open source, BSD License: https://github.com/facebookresearch/DrQA
• Pre-trained model available
15. DrQA is easy to use on your own corpus!
$ python build_db.py /path/to/data /path/to/saved/db.db
$ python build_tfidf.py /path/to/doc/db /path/to/output/dir
[Figure: TF-IDF matrix, Terms × Docs]
$ python interactive.py --reader-model multitask.mdl --retriever-model path/to/tfidf --doc-db path/to/saved/db.db
>>> process('What is the answer to life, the universe, and everything?')
Top Predictions:
+------+--------+---------------------------------------------------+--------------+-----------+
| Rank | Answer | Doc                                               | Answer Score | Doc Score |
+------+--------+---------------------------------------------------+--------------+-----------+
|  1   |   42   | Phrases from The Hitchhiker's Guide to the Galaxy |    47242     |  141.26   |
+------+--------+---------------------------------------------------+--------------+-----------+
Pre-trained model open sourced
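Before the build_db.py step, the corpus has to be in a form DrQA can read. As far as the DrQA retriever documentation describes it, that is one JSON object per line with "id" and "text" fields. The helper below is a minimal sketch for producing such a file; the function name and the sample documents in the usage note are made up.

```python
import json
from pathlib import Path


def write_corpus(docs, out_path):
    """Write (doc_id, text) pairs as one JSON object per line.

    Assumption: the "id"/"text" field names follow the input format that the
    DrQA retriever scripts document for build_db.py.
    """
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as f:
        for doc_id, text in docs:
            f.write(json.dumps({"id": doc_id, "text": text}, ensure_ascii=False) + "\n")
    return out_path
```

For example, write_corpus([("manual-1", "text of the first manual")], "corpus.json") produces a file that can then be passed as the data path in the first command above.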
16. DrQA to answer Operator’s questions?
• Java toolkit to extract text + metadata from DOCX, PPT, XLS, PDF, JPEG, etc…
• Apache Software Foundation
• OCR
17. DrQA to answer Operator’s questions?
https://github.com/facebookresearch/DrQA
P@5: 76%
• Not a voice assistant
• End user needs at least ~95%
• Full control on the retriever
• First stage to prioritize
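A sketch of how a P@5 figure like the one above could be computed, assuming it is measured the way the DrQA paper reports retriever performance: the share of questions for which an accepted answer string appears in at least one of the top 5 retrieved documents. The slides do not spell out the exact definition, so treat this as one plausible reading. Function names and the data layout are hypothetical.

```python
def hit_at_k(ranked_docs, answers, k=5):
    """True if any of the top-k document texts contains any accepted answer."""
    top = [doc.lower() for doc in ranked_docs[:k]]
    return any(answer.lower() in doc for answer in answers for doc in top)


def precision_at_k(examples, k=5):
    """Fraction of questions with a hit in the top k.

    examples: list of (ranked_docs, answers) pairs, where ranked_docs is the
    retriever output (best first) and answers is a list of accepted strings.
    """
    if not examples:
        return 0.0
    hits = sum(hit_at_k(docs, answers, k) for docs, answers in examples)
    return hits / len(examples)
```

Note that this is a per-question hit rate, not the classic information-retrieval precision@k, which would average the share of relevant documents within the top k.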
18. Introducing Elasticsearch
• Open source, distributed
• Highly scalable
• RESTful API on top of Lucene capabilities
• Support for full-text search (best of breed)
• Easy to configure + extend
• Seamlessly manages conflicts
• Active community & popular
20. Integrating Elasticsearch to DrQA’s pipeline
>>> from drqa.pipeline import DrQA
>>> from drqa.retriever import ElasticDocRanker
>>> model = DrQA(reader_model='reader_model.mdl',
...              ranker_config={'class': ElasticDocRanker,
...                             'options': {'elastic_url': '127.0.0.1:9200',
...                                         'elastic_index': 'mini', 'elastic_fields': 'content',
...                                         'elastic_field_doc_name': ['file', 'filename'],
...                                         'elastic_field_content': 'content'}})
>>> model.process('How the tensioning of the V-belts should be done?')
Directly point to your server hosting Elastic.
Enables search in any field, e.g. uni-grams, bi-grams, title, metadata, etc…
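To illustrate what searching across several fields means on the Elasticsearch side, here is a minimal sketch that builds a multi_match request body for the top 5 documents. It is not how ElasticDocRanker is implemented internally; the helper name and the field names in the usage note are assumptions.

```python
import json


def build_search_body(question, fields, size=5):
    """Build an Elasticsearch search body that matches the question on several fields.

    A field can carry a boost with the caret syntax, e.g. "title^2".
    """
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": question,
                "fields": list(fields),
            }
        },
    }


def to_json(body):
    """Serialize the body as it would be sent in the HTTP request."""
    return json.dumps(body, ensure_ascii=False)
```

With hypothetical fields such as ["content", "content.bigram", "title^2"], the resulting JSON would be sent as the body of a search request to the index's _search endpoint (here, the 'mini' index).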
21. The pipeline performance
P@5: 76% → 84%
P@5 ref.: 78% (DrQA)
F1 score: 42%
F1 score ref.: 79% (DrQA)
• DrQA span ± 10 tokens: 94% of 1st results contain the true answer
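The two reader metrics above can be sketched as follows, under stated assumptions. The F1 here is a simplified token-overlap F1 in the style of SQuAD evaluation (whitespace tokens, lower-cased, without the punctuation and article normalization of the official script). The window check is one plausible reading of "span ± 10 tokens": widen the predicted span by 10 tokens on each side and test whether the true answer falls inside. Function names are hypothetical.

```python
from collections import Counter


def f1_score(prediction, truth):
    """Token-overlap F1 between a predicted answer and the true answer."""
    pred = prediction.lower().split()
    gold = truth.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def window_contains_answer(context_tokens, start, end, truth, margin=10):
    """Check whether the true answer lies in the predicted span widened by `margin`.

    start and end are inclusive token indices of the predicted span.
    """
    lo = max(0, start - margin)
    hi = min(len(context_tokens), end + 1 + margin)
    window = " ".join(context_tokens[lo:hi]).lower()
    return truth.lower() in window
```

A gap between a low F1 and a high window hit rate suggests the reader lands in the right place but picks the span boundaries imprecisely, which matters less when the surrounding passage is shown to the user.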
22. Takeaways
Address pain points by combining Document Retrieval with Question Answering
If not answered, it will provide much more granular insights into the data
User elicitation & user experience: a top-down approach
End user does not know what to ask
23. Future work: Extend pipeline with BERT*
A general-purpose architecture to train models for multiple NLP tasks (sentiment analysis, etc…)
State of the art for SQuAD
Open source, published in Oct. 2018 by Google AI Research
High memory required: GPU with at least 12GB of RAM (Base model)
Enables multi-language queries
• Add one layer to compute Pstart(“token”) & Pend(“token”) for each token
• Find the best pair by maximizing Pstart(“token1”) * Pend(“token2”)
*https://arxiv.org/abs/1810.04805, https://github.com/google-research/bert
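The span selection step can be sketched as a brute-force search over token pairs. Two constraints here are assumptions not stated on the slide: the end token may not come before the start token, and the span length is capped (the max_len value of 15 is arbitrary).

```python
def best_span(p_start, p_end, max_len=15):
    """Return (start, end, score) maximizing p_start[start] * p_end[end].

    p_start and p_end are per-token probabilities of equal length.
    Only spans with start <= end and at most max_len tokens are considered.
    """
    if len(p_start) != len(p_end):
        raise ValueError("p_start and p_end must have the same length")
    best = (0, 0, -1.0)
    for i, ps in enumerate(p_start):
        for j in range(i, min(i + max_len, len(p_end))):
            score = ps * p_end[j]
            if score > best[2]:
                best = (i, j, score)
    return best
```

Without the ordering constraint, the highest product could pair a start token with an end token that precedes it, which is not a valid answer span.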
This includes PMI taking part in the open source community. Check out our github to see the highly popular repos we have contributed to.
According to a report from McKinsey, workers spend 20 percent of their time looking for internal information ...
Content
The amount of content stored in PMI is tiny compared to what Google crawls every day. This should therefore be an advantage for PMI Enterprise Search.
Traffic
Handling more traffic is more challenging from an infrastructure standpoint, but it also yields more search-query data, which can be used to train machine-learning algorithms that optimize the search experience (e.g. suggesting or auto-completing queries and boosting the most relevant pages). Google holds a huge advantage on that front.
Content Sources
At PMI, content is fragmented and stored in many locations. This makes content crawlers difficult to implement, because each source exposes content in a different way (APIs, flat files plus a DB for metadata, etc.). On the Google side, only one type of content is actually crawled: web pages (including all attached images etc.). The scale is larger but the variety is smaller.
Information Management
At PMI, there are currently no strong practices around information architecture and management. Therefore, all types of information are mixed, irrespective of their business value for PMI and their relevance for the users. On the Google side, companies invest large amounts of resources into SEO to gain proper visibility in search results.
Security
In an enterprise, access to documents and information is controlled to ensure compliance and protection of sensitive data. This poses a challenge when crawling sources that use ACLs to control access, since those ACLs must be imported to filter out search results that some users should not see. Google only works with public content, which completely removes that constraint.
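The ACL-based filtering described above (often called security trimming) can be sketched as a post-filter on search hits. The "allowed_groups" field and the group names are hypothetical; in practice the ACLs would be imported from each source at crawl time, and many search engines can apply the same filter inside the query instead.

```python
def trim_results(hits, user_groups):
    """Keep only hits whose imported ACL shares at least one group with the user.

    Each hit is a dict; "allowed_groups" is a hypothetical field holding the
    groups copied from the source system. Hits without an ACL are hidden
    (deny by default).
    """
    groups = set(user_groups)
    return [hit for hit in hits if groups & set(hit.get("allowed_groups", ()))]
```

Denying by default is the safer choice here: a document whose ACL failed to import stays invisible rather than leaking to every user.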
Search Experience
Google’s core business relies on effective search and targeted advertisements. They hire the best engineers to work on AI and continuously improve the quality of the search results. At PMI, there is no such team (even at small scale) tasked with monitoring and continuously improving search relevance. Also, from a skills perspective, it is unrealistic to expect to come even close to Google.