Building an enterprise Natural Language Search
Engine with ElasticSearch and Facebook’s DrQA
Louis Baligand, Debmalya Biswas
Berlin Buzzwords, 17 June 2019
Enterprise Architecture
PMI INFORMATION SERVICES 2016
About
https://github.com/philipmorrisintl
Debmalya Biswas
Louis Baligand
“Forrester defines cognitive search and knowledge discovery solutions as a new generation of enterprise search solutions that employ AI technologies such as natural language processing and machine learning to ingest, understand, organize, and query digital content from multiple data sources.”
“The average interaction worker spends [...] nearly 20 percent (of the workweek) looking for internal information.”
- MGI Report, 2012.
Half (54%) of global information workers said, "My work
gets interrupted because I can't find or get access to
information I need to complete my tasks" a few times a
month or more often. -Forrester Data Global Business
Technographics Devices And Security Workforce Survey, 2016.
Enterprise Search vs. Web Search
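+----------------------------------------+-------------------------------------------+
| Enterprise Search                      | Web Search                                |
+----------------------------------------+-------------------------------------------+
| Multiple content types                 | Single source (web pages)                 |
| Limited tagging/metadata management    | Large investments in SEO (*)              |
| Role-based content trimming            | No visibility restrictions (public pages) |
| Small amount of content                | Enormous amount of content                |
| No team in charge of search experience | Search experience as core business        |
| Employees are the end users            | WWW users                                 |
+----------------------------------------+-------------------------------------------+
(*) SEO: Search Engine Optimization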
Natural Language Search (NLS)
Knowledge Graph
Chatbots and Natural Language Search
Natural Language Search (Neural Networks)
• Works on documents.
• Users can ask any question from the documents.
• Both the documents and questions are passed through the same neural network, producing the matching answer.
Intent-based Chatbots (Statistical Methods)
• Requires Q&A knowledge.
• Able to scale with respect to question variants by applying statistical clustering methods (e.g. tf-idf, bag-of-words) to cluster question variants into ‘intents’.
(Rules-based) FAQs
• Works only for specific hardcoded questions.
• The only way to scale with respect to question variants is to extend the knowledge base by manually adding variants of a question.
Example: “How do I replace the heating component of my iQoS?” = “Tell me how to change the heating component of my iQoS”
<Q> how replace heating component iQoS
<Q> how change heating component iQoS
Same intent: #repairIQOS
[Figure: the document database is passed through the neural network offline; the question <Q> is passed through the neural network in real time, producing the answer <A>.]
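The intent example above can be sketched with a plain bag-of-words comparison. This is only an illustration, not the chatbot's actual method: the stop-word list, the `keywords` and `cosine` helpers, and any similarity threshold are assumptions made for this sketch.

```python
import math
from collections import Counter

# Hypothetical stop-word list, chosen only to reproduce the keyword sets on the slide.
STOPWORDS = {"do", "i", "the", "of", "my", "tell", "me", "to"}

def keywords(question):
    """Lowercase the question, strip punctuation, and drop stop words."""
    tokens = [t.strip("?.,!").lower() for t in question.split()]
    return Counter(t for t in tokens if t and t not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

q1 = "How do I replace the heating component of my iQoS?"
q2 = "Tell me how to change the heating component of my iQoS"
similarity = cosine(keywords(q1), keywords(q2))
print(sorted(keywords(q1)), sorted(keywords(q2)), round(similarity, 2))
```

The two variants share four of their five keywords, so a clustering step with a suitably chosen threshold would place them in the same intent.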
Chatbots and Natural Language Search (2)
3-tier strategy:
• A chatbot with its pre-defined Q&A set remains the entry point: think of it as the 1st line of defense.
• If the bot encounters a user query which cannot be mapped to one of its pre-configured intents, it performs an NLS over its KB. This is the 2nd line of defense.
• If the user is not satisfied even with the search results, plan for a final handover to a live agent.
Ref: “Chatbots & Natural Language Search: 2 sides of the same coin?” (link)
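The 3-tier strategy can be sketched as a simple fallback chain. Everything here is hypothetical: `route`, `INTENTS`, `KB` and `toy_search` are illustrative stand-ins for a real intent classifier and search backend, and the sketch hands over to an agent when the search finds nothing, whereas the slide triggers the handover on user dissatisfaction.

```python
def route(query, intents, search):
    """Return (tier, response) for a user query using a 3-tier fallback."""
    key = query.strip().lower()
    if key in intents:                      # 1st line: pre-configured intents
        return "chatbot", intents[key]
    hit = search(query)                     # 2nd line: natural language search over the KB
    if hit is not None:
        return "search", hit
    return "agent", "Handing over to a live agent."  # 3rd line: human handover

# Made-up data for illustration only.
INTENTS = {"what are the opening hours?": "The help desk is open 9:00-17:00."}
KB = ["The V-belts are tensioned with the adjustment screw."]

def toy_search(query):
    """Return the first passage sharing at least two words with the query."""
    words = {w.strip("?.,").lower() for w in query.split()}
    for passage in KB:
        passage_words = {w.strip("?.,").lower() for w in passage.split()}
        if len(words & passage_words) >= 2:
            return passage
    return None

for q in ["What are the opening hours?", "How are the V-belts tensioned?", "Can I get a refund?"]:
    print(q, "->", route(q, INTENTS, toy_search)[0])
```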
Positioning vs e-commerce search
• End users search for products (not answers)
• Filter-oriented
• Ratings, reviews
Philip Morris’ use case: operator training
• Hundreds to thousands of operators
• Long manuals with specific terminology
• One minute of machine downtime means 20,000 cigarettes not produced
• Typical full-text search (Boolean search, no relevancy score)
• Document management system, manually classified
• On-boarding difficulty
Example of fine-grained results
Q. How many knives are there on the drums?
Question Answering?
• SQuAD dataset: a reference in Question Answering
• 100,000+ Q&A on Wikipedia articles
• State of the art is beating human performance
DrQA Overview
• Facebook AI Research, ACL 2017, “Reading Wikipedia to Answer Open-Domain Questions”
• Open source, BSD License
https://github.com/facebookresearch/DrQA
• Pre-trained model available
DrQA Overview
[Figure: DrQA pipeline, with a bigram TF-IDF document retriever followed by a bidirectional RNN document reader]
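The retriever stage in the figure can be illustrated with a small bigram TF-IDF ranker. This is a simplified sketch, not DrQA's implementation: the weighting formula, the `terms`, `build_index` and `rank` helpers, and the three toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter

def terms(text):
    """Lowercased unigrams plus bigrams of a whitespace-tokenized text."""
    words = text.lower().split()
    return words + [" ".join(pair) for pair in zip(words, words[1:])]

def build_index(docs):
    """Return per-document term counts and a smoothed IDF weight per term."""
    counts = [Counter(terms(doc)) for doc in docs]
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log((len(docs) + 1) / (df + 1)) + 1.0 for t, df in doc_freq.items()}
    return counts, idf

def rank(query, counts, idf, k=5):
    """Return indices of up to k matching documents, best TF-IDF dot product first."""
    q = Counter(terms(query))
    scores = []
    for i, c in enumerate(counts):
        score = sum(q[t] * c[t] * idf[t] ** 2 for t in q if t in idf)
        scores.append((score, i))
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [i for score, i in scores[:k] if score > 0]

# Made-up documents for illustration only.
docs = [
    "the drum carries eight knives",
    "tension the v belts every week",
    "clean the drum housing daily",
]
counts, idf = build_index(docs)
print(rank("v belts tension", counts, idf))
```

Including bigrams such as "v belts" rewards documents that preserve local word order, which plain unigram matching cannot do.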
DrQA is easy to use on your own corpus!
$ python build_db.py /path/to/data /path/to/saved/db.db
$ python build_tfidf.py /path/to/doc/db /path/to/output/dir
[Figure: TF-IDF matrix of terms × docs, with example weights 0.06, 0.02, 0.03, 0.08]
$ python interactive.py --reader-model multitask.mdl --retriever-model path/to/tfidf --doc-db path/to/saved/db.db
>>> process('What is the answer to life, the universe, and everything?')
Top Predictions:
+------+--------+---------------------------------------------------+--------------+-----------+
| Rank | Answer | Doc                                               | Answer Score | Doc Score |
+------+--------+---------------------------------------------------+--------------+-----------+
| 1    | 42     | Phrases from The Hitchhiker's Guide to the Galaxy | 47242        | 141.26    |
+------+--------+---------------------------------------------------+--------------+-----------+
Pre-trained model open sourced
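Before running the database build step, the corpus has to be in a format the script can read. As far as the DrQA README describes it, that is one JSON object per line with `id` and `text` fields; the sketch below writes such a file. The `write_corpus` helper, the document ids and the texts are made up for illustration.

```python
import json
import os
import tempfile

def write_corpus(docs, path):
    """Write documents as JSON lines with 'id' and 'text' fields."""
    with open(path, "w", encoding="utf-8") as f:
        for doc_id, text in docs.items():
            f.write(json.dumps({"id": doc_id, "text": text}) + "\n")

# Hypothetical documents standing in for extracted manual pages.
docs = {
    "manual-1": "The drum carries eight knives.",
    "manual-2": "Tension the V-belts every week.",
}
corpus_path = os.path.join(tempfile.mkdtemp(), "corpus.json")
write_corpus(docs, corpus_path)

# Read the file back to confirm each line is a standalone JSON document.
with open(corpus_path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
print(records)
```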
DrQA to answer Operator’s questions?
• Java toolkit to extract text + metadata from DOCX, PPT, XLS, PDF, JPEG, etc…
• Apache Software Foundation
• OCR
DrQA to answer Operator’s questions?
P@5: 76%
• Not a voice assistant
• End user needs at least ~95%
• Full control over the retriever
• First stage to prioritize
Introducing Elasticsearch
• Open source, distributed
• Highly scalable
• RESTful API on top of Lucene capabilities
• Support for full-text search (best of breed)
• Easy to configure + extend
• Seamlessly manages conflicts
• Active community & popular
Integrating Elasticsearch to DrQA’s pipeline
[Figure: documents indexed in Elasticsearch, which serves as the retriever in DrQA’s pipeline]
Integrating Elasticsearch to DrQA’s pipeline
>>> from drqa.pipeline import DrQA
>>> from drqa.retriever import ElasticDocRanker
>>> model = DrQA(reader_model='reader_model.mdl',
...              ranker_config={'class': ElasticDocRanker,
...                             'options': {'elastic_url': '127.0.0.1:9200',
...                                         'elastic_index': 'mini',
...                                         'elastic_fields': 'content',
...                                         'elastic_field_doc_name': ['file', 'filename'],
...                                         'elastic_field_content': 'content'}})
>>> model.process('How the tensioning of the V-belts should be done?')
• Point directly to the server hosting Elasticsearch
• Search in any field, e.g. uni-grams, bi-grams, title, metadata, etc.
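Searching across several fields can be expressed with Elasticsearch's `multi_match` query. The sketch below only builds the request body as a Python dictionary; the `build_search_body` helper and the field names are hypothetical, and it is not claimed to be the exact query the ranker sends.

```python
import json

def build_search_body(question, fields, k=5):
    """Build an Elasticsearch request body that matches the question against several fields."""
    return {
        "size": k,  # number of documents to return
        "query": {
            "multi_match": {
                "query": question,
                "fields": list(fields),
            }
        },
    }

# "title^2" boosts matches in a (hypothetical) title field by a factor of 2.
body = build_search_body(
    "How the tensioning of the V-belts should be done?",
    ["content", "title^2"],
)
print(json.dumps(body, indent=2))
```

Keeping the query in a plain dictionary makes it easy to experiment with fields and boosts before wiring it into a client call.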
The pipeline performance
P@5: 76% → 84%
P@5 ref.: 78% (DrQA)
F1 score: 42%
F1 score ref.: 79% (DrQA)
• DrQA span ± 10 tokens: 94% of 1st results contain the true answer
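The two metrics on this slide can be sketched as follows. The talk does not state exactly how they were computed, so this assumes P@5 is the share of questions whose gold document appears in the top 5 retrieved documents, and F1 is the SQuAD-style token overlap between predicted and true answer. The function names and sample values are made up for illustration.

```python
from collections import Counter

def precision_at_k(ranked_ids_per_question, gold_ids, k=5):
    """Share of questions whose gold document id is among the top-k retrieved ids."""
    hits = sum(
        1 for ranked, gold in zip(ranked_ids_per_question, gold_ids) if gold in ranked[:k]
    )
    return hits / len(gold_ids)

def f1_score(prediction, truth):
    """Token-level F1 between a predicted answer and the true answer."""
    pred_tokens = prediction.lower().split()
    true_tokens = truth.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

p_at_5 = precision_at_k([["a", "b"], ["c", "d"]], ["b", "x"], k=5)
f1 = f1_score("eight knives", "eight")
print(p_at_5, round(f1, 3))
```

A prediction that contains the true answer plus extra tokens is penalized by F1, which is one reason a span-based measure (such as the ± 10 token window above) can look much better than F1.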
Takeaways
• Address pain points by combining Document Retrieval with Question Answering
• Even if a question is not answered, the pipeline provides much more granular insights into the data
• User elicitation & user experience: a top-down approach
• End user does not know what to ask
Future work – Extend pipeline with BERT*
• A general-purpose architecture to train models for multiple NLP tasks (sentiment analysis, etc.)
• State of the art for SQuAD
• Open source, published in Oct. 2018 by Google AI Research
• High memory required: GPU with at least 12GB of RAM (Base model)
• Enables multi-language queries
*https://arxiv.org/abs/1810.04805, https://github.com/google-research/bert
• Add one layer to compute Pstart(“token”) & Pend(“token”) for each token
• Find the best pair by maximizing Pstart(“token1”) * Pend(“token2”)
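The span selection in the last two bullets can be sketched as an exhaustive search over start/end pairs. The `best_span` function, the `max_len` limit and the probabilities are assumptions for illustration; the sketch also assumes the end token may not precede the start token.

```python
def best_span(p_start, p_end, max_len=15):
    """Return ((start, end), score) maximizing p_start[start] * p_end[end] with start <= end."""
    best, best_score = (0, 0), -1.0
    for i, ps in enumerate(p_start):
        # Only consider ends at or after the start, within max_len tokens.
        for j in range(i, min(len(p_end), i + max_len)):
            score = ps * p_end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score

# Made-up per-token probabilities for a five-token passage.
tokens = ["the", "drum", "has", "eight", "knives"]
p_start = [0.05, 0.05, 0.10, 0.70, 0.10]
p_end = [0.60, 0.05, 0.05, 0.10, 0.20]

(start, end), score = best_span(p_start, p_end)
print(tokens[start:end + 1], score)
```

Note that the highest end probability sits on the first token, but that pairing is rejected because it would end before it starts; the constraint is what makes the result a valid answer span.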
Thank you.

 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024Mind IT Systems
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...SelfMade bd
 
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verifiedDelhi Call girls
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesVictorSzoltysek
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...kalichargn70th171
 

Último (20)

%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
Pharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodologyPharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodology
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
 

Building an enterprise Natural Language Search Engine with ElasticSearch and Facebook’s DrQA

  • 1. Building an enterprise Natural Language Search Engine with ElasticSearch and Facebook’s DrQA. Louis Baligand, Debmalya Biswas. Berlin Buzzwords, 17 June 2019. Enterprise Architecture.
  • 2. PMI INFORMATION SERVICES 2016. About: https://github.com/philipmorrisintl. Debmalya Biswas, Louis Baligand.
  • 3. Forrester defines cognitive search and knowledge discovery solutions as “a new generation of enterprise search solutions that employ AI technologies such as natural language processing and machine learning to ingest, understand, organize, and query digital content from multiple data sources.”
  • 4. “The average interaction worker spends [...] nearly 20 percent (of the workweek) looking for internal information.” (MGI Report, 2012)
      Half (54%) of global information workers said, "My work gets interrupted because I can't find or get access to information I need to complete my tasks" a few times a month or more often. (Forrester Data Global Business Technographics Devices And Security Workforce Survey, 2016)
  • 5. Enterprise Search vs. Web Search
      Enterprise Search | Web Search
      Multiple content types | Single source (web pages)
      Limited tagging/metadata management | Large investments in SEO (*)
      Role-based content trimming | No visibility restrictions (public pages)
      Small amount of content | Enormous amount of content
      No team in charge of search experience | Search experience as core business
      Employees are the end-users | WWW users
      (*) SEO: Search Engine Optimization
  • 6. Natural Language Search (NLS). Knowledge Graph.
  • 7. Chatbots and Natural Language Search
      Natural Language Search (Neural Networks)
      - Works on documents.
      - Users can ask any question from the documents.
      - Both the documents and questions are passed through the same Neural Network, producing the matching answer.
      - Diagram labels: document database, Neural Network (offline); <Q>, Neural Network (real-time), <A>.
      Intent based Chatbots (Statistical Methods)
      - Requires Q&A knowledge.
      - Able to scale with respect to question variants by applying statistical clustering methods (e.g. tf-idf, Bag-of-Words) to cluster question variants into ‘intents’.
      (Rules based) FAQs
      - Works only for specific hardcoded questions.
      - The only way to scale with respect to question variants is to extend the knowledge base by manually adding variants of a question.
      Example: “How do I replace the heating component of my iQoS?” = “Tell me how to change the heating component of my iQoS”. The two reduce to <Q> how replace heating component iQoS and <Q> how change heating component iQoS, which share the same intent, #repairIQOS.
  • 8. Chatbots and Natural Language Search (2)
      3-tier strategy:
      - A chatbot with its pre-defined Q&A set remains the entry point. Think of it as the 1st line of defense.
      - If the bot encounters a user query that cannot be mapped to one of its pre-configured intents, it performs an NLS over its KB. This is the 2nd line of defense.
      - If the user is not satisfied even with the search results, plan for a final handover to a live agent.
      Ref: “Chatbots & Natural Language Search: 2 sides of the same coin?”
  • 9. Positioning vs. e-commerce search
      - End user is searching for products (not answers)
      - Filter-oriented
      - Ratings, reviews
  • 10. Philip Morris’ use case: Operator Trainings
      - Hundreds to thousands of operators
      - Long manuals with specific terminology
      - One minute of machine downtime means 20,000 cigarettes not produced
      - Typical full-text search (Boolean search, no relevancy score)
      - Document Management System is manually classified
      - On-boarding difficulty
  • 11. Example of fine-grained results. Q. How many knives are there on the drums?
  • 12. Question Answering?
      - SQuAD dataset: a reference in Question Answering
      - 100,000+ Q&A on Wikipedia articles
      - State of the art is beating human performance
  • 13. DrQA Overview
      - Facebook AI Research, ACL 2017, “Reading Wikipedia to Answer Open-Domain Questions”.
      - Open source, BSD License: https://github.com/facebookresearch/DrQA
      - Pre-trained model available
  • 14. DrQA Overview (diagram labels): bigram TF-IDF (document retriever); bi-directional RNN (document reader).
  • 15. DrQA is easy to use on your own corpus!
      $ python build_db.py /path/to/data /path/to/saved/db.db
      $ python build_tfidf.py /path/to/doc/db /path/to/output/dir
      (figure: TF-IDF matrix of terms vs. docs, with example weights 0.06, 0.02, 0.03, 0.08)
      $ python interactive.py --reader-model multitask.mdl --retriever-model path/to/tfidf --doc-db path/to/saved/db.db
      >>> process('What is the answer to life, the universe, and everything?')
      Top Predictions:
      +------+--------+---------------------------------------------------+--------------+-----------+
      | Rank | Answer | Doc                                               | Answer Score | Doc Score |
      +------+--------+---------------------------------------------------+--------------+-----------+
      | 1    | 42     | Phrases from The Hitchhiker's Guide to the Galaxy | 47242        | 141.26    |
      +------+--------+---------------------------------------------------+--------------+-----------+
      Pre-trained model open sourced.
  • 16. DrQA to answer Operator’s questions?
      - Java toolkit to extract text + metadata from DOCX, PPT, XLS, PDF, JPEG, etc.
      - Apache Software Foundation
      - OCR
  • 17. DrQA to answer Operator’s questions? (https://github.com/facebookresearch/DrQA)
      P@5: 76%
      - Not a voice assistant
      - End user needs at least ~95%
      - Full control on the retriever
      - First stage to prioritize
  • 18. Introducing Elasticsearch
      - Open source, distributed
      - Highly scalable
      - RESTful API on top of Lucene capabilities
      - Support for full-text search (best of breed)
      - Easy to configure + extend
      - Seamlessly manages conflicts
      - Active community & popular
  • 19. Integrating Elasticsearch into DrQA’s pipeline (diagram label: Index)
  • 20. Integrating Elasticsearch into DrQA’s pipeline
      >>> from drqa.pipeline import DrQA
      >>> from drqa.retriever import ElasticDocRanker
      >>> model = DrQA(reader_model='reader_model.mdl',
      ...              ranker_config={'class': ElasticDocRanker,
      ...                             'options': {'elastic_url': '127.0.0.1:9200',
      ...                                         'elastic_index': 'mini',
      ...                                         'elastic_fields': 'content',
      ...                                         'elastic_field_doc_name': ['file', 'filename'],
      ...                                         'elastic_field_content': 'content'}})
      >>> model.process('How the tensioning of the V-belts should be done?')
      - Directly point to your server hosting Elastic.
      - Enables searching in any field, e.g. uni-grams, bi-grams, title, metadata, etc.
  • 21. The pipeline performance
      - P@5: 76% → 84% (P@5 ref.: 78%, DrQA)
      - F1 score: 42% (F1 score ref.: 79%, DrQA)
      - DrQA span ±10 tokens: 94% of top-ranked results contain the true answer
  • 22. Takeaways
      - Address pain points by combining Document Retrieval with Question Answering
      - If not answered, it will provide much more granular insights of the data
      - User elicitation & user experience: a top-down approach
      - End user does not know what to ask
  • 23. Future work: extend pipeline with BERT (*)
      - A general-purpose architecture to train models for multiple NLP tasks (sentiment analysis, etc.)
      - State of the art for SQuAD
      - Open source, published in Oct. 2018 by Google AI Research
      - High memory required: GPU with at least 12GB of RAM (Base model)
      - Enables multi-language queries
      - Add one layer to compute P_start(“token”) & P_end(“token”) for each token
      - Find the best pair by maximizing P_start(“token1”) * P_end(“token2”)
      (*) https://arxiv.org/abs/1810.04805, https://github.com/google-research/bert
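
The span-selection rule on the last slide (pick the token pair that maximizes P_start(“token1”) * P_end(“token2”)) can be sketched as below. This is only an illustration under stated assumptions: the function name `best_span`, the `max_len` cap, and the toy probabilities are invented here, and in a real pipeline the two probability lists would come from the reader model's output layer.

```python
from typing import List, Tuple


def best_span(p_start: List[float], p_end: List[float],
              max_len: int = 15) -> Tuple[int, int, float]:
    """Return (start, end, score) maximizing p_start[start] * p_end[end].

    Only spans with start <= end and at most `max_len` tokens are considered.
    `max_len` is an illustrative assumption, not a value from the slides.
    """
    if not p_start or len(p_start) != len(p_end):
        raise ValueError("p_start and p_end must be non-empty and equally long")
    best = (0, 0, -1.0)
    for i, ps in enumerate(p_start):
        # The end token may not precede the start token.
        for j in range(i, min(i + max_len, len(p_end))):
            score = ps * p_end[j]
            if score > best[2]:
                best = (i, j, score)
    return best


# Toy example with made-up probabilities (not model output).
tokens = ["the", "answer", "is", "forty", "two", "."]
p_start = [0.05, 0.05, 0.05, 0.70, 0.10, 0.05]
p_end = [0.05, 0.05, 0.05, 0.15, 0.65, 0.05]

start, end, score = best_span(p_start, p_end)
answer = " ".join(tokens[start:end + 1])
print(answer)
```

The exhaustive double loop is quadratic in the worst case, which is acceptable for paragraph-sized inputs; the length cap keeps it close to linear.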

Editor's Notes

  1. This includes PMI taking part in the open source community. Check out our GitHub to see the highly popular repos we have contributed to.
  2. According to a report from McKinsey, workers spend 20 percent of their time looking for internal information ...
  3. According to a report from McKinsey, workers spend 20 percent of their time looking for internal information ...
  4. Content: The amount of content stored in PMI is ridiculously small compared to what Google crawls every day. It should therefore be an advantage for PMI Enterprise Search.
     Traffic: While handling more traffic is more challenging from an infrastructure side, it offers more possibilities to capture data related to search queries, which can then be used to train machine-learning algorithms that optimize the search experience (e.g. suggest / auto-complete queries and boost the most relevant related pages). Google holds a huge advantage on that front.
     Content Sources: At PMI, content is fragmented and stored in many locations. This makes it difficult to implement content crawlers, because the sources expose content in different ways (APIs, flat files + DB for metadata, etc.). On the Google side, only one type of content is crawled: web pages (including all attached images etc.). The scale is larger but the variety is smaller.
     Information Management: At PMI, there are currently no strong practices around information architecture and management. Therefore, all types of information are mixed, irrespective of their business value for PMI and their relevance for the users. On the Google side, companies invest large amounts of resources into SEO to gain proper visibility in search results.
     Security: In an enterprise, access to documents and information is controlled to ensure compliance and protection of sensitive data. This poses a challenge when crawling sources that use ACLs to control access, since those ACLs must be imported to filter out search results that some users should not see. Google only works with public content, which completely removes that constraint.
     Search Experience: Google’s core business relies on effective search and targeted advertisements. They hire the best engineers to work on AI to constantly adjust the quality of the search results. At PMI, there is no such team (even at small scale) tasked with monitoring and continuously improving search relevance. Also, from a skill perspective, it is unrealistic to think it is possible to get even close to Google.
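
The Security point above (imported ACLs used to filter out results a user should not see) could look roughly like the following Elasticsearch query body. This is a sketch under assumptions, not the setup described in the talk: the field name `allowed_groups` and the helper `build_acl_filtered_query` are hypothetical, and the ACLs are assumed to have been indexed alongside each document as a keyword list. The `content` field name is the one used in the slide 20 configuration.

```python
import json
from typing import Dict, List


def build_acl_filtered_query(question: str, user_groups: List[str],
                             size: int = 5) -> Dict:
    """Build a query body that only matches documents the user may see."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Full-text relevance scoring on the document text.
                "must": [{"match": {"content": question}}],
                # Non-scoring filter: keep only documents whose (hypothetical)
                # `allowed_groups` field contains at least one of the user's groups.
                "filter": [{"terms": {"allowed_groups": user_groups}}],
            }
        },
    }


body = build_acl_filtered_query(
    "How should the V-belts be tensioned?", ["operators", "maintenance"]
)
print(json.dumps(body, indent=2))
```

Putting the ACL check in the `filter` clause rather than in `must` keeps it out of the relevance score, so restricting visibility does not change the ranking of the documents that remain.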