A Data Driven Approach to
Query Expansion in
Question Answering
         Leon Derczynski, Robert Gaizauskas,
              Mark Greenwood and Jun Wang

          Natural Language Processing Group
            Department of Computer Science
                    University of Sheffield, UK
Summary
   Introduce a system for QA
   Find that its IR component limits system performance
       Explore alternative IR components
       Identify which questions cause IR to stumble
   Using answer lists, find extension words that make
    these questions easier
   Show how knowledge of these words can rapidly
     accelerate the development of query expansion methods
   Show why one simple relevance feedback technique
    cannot improve IR for QA
How we do QA
   Question answering system follows a linear
    procedure to get from question to answers

       Pre-processing
       Text retrieval
       Answer Extraction


   Performance at each stage affects later results
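A minimal sketch of this linear flow, with toy stand-in functions (the real system, AnswerFinder, is a Java framework; none of these names are its API):

```python
# Toy sketch of the linear QA pipeline; all functions are hypothetical
# stand-ins, not AnswerFinder code.
def preprocess(question: str) -> str:
    # Stand-in for real pre-processing (anaphora resolution, target handling).
    return question.lower().rstrip("?")

def retrieve(query: str, corpus: dict, n: int = 20) -> list:
    # Stand-in retrieval: rank paragraphs by bag-of-words overlap with the query.
    terms = set(query.split())
    ranked = sorted(corpus, key=lambda d: -len(terms & set(corpus[d].lower().split())))
    return ranked[:n]

def extract_answer(question: str, doc_ids: list, corpus: dict) -> str:
    # Stand-in answer extraction: return the top-ranked paragraph.
    return corpus[doc_ids[0]] if doc_ids else ""

corpus = {"d1": "The Eiffel Tower is 324 m tall.", "d2": "Paris is in France."}
q = "How tall is the Eiffel Tower?"
print(extract_answer(q, retrieve(preprocess(q), corpus, n=1), corpus))
```

A failure at any of the three stand-in stages propagates: if retrieve returns nothing useful, extract_answer cannot recover.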
Measuring QA Performance
   Overall metrics
       Coverage
       Redundancy
   TREC provides answers
       Regular expressions for matching text
       IDs of documents deemed helpful
   Ways of assessing correctness
       Lenient: the document text contains an answer
       Strict: further, the document ID is listed by TREC
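A sketch of how coverage and redundancy can be computed from TREC-style answer data (answer regexes plus supported document IDs); the function and field names here are illustrative, not from the actual evaluation code:

```python
import re

def answer_bearing(retrieved, patterns, supported_ids, strict=True):
    # Count answer-bearing documents among those retrieved for one question.
    # Lenient: the text matches an answer regex; strict: additionally, the
    # document ID appears in TREC's list of judged-helpful documents.
    hits = 0
    for doc_id, text in retrieved:
        if any(re.search(p, text) for p in patterns):
            if not strict or doc_id in supported_ids:
                hits += 1
    return hits

def coverage_and_redundancy(questions, strict=True):
    # questions: list of dicts with 'retrieved' [(doc_id, text)],
    # 'patterns' (TREC answer regexes) and 'supported' (judged doc IDs).
    counts = [answer_bearing(q["retrieved"], q["patterns"], q["supported"], strict)
              for q in questions]
    coverage = sum(1 for c in counts if c > 0) / len(questions)
    redundancy = sum(counts) / len(questions)
    return coverage, redundancy
```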
Assessing IR Performance
   Low initial system performance

   Analysed each component in the system

       Question pre-processing correct

       Coverage and redundancy checked in IR part
IR component issues
   Only 65% of questions generate any text to
    be prepared for answer extraction
   IR failings cap the entire system performance
   Need to balance the amount of information
    retrieved for AE
   Retrieving more text boosts coverage, but
    also introduces excess noise
Initial performance
   Lucene statistics
        Question year    Coverage    Redundancy
        2004             63.6%       1.62
        2005             56.6%       1.15
        2006             56.8%       1.18

       Using strict matching, at paragraph level
Potential performance inhibitors
   IR Engine
       Is Lucene causing problems?
       Profile some alternative engines

   Difficult questions
       Identify which questions cause problems
       Examine these:
           Common factors
           How can they be made approachable?
Information Retrieval Engines
   AnswerFinder uses a modular framework, including
    an IR plugin for Lucene
   Indri and Terrier are two public domain IR engines,
    which have both been adapted to perform TREC
    tasks
       Indri – based on the Lemur toolkit and INQUERY engine
       Terrier – developed in Glasgow for dealing with terabyte
        corpora
   Plugins are created for Indri and Terrier, which are
    then used as replacement IR components
   Automated testing of overall QA performance done
    using multiple IR engines
IR Engine performance
        Engine      Coverage    Redundancy
        Indri       55.2%       1.15
        Lucene      56.8%       1.18
        Terrier     49.3%       1.00

 With n=20; strict retrieval; TREC 2006 question set; paragraph-level texts.

    Performance between engines does not seem to vary significantly
    Non-QA-specific IR engine tweaking is possibly not a great avenue for
     performance increases
Identification of difficult questions
   Coverage of 56.8% indicates that for over 40% of questions, no useful
     documents are retrieved.

   Some questions are difficult for all engines

   How to define a “difficult” question?

   Calculate average redundancy (over multiple engines) for each
    question in a set

   Questions with average redundancy less than a certain threshold
    are deemed difficult

   A threshold of zero is usually enough to find a sizeable dataset
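A sketch of this selection step, assuming per-engine redundancy scores have already been computed (names illustrative):

```python
def difficult_questions(redundancy_by_engine, threshold=0.0):
    # redundancy_by_engine: {engine_name: {question_id: redundancy}}.
    # A question is "difficult" if its redundancy, averaged over all
    # engines/configurations, does not exceed the threshold.
    engines = list(redundancy_by_engine.values())
    qids = engines[0].keys()
    return [qid for qid in qids
            if sum(e[qid] for e in engines) / len(engines) <= threshold]
```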
Examining the answer data
   TREC answer data provides hints as to which
     documents an IR engine ideal for QA should
     retrieve
       Helpful document lists
       Regular expressions of answers
   Some questions are marked by TREC as
    having no answer; these are excluded from
    the difficult question set
Making questions accessible
   Given the answer-bearing documents and answer
     text, it’s easy to extract words from answer-bearing
     paragraphs
   For example, where the answer is “baby monitor”:
       The inventor of the baby monitor found this device
       almost accidentally
   These surrounding words may improve coverage
    when used as query extensions
   How can we find out which extension words are
    most helpful?
Rebuilding the question set
   Only use answerable difficult questions
   For each question:
       Add original question to the question set as a control
       Find target paragraphs in “correct” texts
       Build a list of all words in that paragraph, except: answers,
        stop words, and question words
       For each word:
         Create a sub-question which consists of the original
          question, extended by that word
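A sketch of this rebuilding step for one question, under the assumption that answer-bearing paragraphs have already been located via the TREC document lists and answer regexes (all names illustrative):

```python
def build_subquestions(question, answer_paragraphs, answers, stop_words):
    # Exclude answers, stop words and the question's own words from the
    # candidate extension vocabulary.
    exclude = set(stop_words)
    exclude |= {w for a in answers for w in a.lower().split()}
    exclude |= set(question.lower().split())
    candidates = {w for p in answer_paragraphs
                  for w in p.lower().split()} - exclude
    # The unmodified question is kept as a control; each candidate word
    # yields one extended sub-question (Q + E).
    return [question] + [f"{question} {w}" for w in sorted(candidates)]
```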
Rebuilding the question set
Example:

   Single factoid question: Q + E
       How tall is the Eiffel tower? + height


   Question in a series: Q + T + E
       Where did he play in college? + Warren Moon +
        NFL
Do data-driven extensions help?
   Base performance is at or below the difficult
    question threshold (typically zero)

   Any extension that brings performance above zero
    is deemed a “helpful word”

   From the set of difficult questions, 75% were made
    approachable by using a data-driven extension

   If we can add these terms accurately to questions,
    the cap on answer extraction performance is raised
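In sketch form, the helpful-word test reduces to one retrieval run per sub-question; run_ir below is a stand-in for a full retrieval that returns a redundancy score:

```python
def helpful_words(question, candidates, run_ir, threshold=0.0):
    # A candidate extension is "helpful" if appending it lifts the
    # question's redundancy above the difficult-question threshold.
    return [w for w in candidates if run_ir(f"{question} {w}") > threshold]
```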
Do data-driven extensions help?
   Question Where did he play in college?
   Target    Warren Moon
   Base redundancy is zero

   Extensions
       Football          Redundancy: 1
       NFL               Redundancy: 2.5


   Adding some generic related words improves
    performance
Do data-driven extensions help?
   Question Who was the nominal leader after the
              overthrow?
   Target    Pakistani government overthrown in 1999
   Base redundancy is zero

   Extensions
       Islamabad         Redundancy: 2.5
       Pakistan          Redundancy: 4
       Kashmir           Redundancy: 4

   Location based words can raise redundancy
Do data-driven extensions help?
   Question    Who have commanded the division?
   Target      82nd Airborne Division
   Base redundancy is zero
   Question expects a list of answers

   Extensions
       Col                     Redundancy: 2
       Gen                     Redundancy: 3
       officer                 Redundancy: 1
       decimated               Redundancy: 1

   The proper names for ranks help; this can be hinted at by “Who”
   Events related to the target may suggest words
   Possibly not a victorious unit!
Observations on helpful words
   Inclusion of pertainyms has a positive effect
    on performance, agreeing with more general
    observations in Greenwood (2004)
   Army ranks stood out strongly
   Rank titles such as “Col” and “Gen” are excluded by
     some stop lists, motivating an always-include list
   Some related words help, though there’s
    often no deterministic relationship between
    them and the questions
Measuring automated expansion
   Known helpful words are also the target set of words
    that any expansion method should aim for
   Once the target expansions are known, measuring
    automated expansion becomes easier
   No need to perform IR for every candidate
    expanded query (some runs over AQUAINT took up
    to 14 hours on a 4-core 2.3GHz system)
   Rapid evaluation permits faster development of
    expansion techniques
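For example, a candidate expansion method can be scored directly against the known helpful words, with no retrieval at all (a sketch; names illustrative):

```python
def expansion_precision(proposed_terms, known_helpful):
    # Fraction of the method's proposed expansion terms that fall in the
    # pre-computed helpful-word set for this question.
    proposed = set(proposed_terms)
    return len(proposed & set(known_helpful)) / len(proposed) if proposed else 0.0
```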
Relevance feedback in QA
   Simple RF works by using features of an initial
    retrieval to alter a query
   We picked the highest frequency words in the
    “initially retrieved texts”, and used them to
    expand a query
   The size of the IRT set is denoted r
   Previous work (Monz 2003) looked at relevance
    feedback using a small range of values for r
   Different sizes of initial retrievals are used, between
    r=5 and r=50
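A minimal sketch of this TF-based variant (plain term counting; a real implementation would share the engine's tokenisation and stop list):

```python
from collections import Counter

def tf_feedback_terms(initially_retrieved_texts, r=5, k=3, stop_words=frozenset()):
    # Count raw term frequency over the top-r initially retrieved texts (IRT)
    # and return the k most common words as query expansion terms.
    counts = Counter(w for text in initially_retrieved_texts[:r]
                     for w in text.lower().split()
                     if w not in stop_words)
    return [term for term, _ in counts.most_common(k)]
```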
Rapidly evaluating RF
   Three metrics show how a query expansion
    technique performs:
       Percentage of all helpful words found in IRT
         This shows the intersection between words in initially
          retrieved texts, and the helpful words.
       Percentage of texts containing helpful words
         If this is low, then the IR system does not retrieve many
          documents containing helpful words, given the initial query
       Percentage of expansion terms that are helpful
         This is a key statistic; the higher this is, the better
          performance is likely to be
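A sketch of the three metrics for a single question, assuming the helpful words and the RF-chosen terms are already known (names illustrative):

```python
def rf_metrics(irt_texts, rf_terms, helpful_words):
    # Three per-question statistics for judging an RF-style expansion method.
    texts = [set(t.lower().split()) for t in irt_texts]
    irt_vocab = set().union(*texts) if texts else set()
    helpful = set(helpful_words)
    helpful_in_irt = len(helpful & irt_vocab) / len(helpful) if helpful else 0.0
    texts_with_helpful = (sum(1 for t in texts if t & helpful) / len(texts)
                          if texts else 0.0)
    rf_helpful = (len(set(rf_terms) & helpful) / len(rf_terms)
                  if rf_terms else 0.0)
    return helpful_in_irt, texts_with_helpful, rf_helpful
```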
Relevance feedback predictions
     RF selects some words to add to a query, based on an initial search.

                                        2004     2005     2006
      Helpful words found in IRT         4.2%    18.6%     8.9%
      IRT containing helpful words      10.0%    33.3%    34.3%
      RF words that are “helpful”       1.25%    1.67%    5.71%
    Less than 35% of the documents used in relevance feedback actually
     contain helpful words
    Picking helpful words out from initial retrievals is not easy, when there’s
     so much noise
     Due to the small probability of adding helpful words, relevance feedback
      is unlikely to make difficult questions accessible.
    Adding noise to the query will drown out otherwise helpful documents for
     non-difficult questions
Relevance feedback results
      Coverage at n docs    r=5      r=50     Baseline
      10                    34.7%    28.4%    43.4%
      20                    44.4%    39.8%    55.3%
   Only 1.25% - 5.71% of the words that relevance
    feedback chose were actually helpful; the rest only
    add noise
   Performance using TF-based relevance feedback is
    consistently lower than the baseline
   Hypothesis of poor performance is supported
Conclusions
   IR engine performance for QA does not vary
    wildly
   Identifying helpful words provides a tool for
    assessing query expansion methods
   TF-based relevance feedback cannot be
    generally effective in IR for QA
   Linguistic relationships exist that can help in
    query expansion
Any questions?

Speaker notes

  1. The structure of this talk: examine the IR performance cap by trying out a few different IR engines, and working out which are the toughest questions.
  2. We used a simple, linear, clearly defined QA system built at Sheffield that has been entered into previous TREC QA tracks for experimentation. There are three steps: processing of a question, including anaphora resolution and perhaps dealing with targets in question series; performing some IR to get texts relevant to the question; and using logic to get a suitable answer out of the retrieved texts. Any failure early on will cap the performance of a later component. This gave us a need to assess performance.
  3. Coverage – the proportion of questions for which the IR engine returns at least one document containing the answer. Redundancy – the number of answer-containing documents found per question. TREC gives answers after each competition: a list of expressions that match answers, and the IDs of documents that judges have found useful. Due to the size of the corpora, these aren’t comprehensive lists, so it’s easy to get a false negative for (say) redundancy when a document that is actually helpful was not assessed by TREC. Documents can be matched in a couple of ways: lenient, where the answer text is found (though the context may be completely wrong), and strict, where the retrieved document not only contains the answer text but is also one that TREC judges marked as helpful.
  4. AND WE FOUND…
  5. Because of the necessarily linear system design, IR component problems limit AE. If we can’t provide any useful documents, then we’ve little chance of getting the right answer out. Paragraph vs. document level retrieval: paragraph provides less noise, but is hard to do well; document level gives a huge coverage gain, but then chokes answer extraction. Some work with the AE part found that about 20 paragraphs was right.
  6. Coverage is between half and two-thirds. Rarely are more than one in twenty retrieved documents actually useful.
  7. Is the problem with our IR implementation? Could another generic component work? We tested a few options. Which questions are tripping us up? Do they have common factors – grammatical features, expected answer type (a person’s name, a date, a number) – is one particular group failing? How can we tap into these tough questions?
  8. Used Java. Scripted runs in a certain environment – e.g. the number of documents to be found. Post-processed the results of retrievals to score them by a variety of metrics.
  9. No noticeable performance changes happened with alternative generic IR components. Alternatives seem slightly worse off than the original in this configuration. Tuning generic IR parameters seems unlikely to yield large QA performance boosts.
  10. There’s still a large body of difficult questions, and many are uniformly tough. If we’re to examine a concrete set of harder questions, a definition is required. An average redundancy measure, derived from multiple engines and configurations (e.g. paragraph, document, lenient, strict), is worked out for every question. All questions with average redundancy below a threshold are difficult. A threshold as low as zero still provides a good sample to work with.
  11. To work out how these difficult questions should be answered, we consulted the TREC answer lists: details of useful documents and regular expressions for each answer. Any unanswerable questions were removed from the difficult list.
  12. Once we know where the answers are – the documents that contain them, and the paragraph inside those documents – we can examine surrounding words for context. Using these words as extensions may improve coverage. How do we find out if this is true, and which ones help?
  13. Stick to the usable set of questions. The original question (OQ) is readily available for comparison. The OQ also acts as a canary for validating the IR parameters of a run – if its performance isn’t below the difficult question threshold, something has gone wrong.
  14. We started out by looking at questions that were impossible for answer extraction, because no texts were found for them in the IR stage. All extension words that bring useful documents to the fore are useful. Three-quarters of tough questions can be made accessible by query extension with context-based words. This shows a possibility for lifting the limit on AE performance significantly.
  15. Adding the name of the capital of the country in question immediately brought useful documents up. Adding the name of the country alongside its adjective also helped.
  16. Adding these military-type words is helpful, as is adding a term related to events in the target’s past. This unit may not have fared so well during the scope of news articles in the corpora – decimated!
  17. Pertainyms – variations on the parts of speech of a location, e.g. the adjective describing something from a country, or its title. Greenwood (2004) investigates relations between these pertainyms and their effects on search performance. Col and Gen both brought the answers up from the index. Col, Gen and other titles are excluded by some stop lists, so we brought in a whitelist of words to make sure these were made into extension candidates. “Military” was also helpful in the 82nd Airborne question.
  18. Now we have a set of words that a perfect expansion algorithm should provide. Comparing these with an expansion algorithm’s output eliminates the need to re-run IR for the whole set of expansion candidates. This sometimes took us over half a day using a reasonable system, so the time saving is considerable.
  19. Basic relevance feedback will execute an initial information retrieval and use features of this to pick expansion terms. We chose a term-frequency based measure, selecting common words from the initially retrieved texts (IRTs). The number of documents examined to find expansion words is r.
  20. Used a trio of metrics. Firstly, the coverage of the terms found in IRTs over the available set of helpful words. Next, the proportion of IRTs that had any useful words in them at all; for example, when retrieving 20 documents, if only 1 has any helpful words, this metric is 5%. Finally, the intersection between words chosen for relevance feedback and those that are actually helpful gives a direct evaluation of the extension algorithm.
  21. We examined an initial retrieval to see how helpful the IR data could be. Not many of the helpful words occurred (under 20%). Only around a third of documents contained any useful words – the rest only provided noise. The single-figure percentages for the intersection between extensions used and those that are helpful give a negative outlook for term-frequency based RF. Finally, adding massive amounts of noise – up to 98% for the 2004 set – will push helpful documents out.
  22. Testing this particular relevance feedback method shows that, as predicted by the very low occurrence of helpful words in the extensions, performance was low – in fact, consistently lower than when using no query extension at all, due to the excess noise introduced. This supports the hypothesis that TF-based RF is not helpful in IR for QA.
  23. The particular implementation using default configurations of general-purpose IR engines isn’t too important. Now we can predict how well an extension algorithm will work without performing a full retrieval. Term-frequency based relevance feedback, in the circumstances described, cannot help IR for QA. There are linguistic relationships between query terms and useful query expansions that, with further work, can be exploited to raise coverage.