Tuning Topical Queries through Context Vocabulary Enrichment: a Corpus-Based Approach
Carlos M Lorenzetti, Ana G Maguitman
Grupo de Investigación en Recuperación de Información y Gestión del Conocimiento
Laboratorio de Investigación y Desarrollo en Inteligencia Artificial
Universidad Nacional del Sur, Av. L.N. Alem 1253, Bahía Blanca, Argentina
CONICET, AGENCIA
Introduction
Context–Based Search Java?
Context–Based Search Java? Animals Computers Consumables Entertainment Geography Flora Ships
Context–Based Search Context Articles Newspapers Others
Context–Based Search Java? Context Articles Newspapers Others Geography
Query tuning
Different Role of Terms: Precision, Recall
Descriptors and Discriminators Java Language Applets Code Topic: Java Virtual Machine NetBeans Computers JVM Ruby Programming JDK Virtual Machine
Descriptors and Discriminators Java Language Applets Code Topic: Java Virtual Machine NetBeans Computers JVM Ruby Programming JDK Virtual Machine Good descriptors
Descriptors and Discriminators Java Language Applets Code Topic: Java Virtual Machine NetBeans Computers JVM Ruby Programming JDK Virtual Machine Good discriminators
Descriptors and Discriminators
Topic: Java Virtual Machine
Matrix H: number of occurrences of term j in document i

Term         Initial context  (1)  (2)  (3)  (4)
java                4           2    5    5    2
machine             2           0    2    3    6
virtual             1           0    1    1    0
language            1           1    1    2    0
programming         3           0    2    2    0
coffee              0           3    0    0    3
island              0           2    0    0    4
province            0           1    0    0    4
jvm                 0           0    1    2    0
jdk                 0           0    3    3    0
Documents Descriptors
Topic: Java Virtual Machine
Descriptive power of a term in a document (initial context)

Term         Occurrences  Descriptive power
java              4            0.718
machine           2            0.359
virtual           1            0.180
language          1            0.180
programming       3            0.539
coffee            0            0.000
island            0            0.000
province          0            0.000
jvm               0            0.000
jdk               0            0.000
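The expression for descriptive power did not survive extraction, but the values on this slide are consistent with dividing each term count by the Euclidean norm of the document's count vector. The sketch below assumes that form; the function name `descriptive_power` is hypothetical, and the counts are the initial-context occurrences listed on the slide.

```python
from math import sqrt

# Occurrences of each term in the initial context, as listed on the slide.
initial_context = {
    "java": 4, "machine": 2, "virtual": 1, "language": 1, "programming": 3,
    "coffee": 0, "island": 0, "province": 0, "jvm": 0, "jdk": 0,
}

def descriptive_power(counts):
    """Assumed form: term count divided by the Euclidean norm of all counts."""
    norm = sqrt(sum(c * c for c in counts.values()))
    return {term: (c / norm if norm else 0.0) for term, c in counts.items()}

powers = descriptive_power(initial_context)
for term, value in powers.items():
    print(f"{term:12s} {value:.3f}")
```

With these counts the norm is the square root of 31, so java comes out at about 0.718 and every term absent from the initial context stays at zero, matching the slide.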
Documents Discriminators
Topic: Java Virtual Machine
Discriminating power of a term in a document (initial context)

Term         Occurrences  Discriminating power
java              4            0.447
machine           2            0.500
virtual           1            0.577
language          1            0.500
programming       3            0.577
coffee            0            0.000
island            0            0.000
province          0            0.000
jvm               0            0.000
jdk               0            0.000
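The expression for discriminating power is also missing from the extracted slide. The values shown are consistent with 1 over the square root of the number of documents (initial context included) that contain the term, and 0 when the term is absent from the document. The sketch below assumes that form. The page columns are a reading of the garbled occurrence-matrix slide and should be treated as an assumption, as should the names `TERMS`, `DOCS`, and `discriminating_power`.

```python
from math import sqrt

TERMS = ["java", "machine", "virtual", "language", "programming",
         "coffee", "island", "province", "jvm", "jdk"]

# Term counts per document, in TERMS order (assumed reading of the matrix slide).
DOCS = {
    "context": [4, 2, 1, 1, 3, 0, 0, 0, 0, 0],
    "page1":   [2, 0, 0, 1, 0, 3, 2, 1, 0, 0],
    "page2":   [5, 2, 1, 1, 2, 0, 0, 0, 1, 3],
    "page3":   [5, 3, 1, 2, 2, 0, 0, 0, 2, 3],
    "page4":   [2, 6, 0, 0, 0, 3, 4, 4, 0, 0],
}

def discriminating_power(doc_name, docs=DOCS):
    """Assumed form: 1/sqrt(document frequency) if the term occurs in the document, else 0."""
    result = {}
    for j, term in enumerate(TERMS):
        doc_freq = sum(1 for counts in docs.values() if counts[j] > 0)
        present = docs[doc_name][j] > 0
        result[term] = 1 / sqrt(doc_freq) if present else 0.0
    return result

delta = discriminating_power("context")
for term, value in delta.items():
    print(f"{term:12s} {value:.3f}")
```

Under this reading, java occurs in all five documents and so gets 1/sqrt(5), about 0.447, while virtual and programming occur in only three and get about 0.577, which agrees with the slide.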
Documents comparison criteria: document similarity
[Figure: documents d1 and d2 drawn as vectors in the term space K1, K2, K3; cosine similarity]
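As a minimal sketch of the comparison criterion named on this slide, the snippet below computes the cosine similarity between the initial context and each of the four example pages. The count vectors are an assumed reading of the occurrence-matrix slide (term order: java, machine, virtual, language, programming, coffee, island, province, jvm, jdk), and `cosine_similarity` is a hypothetical helper name.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Assumed counts; see the lead-in for the term order.
context = [4, 2, 1, 1, 3, 0, 0, 0, 0, 0]
pages = {
    "page1": [2, 0, 0, 1, 0, 3, 2, 1, 0, 0],
    "page2": [5, 2, 1, 1, 2, 0, 0, 0, 1, 3],
    "page3": [5, 3, 1, 2, 2, 0, 0, 0, 2, 3],
    "page4": [2, 6, 0, 0, 0, 3, 4, 4, 0, 0],
}

sims = {name: cosine_similarity(context, vec) for name, vec in pages.items()}
for name, value in sims.items():
    print(f"{name}: {value:.3f}")
```

With these vectors the two programming pages (2 and 3) score roughly 0.86 and 0.84, while the island and coffee pages (1 and 4) score roughly 0.37 and 0.40.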
Topics Descriptors
Topic: Java Virtual Machine
Term descriptive power in a topic of a document (initial context)
Initial context occurrences: java 4, machine 2, virtual 1, language 1, programming 3, coffee 0, island 0, province 0, jvm 0, jdk 0
Descriptive power values, in ascending order: 0.014, 0.032, 0.040, 0.040, 0.055, 0.064, 0.089, 0.124, 0.158, 0.385
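The slide does not preserve the formula for topic-level descriptive power. One form that reproduces all ten listed values is a similarity-weighted average, over the retrieved pages, of each term's squared document-level descriptive power, with cosine similarity to the initial context as the weight. The sketch below assumes that form and an assumed reading of the occurrence matrix; the names `TERMS`, `CONTEXT`, `PAGES`, and `topic_descriptive_power` are hypothetical.

```python
from math import sqrt

TERMS = ["java", "machine", "virtual", "language", "programming",
         "coffee", "island", "province", "jvm", "jdk"]
CONTEXT = [4, 2, 1, 1, 3, 0, 0, 0, 0, 0]          # initial context counts
PAGES = [                                          # assumed page counts
    [2, 0, 0, 1, 0, 3, 2, 1, 0, 0],
    [5, 2, 1, 1, 2, 0, 0, 0, 1, 3],
    [5, 3, 1, 2, 2, 0, 0, 0, 2, 3],
    [2, 6, 0, 0, 0, 3, 4, 4, 0, 0],
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def topic_descriptive_power(context, pages):
    """Assumed form: sum_k sim(context, page_k) * lambda(page_k, term)^2 / sum_k sim(context, page_k)."""
    sims = [cosine(context, page) for page in pages]
    total = sum(sims)
    powers = {}
    for j, term in enumerate(TERMS):
        weighted = 0.0
        for sim, page in zip(sims, pages):
            norm_sq = sum(c * c for c in page)
            if norm_sq:
                # page[j]^2 / norm_sq is the squared document-level descriptive power
                weighted += sim * page[j] ** 2 / norm_sq
        powers[term] = weighted / total if total else 0.0
    return powers

topic_descriptors = topic_descriptive_power(CONTEXT, PAGES)
for term, value in sorted(topic_descriptors.items(), key=lambda kv: kv[1]):
    print(f"{term:12s} {value:.3f}")
```

Under these assumptions java ranks highest (about 0.385) and jdk, which never occurs in the initial context, still reaches about 0.124 because it is frequent in the pages most similar to the context.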
Topics Discriminators
Topic: Java Virtual Machine
Term discriminating power in a topic of a document (initial context)

Term         Occurrences  Discriminating power
jvm               0            0.848
jdk               0            0.848
virtual           1            0.566
programming       3            0.566
machine           2            0.524
language          1            0.517
java              4            0.493
coffee            0            0.385
island            0            0.385
province          0            0.385
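The formula for topic-level discriminating power is likewise missing. The listed values are consistent with summing, over the retrieved pages that contain the term, the page's cosine similarity to the initial context divided by the number of documents (initial context included) containing the term. The sketch below assumes that form and an assumed reading of the occurrence matrix; all names are hypothetical.

```python
from math import sqrt

TERMS = ["java", "machine", "virtual", "language", "programming",
         "coffee", "island", "province", "jvm", "jdk"]
CONTEXT = [4, 2, 1, 1, 3, 0, 0, 0, 0, 0]          # initial context counts
PAGES = [                                          # assumed page counts
    [2, 0, 0, 1, 0, 3, 2, 1, 0, 0],
    [5, 2, 1, 1, 2, 0, 0, 0, 1, 3],
    [5, 3, 1, 2, 2, 0, 0, 0, 2, 3],
    [2, 6, 0, 0, 0, 3, 4, 4, 0, 0],
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def topic_discriminating_power(context, pages):
    """Assumed form: sum_k sim(context, page_k) * delta(term, page_k)^2,
    where delta^2 is 1/document-frequency when the term occurs in page_k."""
    docs = [context] + pages
    sims = [cosine(context, page) for page in pages]
    powers = {}
    for j, term in enumerate(TERMS):
        doc_freq = sum(1 for d in docs if d[j] > 0)
        powers[term] = sum(sim / doc_freq
                           for sim, page in zip(sims, pages) if page[j] > 0)
    return powers

topic_discriminators = topic_discriminating_power(CONTEXT, PAGES)
for term, value in sorted(topic_discriminators.items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {value:.3f}")
```

Under these assumptions jvm and jdk come out on top (about 0.848) because they occur only in the two pages closest to the context, while java, which occurs everywhere, drops to about 0.493.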
Proposed Algorithm
[Figure: terms w1 ... wm taken from the context; a roulette that forms query 01 ... query n; the corresponding result 01 ... result n; weighted term lists DESCRIPTORS (0.5, 0.25, ..., 0.1) and DISCRIMINATORS (0.4, 0.37, ..., 0.01); steps numbered 1 to 4]
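The diagram places a "Roulette" between the weighted term lists and the generated queries. One plausible reading is roulette-wheel selection, where a term's chance of entering a query is proportional to its weight. The sketch below illustrates that reading only: the query size, sampling without replacement, and the function name `roulette_query` are assumptions, and the weights are the topic discriminator values from the earlier slide.

```python
import random

def roulette_query(weights, size, rng):
    """Draw up to `size` distinct terms, each draw proportional to the remaining weights."""
    pool = dict(weights)
    query = []
    while pool and len(query) < size:
        terms = list(pool)
        term = rng.choices(terms, weights=[pool[t] for t in terms], k=1)[0]
        query.append(term)
        del pool[term]  # sample without replacement so a query has no repeated terms
    return query

# Topic discriminating power values from the slide, used here as roulette weights.
discriminators = {
    "jvm": 0.848, "jdk": 0.848, "virtual": 0.566, "programming": 0.566,
    "machine": 0.524, "language": 0.517, "java": 0.493,
    "coffee": 0.385, "island": 0.385, "province": 0.385,
}

rng = random.Random(0)  # fixed seed so repeated runs give the same queries
queries = [roulette_query(discriminators, size=3, rng=rng) for _ in range(4)]
for q in queries:
    print(" ".join(q))
```

Because selection is probabilistic, low-weight terms can still enter a query occasionally, which leaves room for exploring vocabulary outside the highest-ranked terms.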
Evaluation
[Figure: topic hierarchy labelled 1st level, 2nd level, 3rd level, with nodes Top, Home, Science, Arts, Cooking, Family, Childcare]
Evaluation –   N  Similarity   Context update Top/Computers/Open_Source/Software Query formulation and retrieval process 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 20 40 60 80 100 120 140 160 180 iteration novelty-driven similarity Maximum Average Minimum [0.5866; 0.6073] 0.5970 best [0.0618; 0.0704] 0.0661 1 st 95% CI Mean  N
Evaluation –   N  Similarity [0.0822; 0.0924] 0.087 Baseline [0.5866; 0.6073] 0.597 Incremental [0.0710; 0.0803] 0.075 Bo1-DFR 95% CI Mean  N
Evaluation – Precision
[Chart: Incremental (66.96%), Bo1-DFR (24.33%), Baseline (8.7%)]

Method        Mean Precision   95% CI
Bo1-DFR       0.307            [0.2859; 0.3298]
Incremental   0.354            [0.3325; 0.3764]
Baseline      0.266            [0.2461; 0.2863]

Evaluation – Semantic Precision
[Chart: Incremental (65.18%), Bo1-DFR (27.90%), Baseline (6.92%)]

Method        Mean Precision S   95% CI
Bo1-DFR       0.590              [0.5750; 0.6066]
Incremental   0.622              [0.6068; 0.6372]
Baseline      0.553              [0.5383; 0.5679]
Conclusions
Thank you! CONICET AGENCIA Laboratorio de Investigación y Desarrollo en Inteligencia Artificial lidia.cs.uns.edu.ar Universidad Nacional del Sur Bahía Blanca www.uns.edu.ar



Editor's Notes

  1. Context-based search is the process of seeking information related to a user’s thematic context. Meaningful automatic context-based search can only be achieved if the semantics of the terms in the context under analysis is reflected in the search queries. For example, if a user is searching using their own words ...
  2. He or she could find a lot of topics related to this search.
  3. An information request is usually initiated or generated within a task. For example, if the user is editing or reading a document on a specific topic, he may be willing to explore new material related to that topic. Topical queries can be formed using small sets of terms from the user’s context.
  4. And in this way we could disambiguate a query that belongs to more than one topic.
  5. Query tuning is usually achieved by replacing or extending the terms of a query, or by adjusting the weights of a query vector. Relevance feedback is a query refinement mechanism used to tune queries based on relevance assessments of the query's results. A driving hypothesis for relevance feedback methods is that it may be difficult to formulate a good query when the collection of documents isn't known in advance, but it's easy to judge particular documents, so it makes sense to engage in an iterative query refinement process. A typical relevance feedback scenario involves the following steps: (1) a query is formulated; (2) the system returns an initial set of results; (3) a relevance assessment of the returned results is issued (the relevance feedback); (4) the system computes a better representation of the information need based on this feedback; (5) the system returns a revised set of results. Depending on the level of automation of step 3, we can distinguish three forms of feedback. Supervised feedback requires explicit feedback, which is typically obtained from users who indicate the relevance of each retrieved document. Unsupervised feedback applies blind relevance feedback, and typically assumes that the top k documents returned by a search process are relevant. In semi-supervised feedback, the relevance of a document is inferred by the system. A common approach is to monitor user behavior (e.g., which documents are selected for viewing, or the time spent viewing a document). Provided that the information-seeking process is performed within a thematic context, another automatic way to infer the relevance of a document is to compute its similarity to the user's current context. This is the approach used in this work.
  6. Much work has addressed the problem of computing the informativeness of a term across a corpus, and a good deal of research has focused on computing the descriptive and discriminating power of a term in a document with respect to a corpus. All this work, however, is based on a predefined collection of documents and is independent of a thematic context. In previous work we proposed to study the descriptive and discriminating power of a term based on its distribution across the topics of pages returned by a search engine. To distinguish between topic descriptors and discriminators, we argue that good topic descriptors can be found by looking for terms that occur often in documents related to the given topic. Good topic discriminators, on the other hand, can be found by looking for terms that occur only in documents related to the given topic. Both topic descriptors and discriminators are important as query terms.
  7. Because topic descriptors occur often in relevant pages, using them as query terms may improve recall.
  8. Similarly, good topic discriminators occur primarily in relevant pages, and therefore using them as query terms may improve precision.
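The trade-off between the two kinds of terms can be illustrated with the standard definitions of precision and recall. This is a minimal sketch with made-up document identifiers: the "descriptor" query is assumed to retrieve many pages, some off-topic, and the "discriminator" query is assumed to retrieve few pages, all on-topic.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical relevant set and query results.
relevant = {"d1", "d2", "d3", "d4"}
descriptor_results = ["d1", "d2", "d3", "d7", "d8", "d9"]  # broad query
discriminator_results = ["d1", "d2"]                       # narrow query

desc_p, desc_r = precision_recall(descriptor_results, relevant)
disc_p, disc_r = precision_recall(discriminator_results, relevant)
print(desc_p, desc_r)
print(disc_p, disc_r)
```

In this toy case the broad query finds more of the relevant pages (higher recall) and the narrow query returns only relevant pages (higher precision), which is why both kinds of terms are useful.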
  9. Now we’ll see a simple example to assess the potential of these kinds of terms.
  10. For example, if we choose the topic Java Virtual Machine, we could take the following words in our context:
  11. So, intuitively, and in line with the definition, we could say that good descriptors would be words such as Java, Machine or Virtual, and …
  12. Good discriminators would be: JVM and JDK.
  13. Let’s look at a more precise, practical example. As we see in this slide, we build a matrix of documents against terms. Our initial context is the first column of that matrix, and the next columns are the pages that we could obtain through a search engine by making queries with the initial context’s terms. By definition, each cell of the matrix holds the number of occurrences of a term in a document. In this example we have four pages: two of them about the island and coffee, and the other two about Java as a programming language.
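As a minimal sketch of how such a matrix can be built, the snippet below counts term occurrences per document. The documents and vocabulary are made up and are not the values on the slide. Rows are documents here, so the slide's layout (documents as columns) is the transpose of this matrix.

```python
from collections import Counter

def build_matrix(documents, vocabulary):
    """Return H where H[i][j] is the number of occurrences of term j in document i."""
    rows = []
    for tokens in documents:
        counts = Counter(tokens)
        # Counter returns 0 for terms that do not occur in the document.
        rows.append([counts[term] for term in vocabulary])
    return rows

vocabulary = ["java", "machine", "virtual", "coffee", "island", "jvm", "jdk"]
documents = [
    ["java", "java", "virtual", "machine"],   # initial context (hypothetical)
    ["java", "jvm", "jdk", "machine"],        # page about the language
    ["java", "island", "coffee", "coffee"],   # page about the island
]
H = build_matrix(documents, vocabulary)
for row in H:
    print(row)
```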
  14. We define the descriptive power of a term in a document with this expression, and we can see the resulting values for the terms. Note that the terms that don’t belong to the initial context get a value of zero.
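The slide's expression is not reproduced in these notes. As a minimal sketch, assume the descriptive power of a term in a document is its occurrence count divided by the Euclidean norm of the document's count vector. This normalization is an assumption, chosen because it is consistent with the observation that terms absent from the document score zero. The counts are illustrative.

```python
import math

def descriptive_power(doc_counts):
    """Assumed form: count of the term divided by the L2 norm of all counts."""
    norm = math.sqrt(sum(c * c for c in doc_counts.values()))
    if norm == 0:
        return {term: 0.0 for term in doc_counts}
    return {term: c / norm for term, c in doc_counts.items()}

# Illustrative counts for an initial context about the Java Virtual Machine.
context_counts = {
    "java": 4, "programming": 3, "machine": 2,
    "virtual": 1, "language": 1, "jvm": 0, "jdk": 0,
}
power = descriptive_power(context_counts)
for term, value in sorted(power.items(), key=lambda item: -item[1]):
    print(f"{term:12s} {value:.3f}")
```

Under this assumption the most frequent term of the context is its best descriptor, and terms such as jvm and jdk, which the context never mentions, cannot be descriptors of that single document.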
  15. We also define the discriminating power of a term in a document with this other expression, and we see the results. Since our objective is to learn the user’s needs, extracting the descriptors and discriminators of individual documents (such as the user context) is not enough: we need to find the topic descriptors and discriminators of the user context. Identifying these terms calls for an incremental method that identifies which documents are similar to the user context. So, we need …
  16. A document comparison criterion, and we choose cosine similarity. It is one of the simplest ways to compare documents and one of the most common methods in IR. We don’t explain this method here, but with this criterion we define the notion of topic: a topic is a group of documents with a high cosine similarity.
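For reference, cosine similarity between two term-count vectors is their dot product divided by the product of their Euclidean norms. A minimal sketch with made-up vectors over the vocabulary (java, machine, virtual, coffee, island):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention used here for empty documents
    return dot / (norm_a * norm_b)

context = [2, 1, 1, 0, 0]       # hypothetical initial context
jvm_page = [3, 2, 1, 0, 0]      # page about the programming language
island_page = [1, 0, 0, 3, 2]   # page about the island and coffee

sim_jvm = cosine_similarity(context, jvm_page)
sim_island = cosine_similarity(context, island_page)
print(round(sim_jvm, 3), round(sim_island, 3))
```

Both pages contain the word java, but only the first one shares the rest of the context's vocabulary, so it ends up much closer to the context and would belong to the same topic.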
  17. Using the previous definitions, we define the descriptive power of a term in the topic of a document with this equation. Again we see the weights reached by every term, and we note that Java and Machine are good topic descriptors, as we mentioned before.
  18. We also define the notion of the discriminating power of a term in the topic of a document, and here we note one of the most important points: terms like JVM and JDK, which don’t belong to the initial context, are excellent topic discriminators, as we expected. Incremental search methods are useful for collecting information from diverse information sources. The incremental identification of context-specific terms can guide the search process through huge repositories of potentially useful material, helping to filter out irrelevant content.
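The equations themselves are not reproduced in these notes, so the following is only a sketch under an assumed formulation: the topic discriminating power of a term for the context is taken to be the sum, over the other documents that contain the term, of their cosine similarity to the context, divided by the number of documents that contain the term. The formula, the function names and the matrix are all hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def topic_discriminating_power(H, context_row=0):
    """Assumed form: similarity-weighted share of the documents containing a term."""
    sims = [cosine(H[context_row], row) for row in H]
    scores = []
    for j in range(len(H[0])):
        holders = [k for k, row in enumerate(H) if row[j] > 0]
        if not holders:
            scores.append(0.0)
            continue
        # Similarity to the context of every *other* document containing term j.
        total = sum(sims[k] for k in holders if k != context_row)
        scores.append(total / len(holders))
    return scores

vocabulary = ["java", "machine", "virtual", "coffee", "island", "jvm", "jdk"]
H = [
    [2, 1, 1, 0, 0, 0, 0],  # initial context (row 0)
    [2, 1, 1, 0, 0, 1, 1],  # page about the JVM
    [1, 1, 0, 0, 0, 1, 1],  # page about the JDK
    [1, 0, 0, 2, 1, 0, 0],  # page about the island
    [1, 0, 0, 1, 2, 0, 0],  # page about coffee
]
scores = dict(zip(vocabulary, topic_discriminating_power(H)))
for term, value in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{term:8s} {value:.3f}")
```

In this toy matrix jvm and jdk never occur in the context, yet they score highest because they occur only in pages that are similar to it. The term java occurs everywhere, including the off-topic pages, so it discriminates less well.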
  19. Our proposal is to approximate the terms' descriptive and discriminating power for the thematic context under analysis, with the purpose of generating good queries. Our approach adapts the typical relevance feedback mechanism to account for a thematic context as follows. First, we extract terms from the user context. With these terms we form queries, and the system returns an initial set of results. From the obtained results and the context, the lists of descriptors and discriminators are built. These steps are repeated until no improvements are observed. Then the context characterization is updated with the newly learned material and the process starts again.
Step 1: A query is formulated based on the context C.
Step 2: The system returns an initial set of results.
Step 3: Repeat for at least v iterations or until no improvements are registered:
Step 3.1: A relevance assessment on the returned results is issued based on C.
Step 3.2: After a certain number of trials and depending on the relevance assessments, the system computes a better representation of the thematic context (phase change).
Step 3.3: The system formulates new queries and returns a revised set of results.
In order to learn better characterizations of the thematic context, the system undergoes a series of phases. At the end of each phase, the context characterization is updated with the newly learned material. Each phase evolves through a sequence of trials, where each trial consists of the formulation of a set of queries, the analysis of the retrieved results, the adjustment of the terms' weights, and the discovery of new potentially useful terms. To form queries during phase i we implemented a roulette selection mechanism, where the probability of choosing a particular term t to form a query is proportional to the weight of the term at phase i. Roulette selection is a technique typically used by genetic algorithms to choose potentially useful solutions for recombination, where the fitness level is used to associate a probability of selection. This approach resulted in a non-deterministic exploration of the term space that favored the fittest terms. The system monitors the effectiveness achieved at each iteration. In our approach we use novelty-driven similarity as an estimate of the retrieval effectiveness (we’ll explain it later). If after a number of trials the retrieval effectiveness has not crossed a given threshold (i.e., no significant improvements are observed after a certain number of trials), the system forces a phase change to explore new, potentially useful regions of the vocabulary landscape. A phase change can be regarded as a vocabulary leap, which can be thought of as a significant transformation (typically an improvement) of the context characterization.
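The roulette selection step can be sketched as follows. This is a generic fitness-proportionate selection, not our actual implementation. The term weights are made up, and drawing distinct terms without replacement is an assumption of the sketch.

```python
import random

def roulette_select(weights, n_terms, rng):
    """Draw n_terms distinct terms, each draw proportional to the remaining weights.

    weights maps term -> non-negative weight (hypothetical values).
    """
    pool = dict(weights)
    chosen = []
    for _ in range(min(n_terms, len(pool))):
        total = sum(pool.values())
        spin = rng.uniform(0, total)  # where the roulette ball lands
        cumulative = 0.0
        for term, weight in pool.items():
            cumulative += weight
            if spin <= cumulative:
                break
        chosen.append(term)
        del pool[term]  # without replacement: a query has no repeated terms
    return chosen

term_weights = {"java": 0.9, "machine": 0.7, "jvm": 0.8, "jdk": 0.6, "coffee": 0.05}
rng = random.Random(42)  # fixed seed only to make the sketch reproducible
query = roulette_select(term_weights, 3, rng)
print(query)
```

Heavier terms are chosen more often, but a low-weight term such as coffee still has a small chance of entering a query, which is what keeps the exploration of the term space non-deterministic.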
  20. Now we’ll see an evaluation of the proposed solution.
  21. We compare the proposed method against two other methods. The first is a baseline that submits queries directly from the thematic context and doesn’t apply any refinement mechanism. The second is the Bo1-DFR method, which is based on the well-known Rocchio method. To perform our tests we used 448 topics from the Open Directory Project (ODP). A number of constraints were imposed on this selection to ensure the quality of our test set: the topics were selected from the third level of the ODP taxonomy, the minimum size for each selected topic was 100 pages, and the language was restricted to English. For each topic we collected all of its URLs as well as those in its subtopics. The total number of collected pages was more than 350,000. In our tests we used the ODP description of each selected topic to create an initial context description. In order to compare the implemented methods we used three measures of query performance. Novelty-driven similarity is based on cosine similarity but disregards the terms that form the query, overcoming the bias introduced by those terms and favoring the exploration of new material. Precision measures the fraction of retrieved documents that are known to be relevant; the relevant set for each analyzed topic was taken to be the collection of its URLs together with those in its subtopics. Semantic precision is a measure that takes inter-topic similarity into account, because other topics in the ontology could be semantically similar (and therefore partially relevant) to the topic of the given context. To compute it we used a semantic similarity measure for generalized ontologies proposed in a previous work.
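Novelty-driven similarity can be sketched as a cosine similarity computed after dropping the query's terms. Removing those terms from both the context vector and the document vector is this sketch's reading of "disregards the terms that form the query". The vectors and the names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term -> weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def novelty_driven_similarity(query_terms, context, doc):
    """Cosine similarity with the query's own terms removed from both vectors."""
    q = set(query_terms)
    ctx = {t: w for t, w in context.items() if t not in q}
    d = {t: w for t, w in doc.items() if t not in q}
    return cosine(ctx, d)

context = {"java": 4, "machine": 2, "virtual": 1, "jvm": 1, "bytecode": 1}
query = ["java", "machine"]
echo_page = {"java": 5, "machine": 1}                # only repeats the query terms
novel_page = {"java": 2, "jvm": 2, "bytecode": 1}    # brings related vocabulary

plain_echo = cosine(context, echo_page)
novel_echo = novelty_driven_similarity(query, context, echo_page)
novel_new = novelty_driven_similarity(query, context, novel_page)
print(round(plain_echo, 3), round(novel_echo, 3), round(novel_new, 3))
```

The page that merely echoes the query looks very similar to the context under plain cosine similarity, but scores zero once the query terms are disregarded. The page that contributes related vocabulary keeps a high score.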
  22. As an example of the algorithm’s behaviour, we show here its evolution on a representative topic. This chart shows how the novelty-driven similarity evolves over the different execution steps.
  23. Now, in this chart we see the novelty-driven similarity again, this time as a comparison between methods. Each of the 448 topics is represented by a point. It’s interesting to note that in all the tested cases the incremental method was superior to the other two methods.
  24. Here, we see the comparison for the Precision metric. In this case the incremental method was strictly superior for 66.96% of the evaluated topics.
  25. Finally, we see the Semantic Precision metric. Here the incremental method was superior in 65.18% of the topics.
  26. The vocabulary problem is a main challenge in human-system communication. In this work we propose a solution to the semantic sensitivity problem, that is, the limitation that arises when documents with similar context but different vocabulary fail to be associated, resulting in a false negative match. Our method operates by incrementally learning better vocabularies from a large external corpus such as the Web. We have shown that by implementing an incremental context refinement method we can perform better than both a baseline method, which submits queries directly from the initial context, and the Bo1-DFR method, which doesn’t refine queries based on context. This points to the usefulness of simultaneously taking advantage of the terms in the current thematic context and an external corpus to learn better vocabularies and to automatically tune queries.