This document outlines a presentation on web query expansion based on association rules mining with eHownet and a Google Chrome extension. It introduces the background and the purpose, which is to reduce the harm caused by misused Chinese synonyms in queries by suggesting better keywords. The related works section discusses eHownet and the Apriori algorithm. The system architecture and experimental results are also outlined.
Web query expansion based on association rules mining with eHownet and Google Chrome extension (release)
1. Web query expansion based on
association rules mining with
eHownet and Google Chrome
extension
楊曜年 Paul Yang
2. Outline
Introduction
◦ Background
◦ Purpose
Related works
◦ eHownet
◦ Apriori algorithm
◦ Relevant research
Apriori-based query expansion for Chinese IR
◦ System architecture
◦ Chinese Word Segmentation / Feature selection
◦ Query Expansion by eHownet
◦ Apriori-based noise word filter
Experimental results
Conclusions and future works
4. Background
According to marketing survey reports (中文搜索引擎存在问题 [Problems with Chinese Search Engines], 2009, 北京正望咨询; 搜索引擎市場調查報告專題 [Search Engine Market Survey Report], 新浪網):
On Google and Baidu, 17% and 24% of users, respectively, can't find
the web pages they want
58.6% of users check only the first few pages and skip
the later pages
50% of users have little or no background knowledge
of the topic they are about to query
5. Background - cont
Why do search engines fail to find relevant documents?
• Users do not give a sufficient number of
keywords
Ex. Query: T1; expects to see: T1 + T2 + T3
• Users do not give good keywords
◦ Vocabulary gaps
◦ Lack of domain knowledge
Ex. Query: T1 + T2; expects to see: T3 + T4
6. Background - cont
Users, particularly children, may run into this problem
on Google: misusing Chinese synonyms in a search
query lowers precision
Ex.
The user wants to find “深夜食堂” but misuses “深夜酒家” as the query
7. Background - cont
This happens even with a search engine like Google, which
claims to use many text mining techniques!
8. Background - cont
The actual test on Google (10 results per page, 4 pages in total)
Ex 1
• The user wants to find “隱形飛機” (stealth aircraft) but uses “私密飛行器” as the query;
precision reaches only 5% (2/40)
• Judged only on whether the user gets a relevant result on the first page,
precision is 0% (0/10)
Ex 2
• The user wants to find “深夜食堂” but uses “深夜酒家” as the query; precision
reaches only 4%
• NONE of the related results appear on the first page: 0% (0/10)
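The precision figures above are plain ratios of relevant results to retrieved results. The sketch below reproduces the Ex 1 arithmetic; the positions of the two relevant hits are made up for illustration, since the slide only says that 2 of 40 results were relevant and none were on the first page.

```python
def precision(judgments):
    """Fraction of retrieved results judged relevant (0.0 for an empty list)."""
    return sum(judgments) / len(judgments) if judgments else 0.0

# 4 result pages x 10 results. The two relevant positions are hypothetical;
# only the counts (2 of 40, none on page 1) come from the slide.
judgments = [False] * 40
judgments[17] = True
judgments[31] = True

overall = precision(judgments)          # all 4 pages
first_page = precision(judgments[:10])  # first page only

print(f"overall: {overall:.0%}, first page: {first_page:.0%}")
```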
9. Background - cont
Solutions to improve this problem:
Global methods
◦ Query expansion/reformulation
Thesauri (ex. WordNet)
Automatic thesaurus generation
Local methods
◦ Relevance feedback
◦ Pseudo relevance feedback
10. Background - cont
Based on our observation, compared with the English
WordNet (109,000 synonym sets), the Chinese
WordNet provides insufficient information for query
expansion.
11. Background - cont
Another major problem arises if we use a thesaurus like
WordNet for query expansion:
“too many noise words that cannot be found in the search engine”
Ex. “私密飛行器” as an input for query expansion
After expansion:
(a1, a2, a3, … + b1, b2, …, b5)
秘密 鬼祟 暗地 隱蔽 隱形 私密 隱秘
噴氣機 座機 飛行器 飛機
13. Purpose
The idea of this paper is to:
Reduce the misuse of Chinese synonyms by giving users
suggested keywords through a Google Chrome
browser extension. It uses CKIP’s eHownet for query
expansion and the data mining algorithm Apriori to analyze
the retrieved web pages, deriving association rules that
filter out noise words and improve overall precision.
15. Extended-HowNet Ontology (廣義知網知識本體)
E-HowNet is an entity-relation model for
lexical semantic representation extended
from HowNet.
{clothing|衣物} [衣衫]
– 鞋子|shoes [木屐, 木鞋,球鞋, 溜冰鞋, 靴子]
– 褲子|trousers [褲子, 運動褲]
– 內衣|underwear [內衣]
– 禮服|ceremonial robe/dress [禮服, 白紗, 婚紗]
16. The Apriori Algorithm
An influential algorithm for mining frequent itemsets for boolean
association rules.
Key Concepts :
• Frequent Itemsets: The sets of items that have minimum support
(denoted by Li for the i-th itemset).
• Apriori Property: Any subset of a frequent itemset must be frequent.
• Join Operation: To find Lk, a set of candidate k-itemsets is
generated by joining Lk-1 with itself.
(State University of New York, Stony Brook)
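The three key concepts above (minimum support, the Apriori property, and the join operation) can be sketched in a few lines. This is a minimal illustration, not the implementation used in the presentation; the function name `apriori`, the toy pages, and the 0.6 threshold are assumptions chosen for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Share of transactions that contain every item of the itemset
        return sum(itemset <= t for t in transactions) / n

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {c for c in current if support(c) >= min_support}  # L1
    frequent = {}
    k = 1
    while current:
        frequent.update({c: support(c) for c in current})
        k += 1
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Apriori property: drop candidates with any infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        current = {c for c in candidates if support(c) >= min_support}  # Lk
    return frequent

# Hypothetical, simplified term sets for three web pages
pages = [
    {"隱形", "戰機", "飛機"},
    {"美", "部署", "隱形", "飛機"},
    {"隱形", "飛機", "雷達"},
]
frequent = apriori(pages, min_support=0.6)
for itemset, sup in sorted(frequent.items(), key=lambda kv: len(kv[0])):
    print(sorted(itemset), sup)
```

Because candidates are pruned before their support is counted, the number of passes over the transactions stays small when few itemsets are frequent.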
17. Relevant research: Web Query Expansion by WordNet
Zhiguo Gong, Chan Wa Cheang, and Leong Hou U, 2005
Aimed at Web image queries
Web query expansion using WordNet and
TSN (term semantic network) to extend the scope of the original query
TSN works as a keyword filter based on
Apriori
26. Query expansion through eHownet
A user may use the query “秘密噴氣機” for “隱形飛機”
• After Chinese word segmentation: 秘密(VH) 噴氣機(Na)
• Query eHowNet with “秘密” and “噴氣機”
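One way to picture the expansion step: look up each segmented word, then combine one alternative per segment into candidate queries. The sketch below assumes that combination is a simple Cartesian product, which the slides do not state explicitly. The `SYNONYMS` dictionary is a hypothetical stand-in for eHownet lookups (no real eHownet API is called), filled with a few words from the deck's own example.

```python
from itertools import product

# Hypothetical stand-in for eHownet lookups; not a real eHownet API.
SYNONYMS = {
    "秘密": ["秘密", "隱蔽", "隱形", "私密"],
    "噴氣機": ["噴氣機", "飛行器", "飛機"],
}

def expand(segments, synonyms):
    """Build candidate queries by picking one alternative per segment."""
    # A segment with no dictionary entry is kept unchanged
    options = [synonyms.get(seg, [seg]) for seg in segments]
    return ["".join(combo) for combo in product(*options)]

candidates = expand(["秘密", "噴氣機"], SYNONYMS)
print(len(candidates), candidates)
```

The candidate list grows multiplicatively with the number of alternatives per segment, which is why a later noise filter is needed.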
27. Feature selection
• Use Jieba (结巴中文分词) with a Traditional Chinese (繁體) dictionary rather
than CKIP, for performance and ease of prototype
integration
• Extract only verbs, nouns, and adjectives, by POS tag, from
every sentence
Ex.
深夜/a 食堂/n 维基百科/nz 自由/a 的/uj 百科全书/nz
深夜/a 食堂/n 深夜/a 食堂/n 漫画/n 深夜/t 食堂/n 在线漫画/n 动漫/n 之家/r 漫画网/n
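The POS-based selection can be illustrated on text that is already tagged in the `word/tag` form shown above, so the sketch needs no segmenter. Treating every tag that starts with `n`, `v`, or `a` as a noun, verb, or adjective is an assumption about the tag set, and `select_features` is a hypothetical helper name.

```python
# Assumption: tags beginning with n, v, or a mark nouns, verbs, adjectives.
KEEP_PREFIXES = ("n", "v", "a")

def select_features(tagged_sentence):
    """Keep words whose POS tag marks a noun, verb, or adjective."""
    kept = []
    for token in tagged_sentence.split():
        word, _, tag = token.rpartition("/")
        # Tokens without a "/" yield an empty word and are skipped
        if word and tag.startswith(KEEP_PREFIXES):
            kept.append(word)
    return kept

first = select_features("深夜/a 食堂/n 维基百科/nz 自由/a 的/uj 百科全书/nz")
second = select_features("深夜/t 食堂/n 在线漫画/n 动漫/n 之家/r 漫画网/n")
print(first)
print(second)
```

In this sketch the particle 的/uj, the time word 深夜/t, and the pronoun-tagged 之家/r are dropped, while 维基百科/nz survives because its tag starts with `n`.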
28. Feature selection - cont
• Use TF-IDF to pick the words with high
values, which filters out some filler words (綴詞),
e.g. 年, 月, 日, 你, 我, 他, 段
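To see why TF-IDF suppresses filler words, consider a word that appears in every document: its inverse document frequency is log(1) = 0, so its score is zero. The sketch below uses the common tf × log(N/df) weighting, which may differ from the exact variant used in the prototype; the three token lists are invented for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document {term: tf * idf}, with idf = log(N / document frequency)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:  # each doc is assumed to be a non-empty token list
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n / df[t])
                       for t, c in tf.items()})
    return scores

# Hypothetical token lists; 年 occurs in every document
docs = [
    ["隱形", "飛機", "年"],
    ["深夜", "食堂", "年"],
    ["飢餓", "遊戲", "年"],
]
scores = tf_idf(docs)
# Keep only terms with a positive score in the first document
selected = [t for t, s in scores[0].items() if s > 0]
print(selected)
```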
29. Association rules from webpages using Apriori
Webpage A: 隱形 戰機 维基百科 自由 百科全书 维基 百科 飛機
Webpage B: 美 部署 隱形 飛機 對抗 中國
Webpage C: 隱形 飛機 是 降低 飛機 電 光 聲 可 探測 特徵 使 雷達 探測器
Crucial info (rules): 隱形 飛機
30. Noise word filter
Before the filtering process:
秘密 鬼祟 暗地 隱蔽 隱形 私密 隱秘
噴氣機 座機 飛行器 飛機
After the filtering process (min_support: 0.1 & min_confidence: 0.75):
隱形飛機 秘密飛機
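A rough sketch of how such a filter could work: each candidate query is treated as a rule between its two terms and kept only if the rule clears both thresholds on the retrieved pages. This is an interpretation of the slide, not the presentation's actual code; `filter_candidates`, the page contents, and the candidate list are all hypothetical, and only the two thresholds come from the slide.

```python
def filter_candidates(candidates, pages, min_support=0.1, min_confidence=0.75):
    """Keep candidate (a, b) pairs whose rule a -> b holds on the pages."""
    pages = [set(p) for p in pages]
    n = len(pages)
    kept = []
    for a, b in candidates:
        with_a = sum(a in p for p in pages)
        with_both = sum(a in p and b in p for p in pages)
        if with_a == 0:
            continue  # term never seen, so the rule cannot be evaluated
        support = with_both / n
        confidence = with_both / with_a
        if support >= min_support and confidence >= min_confidence:
            kept.append(a + b)
    return kept

# Hypothetical term sets for five retrieved pages
pages = [
    {"隱形", "戰機", "飛機"},
    {"美", "部署", "隱形", "飛機"},
    {"隱形", "飛機", "雷達"},
    {"秘密", "飛機"},
    {"鬼祟", "行為"},
]
candidates = [("隱形", "飛機"), ("秘密", "飛機"), ("鬼祟", "飛機"), ("隱形", "座機")]
suggestions = filter_candidates(candidates, pages)
print(suggestions)
```

With this toy data the two combinations that never co-occur on any page are dropped as noise.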
32. Experiment Setup
Simulate users who may use imprecise queries
Based on 4 topics (隱形飛機, 飢餓遊戲, 威力彩,
深夜食堂), download webpages from
Google (24 pages per possible query, 1,320
pages in total) and manually review all pages to compute
precision against the 4 concepts
Use precision alone to estimate performance,
with no recall or F-measure.
34. Experiment Setup - cont
The scenarios to validate:
Users have prior knowledge of the concept and
simply use an imprecise word as the query.
Users have little prior knowledge and cannot tell
which of our suggested terms is the right query
35. Experiment result – First
Hit rate: 100% (隱形飛機, 飢餓遊戲, 威力彩,
深夜食堂): the correct terms are included
after query expansion.
Possibly misused query → suggestions after Apriori filtering
• 神力彩 → 威力彩, 威力顏色, 神力色彩, 神力彩色, 威力彩色
• 挨餓遊憩 → 飢餓玩樂, 挨餓遊戲, 挨餓嬉鬧, 挨餓遊憩, 飢餓遊戲, 飢餓嬉遊
• 暗地座機 → 隱形飛機, 秘密飛機
• 深更酒家 → 夜深酒家, 深夜食堂, 深夜飲食店, 夜深食堂
36. Experiment result – Second
[Bar chart: average precision before and after query expansion for 隱形飛機, 深夜食堂, 飢餓遊戲, and 威力彩; y-axis from 0% to 60%]
38. Summary
For users with little prior knowledge, query expansion gives them
more options for choosing among and correcting an imprecise query.
Query expansion with the online dictionary eHownet plus the noise filter improves
average precision by around 20%
Improve the keyword set so that it covers many topics and concepts
Validate on the mined dataset with the full search engine
function, using precision/recall measures
Handle the case where users misuse spoken words (口語字) by adding other
dictionaries
Improve mining performance with other algorithms such as FP-growth
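Based on the above, we can reduce the problem to two cases, illustrated with the 3D vector space in Figure 4.
1. The user does not give the search engine enough keywords. Query = T1 + T2, ExpectToSee = T1 + T2 + T3. The search engine returns every page that satisfies T1 + T2, but the user expects results for T1 + T2 + T3. The vector area covered by the query is too large, so precision and recall are both very low.
2. The user gives Chinese keywords that are not precise enough. Query = T1 + T2, ExpectToSee = T3 + T4. The search engine returns pages that satisfy T1 + T2, but the user expects results for T3 + T4. The vector distance between the query and the relevant documents the user is after is too large, so precision and recall are both very low.
Causes: vocabulary gaps / diversity and vastness of the web; lack of domain knowledge.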
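Query understanding gets to the deeper meaning of the words you type, through techniques such as spelling correction and synonyms. In Google's case, even though its search algorithms advertise synonym correction, users who lack background knowledge or use Chinese vocabulary imprecisely may still misuse Chinese synonyms, and they often get results that are wrong or that match the query only in a few cases (http://www.google.com/intl/zh-Hant/insidesearch/howsearchworks/algorithms.html)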
Relevance feedback and query expansion
In most collections, the same concept may be referred to using different words. This issue, known as synonymy, has an impact on the recall of most information retrieval systems. For example, you would want a search for aircraft to match plane (but only for references to an airplane, not a woodworking plane), and for a search on thermodynamics to match references to heat in appropriate discussions. Users often attempt to address this problem themselves by manually refining a query, as was discussed in Section 1.4; in this chapter we discuss ways in which a system can help with query refinement, either fully automatically or with the user in the loop.
The methods for tackling this problem split into two major classes: global methods and local methods. Global methods are techniques for expanding or reformulating query terms independent of the query and results returned from it, so that changes in the query wording will cause the new query to match other semantically similar terms. Global methods include:
• Query expansion/reformulation with a thesaurus or WordNet (Section 9.2.2)
• Query expansion via automatic thesaurus generation (Section 9.2.3)
• Techniques like spelling correction (discussed in Chapter 3)
Local methods adjust a query relative to the documents that initially appear to match the query. The basic methods here are:
• Relevance feedback (Section 9.1)
• Pseudo relevance feedback, also known as blind relevance feedback (Section 9.1.6)
• (Global) indirect relevance feedback (Section 9.1.7)
Query refinement techniques such as query expansion, query suggestion, and relevance feedback improve ranking.
A thesaurus provides information on synonyms and semantically related words and phrases.
Example: physician
syn: croaker, doc, doctor, MD, medical, mediciner, medico, sawbones
rel: medic, general practitioner, surgeon,
In recent years, huge amount of information is posted on the Web and it continues to increase with an explosive speed. But we cannot access to the information or use it efficiently and effectively unless it is well organized and indexed. Many search engines have been created for this need in current years. Web users, however, usually submit only one single word as their queries on the Web [5], especially for a Web Image queries. It is even worse that the users’ query words may be quite different to the ones used in the documents in describing the same semantics. That means a gap exists between user’s query space and document representation space. This problem results in lower precisions and recalls of queries. The user may get an overwhelming but large percent of irrelevant documents in the result set. In fact, this is a tough problem in Web information retrieval. An effective method for solving the above problems is query expansion. In this paper, we provide a novel query expansion method based on the combination of WordNet [2], an online lexical system, and TSN, a term semantic network extracted from the collection. Our method has been employed in our Web image search system [4].

3.1 Keyword Expansion
The query keyword used by users is the most significant but not always sufficient in the query phase. For example, if a user query with “computer”, he only can get the object indexed by “computer”. We use WordNet and TSN to expand the query. With WordNet, we expand the query along three dimensions including hypernym, hyponymy and synonym relation [2]. The original query “computer”, for instance, may be expanded to include “client, server, website, etc.” In other words, with those expanded words together, the system could raise both the query precision and recall.

To extract TSN from the collection, we use a popular association mining algorithm, Apriori [12], to mine out the association rules between words. Here, we only consider one-to-one term relationship. Two functions, confidence and support, are used in describing word relations. We define confidence (conf) and support (sup) of term association $t_i \to t_j$ as follows. Let

$$D(t_i, t_j) = D(t_i) \cap D(t_j) \tag{1}$$

where $D(t_i)$ and $D(t_j)$ stand for the documents including term $t_i$ and $t_j$ respectively. Therefore, $D(t_i) \cap D(t_j)$ is the set of documents that include both $t_i$ and $t_j$. We define

$$\mathrm{Conf}_{t_i \to t_j} = \frac{\lVert D(t_i, t_j) \rVert}{\lVert D(t_i) \rVert} \tag{2}$$

where $\lVert D(t_i, t_j) \rVert$ stands for the total number of documents that include both term $t_i$ and $t_j$, and $\lVert D(t_i) \rVert$ stands for the total number of documents that include $t_i$, and

$$\mathrm{Sup}_{t_i \to t_j} = \frac{\lVert D(t_i, t_j) \rVert}{D} \tag{3}$$

where $D$ stands for the number of document in the database.

Those relationships are extracted and represented with two matrixes, we could use them to expand the query keywords. For example, the keyword “computer” has the highest confidence and support with the words “desktop, series, price, driver…etc” which are not described in WordNet but can be used to expand the original query.
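The confidence and support definitions quoted above translate directly into set operations on document ids. The sketch below is only an illustration of those two ratios; the helper names and the four toy documents are assumptions, not data from the cited paper.

```python
def doc_set(term, docs):
    """D(t): ids of the documents that contain the term."""
    return {i for i, d in enumerate(docs) if term in d}

def conf(ti, tj, docs):
    """Confidence of ti -> tj: |D(ti) & D(tj)| / |D(ti)|."""
    d_i = doc_set(ti, docs)
    return len(d_i & doc_set(tj, docs)) / len(d_i) if d_i else 0.0

def sup(ti, tj, docs):
    """Support of ti -> tj: |D(ti) & D(tj)| / number of documents."""
    return len(doc_set(ti, docs) & doc_set(tj, docs)) / len(docs)

# Hypothetical collection of four documents, each a set of terms
docs = [
    {"computer", "desktop", "price"},
    {"computer", "desktop"},
    {"computer", "driver"},
    {"price", "series"},
]
print(conf("computer", "desktop", docs), sup("computer", "desktop", docs))
print(conf("desktop", "computer", docs))
```

Confidence is directional while support is symmetric, which is why "desktop" predicts "computer" more strongly than the reverse in this toy data.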