Web query expansion based on association rules mining with E-HowNet and a Google Chrome extension (release)
楊曜年 Paul Yang

Outline
• Introduction
◦ Background
◦ Purpose
• Related works
◦ E-HowNet
◦ Apriori algorithm
◦ Relevant research
• Apriori-based query expansion for Chinese IR
◦ System architecture
◦ Chinese word segmentation / feature selection
◦ Query expansion by E-HowNet
◦ Apriori-based noise word filter
• Experimental results
• Conclusions and future works

Background
According to market survey reports (中文搜索引擎存在问题 [Problems with Chinese search engines], 2009, 北京正望咨询; 搜索引擎市場調查報告專題 [Search engine market survey report], 新浪網 [Sina]):
• On Google and Baidu, 17% and 24% of users, respectively, cannot find the web pages they want
• 58.6% of users check only the first few result pages and skip the later ones
• 50% of users have little or no background knowledge of the topic they are going to query

Background - cont
Why do search engines fail to find relevant documents?
• Users do not give a sufficient number of keywords
Ex. T1 (query) → T1 + T2 + T3 (expect to see)
• Users do not give good keywords
◦ Vocabulary gaps
◦ Lack of domain knowledge
Ex. T1 + T2 (query) → T3 + T4 (expect to see)

Background - cont
Users, particularly children, may run into this problem on Google: misusing a Chinese synonym in the search query lowers precision.
Ex.
The user wants to find “深夜食堂” (Midnight Diner) but misuses “深夜酒家” as the query.

Background - cont
This happens even on a search engine like Google, which claims to use many text mining techniques.

Background - cont
An actual test on Google (10 results per page, 4 pages in total)
Ex 1
• The user wants to find “隱形飛機” (stealth aircraft) but uses “私密飛行器” as the query; the precision is only 5% (2/40)
• Evaluated on whether the user gets a relevant result on the first page, the precision is 0% (0/10)
Ex 2
• The user wants to find “深夜食堂” but uses “深夜酒家” as the query; the precision is only 4%
• None of the related results appear on the first page: 0% (0/10)

Background - cont
Approaches to this problem:
• Global methods
◦ Query expansion/reformulation
  ▪ Thesauri (e.g. WordNet)
  ▪ Automatic thesaurus generation
• Local methods
◦ Relevance feedback
◦ Pseudo relevance feedback

Background - cont
Based on our observation, Chinese WordNet provides insufficient information for query expansion compared with English WordNet (109,000 synonym sets).

Background - cont
Another major problem arises if we use a thesaurus like WordNet for query expansion:
“too many noise words that cannot be found in the search engine”
Ex. “私密飛行器” as the input for query expansion
After expansion (a1, a2, a3, … + b1, b2, …, b5):
a: 秘密, 鬼祟, 暗地, 隱蔽, 隱形, 私密, 隱秘
b: 噴氣機, 座機, 飛行器, 飛機

Background - cont
Google Chrome Extension

Purpose
The idea of this paper is to reduce the impact of misused Chinese synonyms by suggesting keywords to the user through a Google Chrome browser extension. The extension uses CKIP’s E-HowNet for query expansion and the Apriori data mining algorithm to analyze the retrieved web pages; the resulting association rules filter out noise words and improve the overall precision.

E-HowNet Ontology (廣義知網知識本體, Extended-HowNet Ontology)
• E-HowNet is an entity-relation model for lexical semantic representation, extended from HowNet.
• {clothing|衣物} [衣衫]
 – 鞋子|shoes [木屐, 木鞋, 球鞋, 溜冰鞋, 靴子]
 – 褲子|trousers [褲子, 運動褲]
 – 內衣|underwear [內衣]
 – 禮服|ceremonial robe/dress [禮服, 白紗, 婚紗]

The Apriori Algorithm
An influential algorithm for mining frequent itemsets for Boolean association rules.
Key concepts:
• Frequent itemsets: the sets of items that meet the minimum support (denoted by Li for the i-itemsets).
• Apriori property: any subset of a frequent itemset must also be frequent.
• Join operation: to find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
State University of New York, Stony Brook
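The key concepts above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function name `apriori`, the toy `pages` data, and the 0.5 threshold are all assumptions chosen to keep the example small.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    # L1: frequent 1-itemsets.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        prev = list(current)
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori property): all (k-1)-subsets must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent

# Toy transactions (hypothetical): one set of feature words per web page.
pages = [
    {"隱形", "飛機", "雷達"},
    {"隱形", "飛機", "美國"},
    {"隱形", "飛機"},
    {"飛機", "雷達"},
]
frequent = apriori(pages, min_support=0.5)
```

With this toy data, {隱形, 飛機} is frequent (support 0.75), while the 3-itemset {隱形, 飛機, 雷達} is pruned without being counted because its subset {隱形, 雷達} is not frequent.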

Relevant research
Web Query Expansion by WordNet
Zhiguo Gong, Chan Wa Cheang, and Leong Hou U, 2005
• Targets web image queries
• Expands web queries using WordNet and TSN (Term Semantic Network) to extend the scope of the original query
• TSN works as a keyword filter based on Apriori

Apriori-based query expansion for Chinese IR

Query expansion through E-HowNet
A user may query “秘密噴氣機” when looking for “隱形飛機”
• After Chinese word segmentation: 秘密(VH) 噴氣機(Na)
• Query E-HowNet with “秘密” and “噴氣機”
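One way to picture the expansion step is as a Cartesian product of the synonyms of each segmented word. The sketch below assumes this reading; the `SYNONYMS` table is a hypothetical stand-in for an E-HowNet lookup (no real E-HowNet API is used), and `expand_query` is an illustrative name.

```python
from itertools import product

# Hypothetical synonym table standing in for an E-HowNet lookup.
SYNONYMS = {
    "秘密": ["秘密", "隱形", "隱秘", "私密"],
    "噴氣機": ["噴氣機", "飛機", "飛行器", "座機"],
}

def expand_query(segments, synonyms):
    """Combine the synonyms of each segment into candidate queries."""
    # A segment without an entry is kept unchanged.
    options = [synonyms.get(seg, [seg]) for seg in segments]
    return ["".join(combo) for combo in product(*options)]

candidates = expand_query(["秘密", "噴氣機"], SYNONYMS)
```

Here 4 x 4 = 16 candidates are produced, including the intended “隱形飛機”. The product grows quickly with the number of synonyms, which is why a noise filter is needed afterwards.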

Feature selection
• Use Jieba (结巴中文分词) with a Traditional Chinese dictionary rather than CKIP, for performance and prototype-integration reasons
• Extract only words tagged as verb, noun, or adjective from every sentence
Ex.
深夜/a 食堂/n 维基百科/nz 自由/a 的/uj 百科全书/nz
深夜/a 食堂/n 深夜/a 食堂/n 漫画/n 深夜/t 食堂/n 在线漫画/n 动漫/n 之家/r 漫画网/n
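The POS filtering can be illustrated on the tagged output itself. The sketch below parses text in the `word/tag` form shown in the example and keeps tags starting with `n`, `v`, or `a`. It assumes these prefixes cover the noun, verb, and adjective tags of the segmenter; `keep_content_words` is an illustrative helper, not part of Jieba.

```python
def keep_content_words(tagged_text, keep_prefixes=("n", "v", "a")):
    """Keep words whose POS tag starts with one of keep_prefixes."""
    words = []
    for token in tagged_text.split():
        # Split on the last "/" so the word itself may contain slashes.
        word, sep, tag = token.rpartition("/")
        if sep and tag.startswith(keep_prefixes):
            words.append(word)
    return words

tagged = "深夜/a 食堂/n 维基百科/nz 自由/a 的/uj 百科全书/nz"
features = keep_content_words(tagged)
```

For this sentence the particle 的/uj is dropped and the remaining five words are kept as features.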

Feature selection - cont
• Use TF-IDF to pick the words with high values, which filters out some filler words (綴詞)
Ex. 年, 月, 日, 你, 我, 他, 段 (year, month, day, you, I, he, paragraph)
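A small sketch shows why TF-IDF suppresses such filler words: a word that occurs in every page gets an inverse document frequency of zero. The exact TF-IDF variant used in this work is not stated, so the plain `tf * log(N / df)` form and the toy `docs` below are assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of non-empty token lists. Returns one {term: score} dict per doc."""
    n = len(docs)
    # Document frequency: number of docs that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return scores

# Toy pages (hypothetical); "年" appears in all of them.
docs = [
    ["隱形", "飛機", "年"],
    ["深夜", "食堂", "年"],
    ["威力彩", "開獎", "年"],
]
scores = tf_idf(docs)
```

In this toy data “年” scores 0 in every page, while topic words such as “隱形” keep a positive score and survive the selection.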

Association rules from web pages using Apriori
• Webpage A: 隱形戰機维基百科自由百科全书维基百科飛機
• Webpage B: 美部署隱形飛機對抗中國
• Webpage C: 隱形飛機是降低飛機電光聲可探測特徵使雷達探測器
• Crucial info (rules): 隱形飛機
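For reference, the two standard measures behind association rules can be written as follows. The symbols are assumptions introduced here: T is the set of transactions (presumably one set of feature words per retrieved web page), and X, Y are disjoint sets of words.

```latex
\mathrm{supp}(X) = \frac{\left|\{\, t \in T : X \subseteq t \,\}\right|}{|T|},
\qquad
\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}
```

A rule X ⇒ Y is kept only if its support and confidence both reach the chosen minimum thresholds.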

Noise word filter
Before the filtering process:
秘密, 鬼祟, 暗地, 隱蔽, 隱形, 私密, 隱秘 + 噴氣機, 座機, 飛行器, 飛機
After the filtering process (min_support: 0.1, min_confidence: 0.75):
隱形飛機, 秘密飛機
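The filter can be sketched as a support and confidence check on each expanded word pair against the retrieved pages. How transactions and rules are actually built in this work is not specified, so this is one plausible reading: each page is a set of feature words, and a candidate is kept when the rule modifier ⇒ noun passes both thresholds. The `pages` data and `filter_noise` are hypothetical.

```python
def filter_noise(candidates, pages, min_support=0.1, min_confidence=0.75):
    """candidates: (modifier, noun) pairs; pages: list of feature-word sets."""
    n = len(pages)
    kept = []
    for modifier, noun in candidates:
        with_modifier = sum(1 for p in pages if modifier in p)
        with_both = sum(1 for p in pages if modifier in p and noun in p)
        support = with_both / n
        # Confidence of the rule modifier => noun; 0 if the modifier never occurs.
        confidence = with_both / with_modifier if with_modifier else 0.0
        if support >= min_support and confidence >= min_confidence:
            kept.append(modifier + noun)
    return kept

# Toy pages (hypothetical feature-word sets).
pages = [
    {"隱形", "飛機", "雷達"},
    {"隱形", "飛機", "軍事"},
    {"秘密", "飛機", "軍事"},
    {"私密", "日記"},
]
candidates = [("隱形", "飛機"), ("秘密", "飛機"), ("私密", "飛機"), ("鬼祟", "飛機")]
suggestions = filter_noise(candidates, pages)
```

With this toy data only “隱形飛機” and “秘密飛機” remain: “私密” never co-occurs with “飛機”, and “鬼祟” does not occur at all.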

Experiment Setup
• Simulate users who issue imprecise queries
• Based on 4 topics (隱形飛機, 飢餓遊戲, 威力彩, 深夜食堂), download web pages from Google (24 pages per possible query, 1320 pages in total) and manually review every page to compute the precision against the 4 concepts
• Use precision only, without recall or F-measure, to estimate performance
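The precision measure is simply the share of retrieved pages that are relevant. A minimal sketch follows; the helper names and the sample counts passed to `mean_precision` are illustrative, not experimental data.

```python
def precision(relevant_retrieved, retrieved):
    """Fraction of retrieved pages judged relevant (0.0 if nothing was retrieved)."""
    if retrieved == 0:
        return 0.0
    return relevant_retrieved / retrieved

def mean_precision(counts):
    """counts: list of (relevant, retrieved) pairs, one per query."""
    return sum(precision(r, n) for r, n in counts) / len(counts)

# 2 relevant pages among 40 retrieved, as in the 私密飛行器 example.
p = precision(2, 40)
# Illustrative counts for three queries.
avg = mean_precision([(40, 40), (2, 40), (0, 40)])
```

Here `p` is 0.05 (5%). Averaging the per-query values gives a single figure per topic.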

Experiment Setup – cont
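Topic the user intends to search (使用者預查詢的主題) | 深夜食堂 | 飢餓遊戲 | 威力彩 | 隱形飛機
Query terms the user may misuse (使用者可能誤用的查詢字):
深夜酒家 | 飢餓玩樂 | 神力色彩 | 隱形飛行器
深夜飯館 | 飢餓遊憩 | 神力顏色 | 隱形座機
深夜飲食店 | 飢餓嬉遊 | 神力彩 | 隱秘飛機
夜深食堂 | 飢餓嬉鬧 | 威力彩色 | 隱秘飛行器
夜深酒家 | 挨餓遊戲 | 威力色彩 | 隱秘座機
夜深飯館 | 挨餓玩樂 | 威力顏色 | 暗地飛機
夜深飲食店 | 挨餓遊憩 | 神力彩色 | 暗地飛行器
半夜食堂 | 挨餓嬉遊 | | 暗地座機
半夜酒家 | 挨餓嬉鬧 | | 私密飛機
半夜飯館 | 口燥遊戲 | | 私密飛行器
半夜飲食店 | 口燥玩樂 | | 私密座機
深更食堂 | 口燥遊憩 | | 秘密飛機
深更酒家 | 口燥嬉遊 | | 秘密飛行器
深更飯館 | 口燥嬉鬧 | | 秘密座機
深更飲食店 | 焦渴遊戲 | | 鬼祟飛機
 | 焦渴玩樂 | | 鬼祟飛行器
 | 焦渴遊憩 | | 鬼祟座機
 | 焦渴嬉遊 | |
 | 焦渴嬉鬧 | |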

Experiment Setup - cont
The scenarios to validate:
• Users have prior knowledge of the concept but use an imprecise word as the query.
• Users have little prior knowledge and cannot determine the query from our suggested terms.

Experiment result – First
Hit rate: 100% (隱形飛機, 飢餓遊戲, 威力彩, 深夜食堂); the correct terms are included after query expansion.
Possibly misused query (可能誤用的查詢字) | Suggested results after Apriori filtering (Apriori 過濾後的建議結果)
神力彩 | 威力彩, 威力顏色, 神力色彩, 神力彩色, 威力彩色
挨餓遊憩 | 飢餓玩樂, 挨餓遊戲, 挨餓嬉鬧, 挨餓遊憩, 飢餓遊戲, 飢餓嬉遊
暗地座機 | 隱形飛機, 秘密飛機
深更酒家 | 夜深酒家, 深夜食堂, 深夜飲食店, 夜深食堂

Experiment result – Second
[Bar chart: average precision (0% to 60%) before and after query expansion for 隱形飛機, 深夜食堂, 飢餓遊戲, 威力彩]

Summary
• For users with little prior knowledge, query expansion offers more options for choosing and modifying an imprecise query.
• Query expansion with the online dictionary E-HowNet plus the noise filter improves the average precision by around 20%.
Future work:
• Improve handling of keyword sets that contain many topics and concepts
• Use the validated data-mining dataset with the full search engine function to validate based on precision/recall measures
• Handle cases where users misuse spoken words (口語字) by adding other dictionaries
• Improve the mining performance with other algorithms such as FP-growth

Problem observations
Topic: 深夜食堂
Related topics (關聯主題): 戲劇, 漫畫, 小說, 安倍夜郎, FOOD
E-HowNet synonym combinations (同義字組合):
深夜, 深更, 更深, 夜深, 半夜, 半夜三更, 三更半夜
食堂, 飲食店, 飯館, 酒家, 啤酒屋
8 x 5 = 40

Possible query | Google Search result | Occurs in first page (within 10 links) | Hit rate % | Chinese word segmentation (中文斷詞)
深夜食堂 | 40/40 | 1 | 100% | 深夜(Nd) 食堂(Nc)
夜深食堂 | 40/40 | 1 | 100% |
半夜食堂 | 34/40 | 1 | 85% |
三更半夜食堂 | 28/40 | 1 | 70% |
深夜飯館 | 26/40 | 1 | 65% |
深更食堂 | 18/40 (found in 10) | 1 | 45% |
深夜啤酒屋 | 10/40 | 1 | 25% |
更深食堂 | 10/40 | 1 | 25% |
深夜酒家 | 2/40 | 0 | 5% |
深夜飲食店 | 1/40 | 0 | 2% |
半夜飯館 | 1/40 | 0 | 2.00% |
深更飲食店 | 0/40 | 0 | 0.00% |
深更啤酒屋 | 0/40 | 0 | 0.00% |
更深酒家 | 0/40 | 0 | 0.00% |
夜深酒家 | 0/40 | 0 | 0.00% |
夜深啤酒屋 | 0/40 | 0 | 0.00% |
三更半夜飲食店 | 0/40 | 0 | 0% |
半夜三更酒家 | 0/40 | 0 | 0% |
半夜飲食店 | 0/40 | 0 | 0% |
AVG | | 42% | 28% |
Queries with over 60% hit rate (high coverage): 5/19 (26.30%)

EXCEPTION
半夜居酒屋漫畫 | 6/40 | | 15.00% |

Key index in the highest weight
食堂 | 5/40 | Y | 13% |
深夜 | 16/40 | Y | 40% |

Problem observations - cont

Topic: 隱形飛機
Related topics (關聯主題): 隐形軍事科技, 隱形UFO, 无人驾驶
E-HowNet synonym combinations (同義字組合):
隱秘, 隱蔽, 秘密, 鬼祟, 暗地, 私密, 鬼鬼祟祟
航空器, 座機, 噴氣機, 航空器, 飛行器
7 x 5 = 35

Possible query word | Google Search result (All) | Google Search result (軍事, military) | Occurs in first page (within 10 links) | Hit rate % (All) | Hit rate % (specific) | Chinese word segmentation (中文斷詞)
隱形飛機 | 40/40 | 40/40 | 1 | 100% | 100% | 隱形(VH) 飛機(Na)
隱形飛行器 | 39/40 | 39/40 | 1 | 97.50% | 97.50% | 隱形(VH) 飛行器(Na)
隱形噴氣機 | 35/40 | 35/40 | 1 | 87.50% | 87.50% | 隱形(VH) 噴氣機(Na)
隱形噴氣機 | 34/40 | 34/40 | 1 | 85% | 85% | 隱形(VH) 噴氣機(Na)
隱形座機 | 32/40 | 32/40 | 1 | 80% | 80% | 隱形(VH) 座機(Na)
隱秘飛機 | 20/40 (重複一樣的文章 [same article repeated]: 17/20) | [truncated] 文章: 17/20) | 1 | 50.00% | 50.00% | 隱秘(VH) 飛機(Na)
隱秘噴氣機 | [truncated] 文章: 16/19) | [truncated] 文章: 16/19) | 0 | 48% | 48% | 秘密(VH) 噴氣機(Na)
暗地飛行器 | 7/40 | 1/40 | 0 | 17.50% | 2.00% | 暗地(D) 飛行器(Na)
隱蔽飛機 | 6/40 | 4/40 | 0 | 15.00% | 10.00% | 隱蔽(VH) 飛機(Na)
私密飛行器 | 5/40 | 2/40 | 0 | 12.50% | 5.00% | 私密(VH) 飛行器(Na)
秘密飛機 | 6/40, 无人驾驶, 9 | 2/40 | 0 | 7.50% | 5.00% | 秘密(VH) 飛機(Na)
暗地噴氣機 | 3/40 | 2/40 | 1 | 7.50% | 5.00% | 暗地(D) 噴氣機(Na)
鬼祟飛機 | 0/40 | 0/40 | 0 | 0.00% | 0.00% | 鬼祟(VH) 飛機(Na)
私密飛機 | 0/40 | 0/40 | 0 | 0.00% | 0.00% | 私密(VH) 飛機(Na)
鬼鬼祟祟飛機 | 0/40 | 0/40 | 0 | 0.00% | 0.00% | 鬼鬼祟祟(VH) 飛機(Na)
隱蔽座機 | 0/40 | 0/40 | 0 | 0.00% | 0.00% | 隱蔽(VH) 座機(Na)
鬼祟噴氣機 | 0/40 | 0/40 | 0 | 0.00% | 0.00% | 鬼祟(VH) 噴氣機(Na)
Avg | | | 41% | 36% | 34% |
Queries with over 60% hit rate (high coverage): 5/17 (29.00% All, 29.00% specific)

EXCEPTION
看不見的飛機 | 3/40 | | | 7.50% | | 看(VC) 不見(VH) 的(DE) 飛機(Na)

Key index in the highest weight
隱形 | 1/20 | Y | 5%
飛機 | 0/40 | | 0%

[Chart: topic coverage % and first-page occurrence (Yes/No) for the 19 深夜食堂 query variants, from 深夜食堂 to 半夜飲食店]
