Web query expansion based on
association rules mining with
eHownet and Google chrome
extension
楊曜年 Paul Yang
Outline
• Introduction
◦ Background
◦ Purpose
• Related works
◦ E-HowNet
◦ Apriori algorithm
◦ Relevant research
• Apriori-based query expansion for Chinese IR
◦ System architecture
◦ Chinese word segmentation / feature selection
◦ Query expansion by E-HowNet
◦ Apriori-based noise word filter
• Experimental results
• Conclusions and future work
Introduction
Background
According to market survey reports (中文搜索引擎存在问题 [Problems with Chinese search engines], 2009, 北京正望咨询; 搜索引擎市場調查報告專題 [Search engine market survey report feature], 新浪網):
• On Google and Baidu, 17% and 24% of users, respectively, cannot find the web pages they want
• 58.6% of users check only the first few pages of results and skip the later pages
• 50% of users have little or no background knowledge of the topic they are about to query
Background - cont
Why do search engines fail to retrieve relevant documents?
• Users do not give enough keywords
Ex. T1 (query) → T1 + T2 + T3 (expected to see)
• Users do not give good keywords
◦ Vocabulary gaps
◦ Lack of domain knowledge
Ex. T1 + T2 (query) → T3 + T4 (expected to see)
Background - cont
Users, particularly children, may run into this problem on Google: misusing a Chinese synonym in the search query lowers precision.
Ex. The user wants to find “深夜食堂” (Midnight Diner) but misuses “深夜酒家” (late-night restaurant) as the query
Background - cont
This happens even with a search engine like Google, which claims to apply many text mining techniques.
Background - cont
Actual test on Google (10 results per page, 4 pages in total)
Ex 1
• The user wants to find “隱形飛機” (stealth aircraft) but uses “私密飛行器” (private flying vehicle) as the query; precision reaches only 5% (2/40)
• Judged by whether the user gets a relevant result on the first page, precision is 0% (0/10)
Ex 2
• The user wants to find “深夜食堂” but uses “深夜酒家” as the query; precision reaches only 4%
• None of the relevant results appear on the first page: 0% (0/10)
Background - cont
Solutions to improve this problem:
• Global methods
◦ Query expansion/reformulation
  – Thesauri (e.g. WordNet)
  – Automatic thesaurus generation
• Local methods
◦ Relevance feedback
◦ Pseudo-relevance feedback
Background - cont
Based on our observation, Chinese WordNet provides insufficient information for query expansion compared with English WordNet (109,000 synonym sets).
Background - cont
Another major problem when using a thesaurus such as WordNet for query expansion:
“too many noise words that cannot be found in the search engine”
Ex. “私密飛行器” as the input for query expansion
After expansion (a1, a2, a3, ... + b1, b2, ..., b5):
秘密 鬼祟 暗地 隱蔽 隱形 私密 隱秘
噴氣機 座機 飛行器 飛機
Background - cont
Google Chrome Extension
Purpose
The idea of this paper is to:
Reduce the impact of misused Chinese synonyms by suggesting keywords to the user through a Google Chrome browser extension. The extension uses CKIP’s E-HowNet for query expansion and the data mining algorithm Apriori to analyze the retrieved web pages; the resulting association rules filter out noise words and improve overall precision.
Related works
E-HowNet (Extended-HowNet Ontology, 廣義知網知識本體)
• E-HowNet is an entity-relation model for lexical semantic representation, extended from HowNet.
• {clothing|衣物} [衣衫]
  – 鞋子|shoes [木屐, 木鞋, 球鞋, 溜冰鞋, 靴子]
  – 褲子|trousers [褲子, 運動褲]
  – 內衣|underwear [內衣]
  – 禮服|ceremonial robe/dress [禮服, 白紗, 婚紗]
The Apriori Algorithm
An influential algorithm for mining frequent itemsets for Boolean association rules.
Key concepts:
• Frequent itemsets: the sets of items that have at least minimum support (L_i denotes the frequent i-itemsets).
• Apriori property: any subset of a frequent itemset must be frequent.
• Join operation: to find L_k, a set of candidate k-itemsets is generated by joining L_{k-1} with itself.
(Source: State University of New York, Stony Brook)
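The key concepts above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function name `apriori`, the toy `baskets`, and the 0.5 threshold are all assumptions made up for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset
        return sum(itemset <= t for t in transactions) / n

    # L1: frequent 1-itemsets
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        frequent.update({s: support(s) for s in current})
        k += 1
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Toy transactions, invented for illustration only
baskets = [
    {"stealth", "aircraft", "radar"},
    {"stealth", "aircraft"},
    {"aircraft", "radar"},
    {"stealth", "aircraft", "missile"},
]
result = apriori(baskets, min_support=0.5)
```

With these toy baskets, {stealth, aircraft, radar} is pruned before its support is ever counted, because its subset {stealth, radar} is not frequent.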
Relevant research
Web Query Expansion by WordNet
Zhiguo Gong, Chan Wa Cheang, and Leong Hou U, 2005
• Aimed at web image queries
• Expands web queries using WordNet and TSN (Term Semantic Network) to extend the scope of the original query
• TSN works as a keyword filter based on Apriori
Relevant research – cont
Apriori-based query expansion for Chinese IR
Architecture
Interface
Interface - cont
Interface – cont
Interface – cont
Interface – cont
Query expansion through E-HowNet
A user may query “秘密噴氣機” (secret jet) when looking for “隱形飛機” (stealth aircraft)
• After Chinese word segmentation: 秘密(VH) 噴氣機(Na)
• Query E-HowNet with “秘密” and “噴氣機”
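Once each segmented word has a synonym set, candidate queries can be formed by combining one synonym per word. The sketch below assumes the synonym sets are already available; the E-HowNet lookup itself is not shown, and the hard-coded lists and the helper name `expand_query` are illustrative assumptions.

```python
from itertools import product

# Hypothetical synonym sets, as if returned by an E-HowNet lookup
modifiers = ["秘密", "隱形", "隱秘"]
heads = ["噴氣機", "飛機", "飛行器"]


def expand_query(*synonym_sets):
    """Build every candidate query by taking one synonym from each set."""
    return ["".join(parts) for parts in product(*synonym_sets)]


candidates = expand_query(modifiers, heads)
```

Three modifiers and three heads give nine candidates. The count grows multiplicatively with each synonym set, which is why a noise filter is needed afterwards.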
Feature selection
• Use Jieba (结巴中文分词) with a Traditional Chinese (繁體) dictionary rather than CKIP, for performance and ease of prototype integration
• Extract only verbs, nouns, and adjectives, by POS tag, from every sentence
Ex.
深夜/a 食堂/n 维基百科/nz 自由/a 的/uj 百科全书/nz
深夜/a 食堂/n 深夜/a 食堂/n 漫画/n 深夜/t 食堂/n 在线漫画/n 动漫/n 之家/r 漫画网/n
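The POS filter can be sketched on already-tagged text in the `word/tag` form shown in the example. Matching tags by their first letter (`n`, `v`, `a`) is an assumption of this sketch, and `select_features` is a hypothetical helper; the segmentation step itself is not reproduced here.

```python
# Tag families to keep: nouns, verbs, adjectives (prefix match is an assumption)
KEEP_PREFIXES = ("n", "v", "a")


def select_features(tagged_line):
    """Keep the words whose POS tag starts with one of KEEP_PREFIXES."""
    kept = []
    for token in tagged_line.split():
        word, _, tag = token.rpartition("/")
        if word and tag.startswith(KEEP_PREFIXES):
            kept.append(word)
    return kept


line = "深夜/a 食堂/n 维基百科/nz 自由/a 的/uj 百科全书/nz"
features = select_features(line)
```

Here the particle 的/uj is dropped and the other five words are kept.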
Feature selection - cont
• Use TF-IDF to pick the words with high values, which filters out some filler words (“綴詞”)
ex. 年、月、日、你、我、他、段
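A minimal TF-IDF sketch shows why a word such as 年 is filtered out. The three toy documents are made up for illustration, each document is assumed to be non-empty, and the plain `tf * log(N / df)` weighting is one common variant rather than necessarily the one used in this work.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: score} dict per document, using tf * log(N / df)."""
    n = len(docs)
    # Document frequency: in how many documents each word occurs
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n / df[word])
            for word, count in counts.items()
        })
    return scores


# Toy token lists, invented for illustration only
docs = [
    ["深夜", "食堂", "漫画", "年"],
    ["深夜", "食堂", "年", "月"],
    ["隱形", "飛機", "年", "月"],
]
scores = tf_idf(docs)
```

Because 年 occurs in every document, its inverse document frequency is log(1) = 0, so it scores zero everywhere and falls below any positive cutoff.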
Association rules from web pages using Apriori
Webpage A: 隱形 戰機 维基百科 自由 百科全书 维基 百科 飛機
Webpage B: 美 部署 隱形 飛機 對抗 中國
Webpage C: 隱形 飛機 是 降低 飛機 電 光 聲 可 探測 特徵 使 雷達 探測器
Crucial info (rules): 隱形 飛機
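Treating each web page as one transaction, the support and confidence behind a rule such as 隱形 → 飛機 can be computed directly. The token sets below are transcribed from the three example pages; the helper names `support` and `confidence` are assumptions of this sketch.

```python
pages = {
    "A": {"隱形", "戰機", "维基百科", "自由", "百科全书", "维基", "百科", "飛機"},
    "B": {"美", "部署", "隱形", "飛機", "對抗", "中國"},
    "C": {"隱形", "飛機", "是", "降低", "電", "光", "聲", "可", "探測",
          "特徵", "使", "雷達", "探測器"},
}


def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent) / support(antecedent)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))


transactions = list(pages.values())
sup = support({"隱形", "飛機"}, transactions)
conf = confidence({"隱形"}, {"飛機"}, transactions)
```

Both words occur on all three pages, so the pair has full support and the rule has full confidence, whereas a word such as 雷達 appears on only one page.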
Noise word filter
Before the filtering process:
秘密 鬼祟 暗地 隱蔽 隱形 私密 隱秘
噴氣機 座機 飛行器 飛機
After the filtering process (min_support: 0.1, min_confidence: 0.75):
隱形飛機 秘密飛機
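One way to apply the two thresholds is to keep a candidate only if its words co-occur often enough in the retrieved pages. The slides do not spell out how rules are mapped back to candidate queries, so the function `filter_candidates` and the toy pages below are assumptions for illustration.

```python
from itertools import product


def filter_candidates(modifiers, heads, pages,
                      min_support=0.1, min_confidence=0.75):
    """Keep modifier+head pairs whose rule modifier -> head passes both thresholds."""
    n = len(pages)
    kept = []
    for modifier, head in product(modifiers, heads):
        with_modifier = sum(1 for p in pages if modifier in p)
        with_both = sum(1 for p in pages if modifier in p and head in p)
        if with_modifier == 0:
            continue  # the modifier never occurs, so no rule can be formed
        support = with_both / n
        confidence = with_both / with_modifier
        if support >= min_support and confidence >= min_confidence:
            kept.append(modifier + head)
    return kept


# Toy page token sets, invented for illustration only
pages = [
    {"隱形", "飛機", "雷達"},
    {"隱形", "飛機", "戰機"},
    {"秘密", "飛機", "任務"},
    {"鬼祟", "行為"},
    {"隱形", "飛機"},
]
suggestions = filter_candidates(["隱形", "秘密", "鬼祟"], ["飛機", "座機"], pages)
```

On this toy data only 隱形飛機 and 秘密飛機 survive; combinations whose words never co-occur on a page are dropped as noise.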
Experiment Result
Experiment Setup
• Simulate users issuing imprecise queries
• For 4 topics (隱形飛機 [stealth aircraft], 飢餓遊戲 [The Hunger Games], 威力彩 [a Taiwanese lottery game], 深夜食堂 [Midnight Diner]), download web pages from Google (24 pages per candidate query, 1320 pages in total) and manually review every page to measure precision against the 4 concepts
• Use precision only; recall and F-measure are not used to estimate performance.
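For reference, precision here is the share of reviewed results that are relevant to the intended topic. A minimal sketch using figures from the background examples (2 relevant results out of 40, and 0 out of 10 on the first page); the function name `precision` is an assumption.

```python
def precision(relevant_retrieved, retrieved):
    """Precision = relevant results retrieved / total results retrieved."""
    if retrieved == 0:
        raise ValueError("no results retrieved")
    return relevant_retrieved / retrieved


p_all_pages = precision(2, 40)   # over 4 pages of 10 results each
p_first_page = precision(0, 10)  # first page only
```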
Experiment Setup – cont
Topic the user intends to query (使用者預查詢的主題), followed by the query terms the user might misuse (使用者可能誤用的查詢字):
深夜食堂: 深夜酒家, 深夜飯館, 深夜飲食店, 夜深食堂, 夜深酒家, 夜深飯館, 夜深飲食店, 半夜食堂, 半夜酒家, 半夜飯館, 半夜飲食店, 深更食堂, 深更酒家, 深更飯館, 深更飲食店
飢餓遊戲: 飢餓玩樂, 飢餓遊憩, 飢餓嬉遊, 飢餓嬉鬧, 挨餓遊戲, 挨餓玩樂, 挨餓遊憩, 挨餓嬉遊, 挨餓嬉鬧, 口燥遊戲, 口燥玩樂, 口燥遊憩, 口燥嬉遊, 口燥嬉鬧, 焦渴遊戲, 焦渴玩樂, 焦渴遊憩, 焦渴嬉遊, 焦渴嬉鬧
威力彩: 神力色彩, 神力顏色, 神力彩, 威力彩色, 威力色彩, 威力顏色, 神力彩色
隱形飛機: 隱形飛行器, 隱形座機, 隱秘飛機, 隱秘飛行器, 隱秘座機, 暗地飛機, 暗地飛行器, 暗地座機, 私密飛機, 私密飛行器, 私密座機, 秘密飛機, 秘密飛行器, 秘密座機, 鬼祟飛機, 鬼祟飛行器, 鬼祟座機
Experiment Setup - cont
The scenarios to validate:
• Users know the concept in advance and merely use an imprecise word as the query.
• Users have little prior knowledge and cannot settle on a query without our suggested terms.
Experiment result – First
Hit rate: 100% (隱形飛機, 飢餓遊戲, 威力彩, 深夜食堂); the correct terms are included after query expansion.
Possibly misused query (可能誤用的查詢字) | Suggestions after Apriori filtering (Apriori 過濾後的建議結果)
神力彩 | 威力彩, 威力顏色, 神力色彩, 神力彩色, 威力彩色
挨餓遊憩 | 飢餓玩樂, 挨餓遊戲, 挨餓嬉鬧, 挨餓遊憩, 飢餓遊戲, 飢餓嬉遊
暗地座機 | 隱形飛機, 秘密飛機
深更酒家 | 夜深酒家, 深夜食堂, 深夜飲食店, 夜深食堂
Experiment result – Second
[Bar chart: average precision, Before vs. After, for 隱形飛機, 深夜食堂, 飢餓遊戲, and 威力彩; vertical axis from 0% to 60%]
Conclusion and future work
Summary
• For users with little prior knowledge, query expansion gives them more options to choose from when revising an imprecise query.
• Query expansion with the online dictionary E-HowNet plus the noise filter improves average precision by around 20%.
Future work
• Improve the keyword set so that it covers more topics and concepts
• Validate on a vetted data-mining dataset with full search engine functionality, using precision and recall measures
• Handle the case where users misuse colloquial words (口語字) by adding another dictionary
• Improve mining performance with other algorithms such as FP-growth
Backup
Problem observations
深夜食堂 (Midnight Diner)
Related topics (關聯主題): drama (戲劇), manga (漫畫), novel (小說), 安倍夜郎, food
E-HowNet synonym combinations (同義字組合):
深夜 深更 更深 夜深 半夜 半夜三更 三更半夜
食堂 飲食店 飯館 酒家 啤酒屋
8 x 5 = 40
Possible query | Google search result | Occurs in first page (within 10 links) | Hit rate % | Chinese word segmentation (中文斷詞)
深夜食堂 | 40/40 | 1 | 100% | 深夜(Nd) 食堂(Nc)
夜深食堂 | 40/40 | 1 | 100% |
半夜食堂 | 34/40 | 1 | 85% |
三更半夜食堂 | 28/40 | 1 | 70% |
深夜飯館 | 26/40 | 1 | 65% |
深更食堂 | 18/40 (found in 10) | 1 | 45% |
深夜啤酒屋 | 10/40 | 1 | 25% |
更深食堂 | 10/40 | 1 | 25% |
深夜酒家 | 2/40 | 0 | 5% |
深夜飲食店 | 1/40 | 0 | 2% |
半夜飯館 | 1/40 | 0 | 2.00% |
深更飲食店 | 0/40 | 0 | 0.00% |
深更啤酒屋 | 0/40 | 0 | 0.00% |
更深酒家 | 0/40 | 0 | 0.00% |
夜深酒家 | 0/40 | 0 | 0.00% |
夜深啤酒屋 | 0/40 | 0 | 0.00% |
三更半夜飲食店 | 0/40 | 0 | 0% |
半夜三更酒家 | 0/40 | 0 | 0% |
半夜飲食店 | 0/40 | 0 | 0% |
AVG | | 42% | 28% |
Queries with over 60% hit rate (high coverage): 5 of 19 (26.30%)
EXCEPTION
半夜居酒屋漫畫 | 6/40 | | 15.00% |
Key index terms with the highest weight
食堂 | 5/40 | Y | 13% |
深夜 | 16/40 | Y | 40% |
Problem observations - cont
隱形飛機 (stealth aircraft)
Related topics (關聯主題): stealth military technology (隐形軍事科技), stealth UFO (隱形UFO), unmanned/drone (无人驾驶)
E-HowNet synonym combinations (同義字組合):
隱秘 隱蔽 秘密 鬼祟 暗地 私密 鬼鬼祟祟
航空器 座機 噴氣機 航空器 飛行器
7 x 5 = 35
Possible query | Google search result (all) | Google search result (military, 軍事) | Occurs in first page (within 10 links) | Hit rate % (all) | Hit rate % (specific) | Chinese word segmentation (中文斷詞)
隱形飛機 | 40/40 | 40/40 | 1 | 100% | 100% | 隱形(VH) 飛機(Na)
隱形飛行器 | 39/40 | 39/40 | 1 | 97.50% | 97.50% | 隱形(VH) 飛行器(Na)
隱形噴氣機 | 35/40 | 35/40 | 1 | 87.50% | 87.50% | 隱形(VH) 噴氣機(Na)
隱形噴氣機 | 34/40 | 34/40 | 1 | 85% | 85% | 隱形(VH) 噴氣機(Na)
隱形座機 | 32/40 | 32/40 | 1 | 80% | 80% | 隱形(VH) 座機(Na)
隱秘飛機 | 20/40 (duplicate articles: 17/20) | 20/40 (duplicate articles: 17/20) | 1 | 50.00% | 50.00% | 隱秘(VH) 飛機(Na)
隱秘噴氣機 | 19/40 (duplicate articles: 16/19) | 19/40 (duplicate articles: 16/19) | 0 | 48% | 48% | 秘密(VH) 噴氣機(Na)
暗地飛行器 | 7/40 | 1/40 | 0 | 17.50% | 2.00% | 暗地(D) 飛行器(Na)
隱蔽飛機 | 6/40 | 4/40 | 0 | 15.00% | 10.00% | 隱蔽(VH) 飛機(Na)
私密飛行器 | 5/40 | 2/40 | 0 | 12.50% | 5.00% | 私密(VH) 飛行器(Na)
秘密飛機 | 6/40, 无人驾驶 (unmanned/drone), 9 | 2/40 | 0 | 7.50% | 5.00% | 秘密(VH) 飛機(Na)
暗地噴氣機 | 3/40 | 2/40 | 1 | 7.50% | 5.00% | 暗地(D) 噴氣機(Na)
鬼祟飛機 | 0/40 | 0/40 | 0 | 0.00% | 0.00% | 鬼祟(VH) 飛機(Na)
私密飛機 | 0/40 | 0/40 | 0 | 0.00% | 0.00% | 私密(VH) 飛機(Na)
鬼鬼祟祟飛機 | 0/40 | 0/40 | 0 | 0.00% | 0.00% | 鬼鬼祟祟(VH) 飛機(Na)
隱蔽座機 | 0/40 | 0/40 | 0 | 0.00% | 0.00% | 隱蔽(VH) 座機(Na)
鬼祟噴氣機 | 0/40 | 0/40 | 0 | 0.00% | 0.00% | 鬼祟(VH) 噴氣機(Na)
Avg | | | 41% | 36% | 34% |
Queries with over 60% hit rate (high coverage): 5 of 17 (29.00% all, 29.00% specific)
EXCEPTION
看不見的飛機 (invisible aircraft) | 3/40 | | | 7.50% | | 看(VC) 不見(VH) 的(DE) 飛機(Na)
Key index terms with the highest weight
隱形 | 1/20 | | Y | 5% | |
飛機 | 0/40 | | | 0% | |
Problem observations - cont
[Chart: topic coverage % and first-page occurrence (yes/no) for each 深夜食堂 query variant, from 深夜食堂 through 半夜飲食店; horizontal axis from 0 to 1.2]

Mais conteúdo relacionado

Semelhante a Web query expansion based on association rules mining with e hownet and google chrome extension (release)

Andrew Polaszek - ZooBank: ICZN’s open-access web-based register of all new a...
Andrew Polaszek - ZooBank: ICZN’s open-access web-based register of all new a...Andrew Polaszek - ZooBank: ICZN’s open-access web-based register of all new a...
Andrew Polaszek - ZooBank: ICZN’s open-access web-based register of all new a...
ICZN
 

Semelhante a Web query expansion based on association rules mining with e hownet and google chrome extension (release) (20)

WWW2013: Web Usage Mining with Semantic Analysis
WWW2013: Web Usage Mining with Semantic AnalysisWWW2013: Web Usage Mining with Semantic Analysis
WWW2013: Web Usage Mining with Semantic Analysis
 
BSides LA/PDX
BSides LA/PDXBSides LA/PDX
BSides LA/PDX
 
Semantic search: from document retrieval to virtual assistants
Semantic search: from document retrieval to virtual assistantsSemantic search: from document retrieval to virtual assistants
Semantic search: from document retrieval to virtual assistants
 
API Best Practices
API Best PracticesAPI Best Practices
API Best Practices
 
Context-Aware Access Control for RDF Graph Stores
Context-Aware Access Control for RDF Graph StoresContext-Aware Access Control for RDF Graph Stores
Context-Aware Access Control for RDF Graph Stores
 
Lessons From Spider Support
Lessons From Spider SupportLessons From Spider Support
Lessons From Spider Support
 
⛳️ Votre API passe-t-elle le contrôle technique ?
⛳️ Votre API passe-t-elle le contrôle technique ?⛳️ Votre API passe-t-elle le contrôle technique ?
⛳️ Votre API passe-t-elle le contrôle technique ?
 
Speech-Enabling Web Apps
Speech-Enabling Web AppsSpeech-Enabling Web Apps
Speech-Enabling Web Apps
 
Managing a R&D Lab with Foreman
Managing a R&D Lab with ForemanManaging a R&D Lab with Foreman
Managing a R&D Lab with Foreman
 
如何建置關鍵字精靈 How to Build an Keyword Wizard
如何建置關鍵字精靈 How to Build an Keyword Wizard如何建置關鍵字精靈 How to Build an Keyword Wizard
如何建置關鍵字精靈 How to Build an Keyword Wizard
 
Andrew Polaszek - ZooBank: ICZN’s open-access web-based register of all new a...
Andrew Polaszek - ZooBank: ICZN’s open-access web-based register of all new a...Andrew Polaszek - ZooBank: ICZN’s open-access web-based register of all new a...
Andrew Polaszek - ZooBank: ICZN’s open-access web-based register of all new a...
 
Better Machine Learning with Less Data - Slater Victoroff (Indico Data)
Better Machine Learning with Less Data - Slater Victoroff (Indico Data)Better Machine Learning with Less Data - Slater Victoroff (Indico Data)
Better Machine Learning with Less Data - Slater Victoroff (Indico Data)
 
Call Control Power Tools with Adhearsion
Call Control Power Tools with AdhearsionCall Control Power Tools with Adhearsion
Call Control Power Tools with Adhearsion
 
AdhearsionConf 2013 Keynote
AdhearsionConf 2013 KeynoteAdhearsionConf 2013 Keynote
AdhearsionConf 2013 Keynote
 
Advanced online search through the web
Advanced online search through the webAdvanced online search through the web
Advanced online search through the web
 
Call Control Power Tools with Adhearsion
Call Control Power Tools with Adhearsion Call Control Power Tools with Adhearsion
Call Control Power Tools with Adhearsion
 
SMX Munich 2018 - Current State of JavaScript SEO
SMX Munich 2018 - Current State of JavaScript SEOSMX Munich 2018 - Current State of JavaScript SEO
SMX Munich 2018 - Current State of JavaScript SEO
 
HPC For Bioinformatics
HPC For BioinformaticsHPC For Bioinformatics
HPC For Bioinformatics
 
Web Performance 101
Web Performance 101Web Performance 101
Web Performance 101
 
Voice Applications with Adhearsion @ ATLAUG 2012
Voice Applications with Adhearsion @ ATLAUG 2012Voice Applications with Adhearsion @ ATLAUG 2012
Voice Applications with Adhearsion @ ATLAUG 2012
 

Mais de Paul Yang

A Special-Purpose Peer-to-Peer File Sharing System for Mobile ad Hoc Networks...
A Special-Purpose Peer-to-Peer File Sharing System for Mobile ad Hoc Networks...A Special-Purpose Peer-to-Peer File Sharing System for Mobile ad Hoc Networks...
A Special-Purpose Peer-to-Peer File Sharing System for Mobile ad Hoc Networks...
Paul Yang
 

Mais de Paul Yang (19)

release_python_day4_slides_201606_1.pdf
release_python_day4_slides_201606_1.pdfrelease_python_day4_slides_201606_1.pdf
release_python_day4_slides_201606_1.pdf
 
release_python_day3_slides_201606.pdf
release_python_day3_slides_201606.pdfrelease_python_day3_slides_201606.pdf
release_python_day3_slides_201606.pdf
 
release_python_day1_slides_201606.pdf
release_python_day1_slides_201606.pdfrelease_python_day1_slides_201606.pdf
release_python_day1_slides_201606.pdf
 
release_python_day2_slides_201606.pdf
release_python_day2_slides_201606.pdfrelease_python_day2_slides_201606.pdf
release_python_day2_slides_201606.pdf
 
RHEL5 XEN HandOnTraining_v0.4.pdf
RHEL5 XEN HandOnTraining_v0.4.pdfRHEL5 XEN HandOnTraining_v0.4.pdf
RHEL5 XEN HandOnTraining_v0.4.pdf
 
Intel® AT-d Validation Overview v0_3.pdf
Intel® AT-d Validation Overview v0_3.pdfIntel® AT-d Validation Overview v0_3.pdf
Intel® AT-d Validation Overview v0_3.pdf
 
HP Performance Tracking ADK_part1.pdf
HP Performance Tracking ADK_part1.pdfHP Performance Tracking ADK_part1.pdf
HP Performance Tracking ADK_part1.pdf
 
HP Performance Tracking ADK part2.pdf
HP Performance Tracking ADK part2.pdfHP Performance Tracking ADK part2.pdf
HP Performance Tracking ADK part2.pdf
 
Determination of Repro Rates 20140724.pdf
Determination of Repro Rates 20140724.pdfDetermination of Repro Rates 20140724.pdf
Determination of Repro Rates 20140724.pdf
 
Debug ADK performance issue 20140729.pdf
Debug ADK performance issue 20140729.pdfDebug ADK performance issue 20140729.pdf
Debug ADK performance issue 20140729.pdf
 
A Special-Purpose Peer-to-Peer File Sharing System for Mobile ad Hoc Networks...
A Special-Purpose Peer-to-Peer File Sharing System for Mobile ad Hoc Networks...A Special-Purpose Peer-to-Peer File Sharing System for Mobile ad Hoc Networks...
A Special-Purpose Peer-to-Peer File Sharing System for Mobile ad Hoc Networks...
 
A brief study on bottlenecks to Intel vs. Acer v0.1.pdf
A brief study on bottlenecks to Intel vs. Acer v0.1.pdfA brief study on bottlenecks to Intel vs. Acer v0.1.pdf
A brief study on bottlenecks to Intel vs. Acer v0.1.pdf
 
出租店系統_楊曜年_林宏庭_OOD.pdf
出租店系統_楊曜年_林宏庭_OOD.pdf出租店系統_楊曜年_林宏庭_OOD.pdf
出租店系統_楊曜年_林宏庭_OOD.pdf
 
Arm Neoverse market update_05122020.pdf
Arm Neoverse market update_05122020.pdfArm Neoverse market update_05122020.pdf
Arm Neoverse market update_05122020.pdf
 
Building PoC ready ODM Platforms with Arm SystemReady v5.2.pdf
Building PoC ready ODM Platforms with Arm SystemReady v5.2.pdfBuilding PoC ready ODM Platforms with Arm SystemReady v5.2.pdf
Building PoC ready ODM Platforms with Arm SystemReady v5.2.pdf
 
Routing Security and Authentication Mechanism for Mobile Ad Hoc Networks
Routing Security and Authentication Mechanism for Mobile Ad Hoc NetworksRouting Security and Authentication Mechanism for Mobile Ad Hoc Networks
Routing Security and Authentication Mechanism for Mobile Ad Hoc Networks
 
Clients developing_chunghwa telecom
Clients developing_chunghwa telecomClients developing_chunghwa telecom
Clients developing_chunghwa telecom
 
English teaching in icebreaker and grammar analysis
English teaching in icebreaker and grammar analysisEnglish teaching in icebreaker and grammar analysis
English teaching in icebreaker and grammar analysis
 
Study mapapi v0.1
Study mapapi v0.1Study mapapi v0.1
Study mapapi v0.1
 

Último

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Último (20)

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 

Web query expansion based on association rules mining with e hownet and google chrome extension (release)

  • 1. Web query expansion based on association rules mining with eHownet and Google chrome extension 楊曜年 Paul Yang
  • 2. Outline  Introduction ◦ Background ◦ Purpose  Related works ◦ eHownet ◦ Aprori algorithm ◦ Relevant research  Aprori-based query expansion for Chinese IR ◦ System arch ◦ Chinese Word Segmentation / Feature selection ◦ Query Expansion by eHownet ◦ Aprori-based noise word filter  Experimental results  Conclusions and future works
  • 4. Background According to the marketing survey report (中文搜索引擎 存在问题, 2009, 北京正望咨询; 搜索引擎市場調查報告專題, 新浪網)  In Google and Baidu, 17% and 24% of user can’t find the web-pages they want  58.6% of user just checks the first few pages and skips the later pages  50% of user has a little or no knowledge background for topic they’re going to query
  • 5. Background - cont Why Search Engines Fail to Search ? Relevant Documents? • Users do not give sufficient number of keywords Ex. T1 (query)  T1 + T2 + T3 (expect to see) • Users do not give good keywords ◦ Vocabulary gaps ◦ Lack of domain knowledge Ex. T1 + T2 (query)  T3 + T4 (expect to see)
  • 6. Background - cont Users, particularly, children, may be suffered with the problem in Google because of misusing Chinese synonyms as search query to cause a decrease in precision Ex Wanna find”深夜食堂” but misuse ” 深夜酒家” as query
  • 7. Background - cont the search engine like Google that claims it has many text mining techs !
  • 8. Background - cont The actual test on Google (10 results per page, total 4 pages ) Ex 1 • User wants to find “隱形飛機 ”but uses “私密飛行器”as query, the results shows its precision rate only reaches 5% (2/40) • If we evaluate based on whether user can get the result in the first page, the precision rate reaches 0% (2/40) Ex 2 • find”深夜食堂” but use ” 深夜酒家” as query, the results shows its precision rate only reaches 4% • Get NONE of the related results in first page 0% ( 0/10)
  • 9. Background - cont Solutions to improve this problem:  Global methods ◦ Query expansion/reformulation  Thesauri (ex. WordNet)  Automatic thesaurus generation  Local methods ◦ Relevance feedback ◦ Pseudo relevance feedback
  • 10. Background - cont Based on the our observation, compared with English WorldNet (109,000 synonym sets), Chinese WorldNet provides insufficient info for query expansion.
  • 11. Background - cont Another major problem if we use the Thesauri like WordNet for query expansion “too many noise words which cannot be found in search engine” Ex. “私密飛行器” as an input for query expansion After expansion : (a1 a2 a3, + b1, b2,……,b5) 秘密 鬼祟 暗地 隱蔽 隱形 私密 噴氣機 座機 飛行器 飛機 隱秘
  • 12. Background - cont Google Chrome Extension
  • 13. Purpose The idea of this paper is to: Improve misuse of Chinese synonym by giving user our suggested keywords based on Google browser extension using CKIP’s eHownet for query expansion and data mining algorithm “Aprori” to analyze the retrieved web-pages to get the association rules for filtering the noise word to improve the overall precision.
  • 15. 廣義知網知識本體(Extended- HowNet Ontology)  E-HowNet is an entity-relation model for lexical semantic representation extended from HowNet.  {clothing|衣物} [衣衫]  – 鞋子|shoes [木屐, 木鞋,球鞋, 溜冰鞋, 靴子]  – 褲子|trousers [褲子, 運動褲]  – 內衣|underwear [內衣]  – 禮服|ceremonial robe/dress [禮服, 白紗,婚  紗]
  • 16. The Apriori Algorithm An influential algorithm for mining frequent itemsets for boolean association rules. Key Concepts : • Frequent Itemsets: The sets of item which has minimum support (denoted by Li for ith-Itemset). • Apriori Property: Any subset of frequent itemset must be frequent. • Join Operation: To find Lk , a set of candidate k-itemsets is generated by joining Lk-1 with itself. State University of New York, Stony Brook 16
  • 17. Web Query Expansion by WordNet Zhiguo Gong, Chan Wa Cheang, and Leong Hou U 2005  Aim for a Web Image queries  Web query expansion by using WordNet and TSN to extend the scope of the original query  TSN works as Keyword Filtering based on Aprori Relevant researches
  • 26. Query expansion through ehownet May use query of “秘密噴氣機”for “隱形飛機” • After Chinese word segment: 秘密(VH) 噴氣機(Na ) • Query eHowNet by “秘密” and “噴氣機”
  • 27. Feature selection • Use Jieba (结巴中文分词) with 繁體 dictionary rather than CKIP due to performance and prototype integration concern • Extract only Verb, noun and ADJ of POS tag from every sentence 深夜/a 食堂/n 维基百科/nz 自由/a 的/uj 百科全书/nz 深夜/a 食堂/n 深夜/a 食堂/n 漫画/n 深夜/t 食堂/n 在线漫画/n 动漫/n 之家/r 漫画 网/n EX.
  • 28. Feature selection - cont • Use TF/IDF to pick the words with high values, which is able to filter some “綴詞” ex. 年、月、日、你、我、他、段
  • 29. Association Rules from webpage using Aprori 隱形 戰機 维基百科 自由 百科全书 维基 百科 飛機 Webpage A 美 部署 隱形 飛機 對抗 中國 Webpage B 隱形 飛機 是 降低 飛機 電 光 聲 可 探 測 特徵 使 雷達 探 測器 Webpage C 隱形 飛機 Curial info (Rules)
  • 30. Noise word filter Before the filtering process : After the filtering process (min_support: 0.1 & mid_confidence: 0.75) : 秘密 鬼祟 暗地 隱蔽 隱形 私密 噴氣機 座機 飛行器 飛機 隱秘 隱形飛機 秘密飛機
  • 32. Experiment Setup  Simulate that user may use imprecise query  Base on 4 topics (隱形飛機, 飢餓遊戲, 威力 彩, 深夜食堂) to download the webpage from Google (24 pages per possible query, totally 1320 pages) and manually review all pages to count the precision against 4 concepts  Use solely precision measure, no recall and F- measure to estimate the performance.
  • 33. Experiment Setup – cont 使用者預查詢的主題 深夜食堂 飢餓遊戲 威力彩 隱形飛機 使用者可能誤用的查詢字 深夜酒家 飢餓玩樂 神力色彩 隱形飛行器 深夜飯館 飢餓遊憩 神力顏色 隱形座機 深夜飲食店 飢餓嬉遊 神力彩 隱秘飛機 夜深食堂 飢餓嬉鬧 威力彩色 隱秘飛行器 夜深酒家 挨餓遊戲 威力色彩 隱秘座機 夜深飯館 挨餓玩樂 威力顏色 暗地飛機 夜深飲食店 挨餓遊憩 神力彩色 暗地飛行器 半夜食堂 挨餓嬉遊 暗地座機 半夜酒家 挨餓嬉鬧 私密飛機 半夜飯館 口燥遊戲 私密飛行器 半夜飲食店 口燥玩樂 私密座機 深更食堂 口燥遊憩 秘密飛機 深更酒家 口燥嬉遊 秘密飛行器 深更飯館 口燥嬉鬧 秘密座機 深更飲食店 焦渴遊戲 鬼祟飛機 焦渴玩樂 鬼祟飛行器 焦渴遊憩 鬼祟座機 焦渴嬉遊 焦渴嬉鬧
  • 34. Experiment Setup - cont The scenarios to validate:  Users have pre-knowledge to concept and just uses imprecise word as query.  Users have a few pre-knowledge but can’t determine query by our suggesting terms
  • 35. Experiment result – First Hit rate: 100% (隱形飛機 , 飢餓遊戲 威力彩 深夜食堂) , the correct terms are included after query expand . 可能誤用的查詢字 神力彩 挨餓遊憩 暗地座機 深更酒家 Aprori 過濾後的建議結 果 威力彩, 威 力顏色, 神 力色彩, 神力彩色, 威力彩色 飢餓玩樂, 挨餓遊 戲,挨餓嬉 鬧, 挨餓遊憩 飢餓遊 戲,飢餓嬉 遊 隱形飛機, 秘密飛機 夜深酒家, 深夜食堂, 深夜飲食店, 夜深食堂
  • 36. Experiment result – Second 0% 10% 20% 30% 40% 50% 60% 隱形飛機 深夜食堂 飢餓遊戲 威力彩 Before After Average precision
  • 38. Summary  For user having a few knowledge, the query expansion can let user have more option to choose and modify its imprecise query.  Query expansion with online dictionary ehownet + noise filter improves the average precision around 20%  Improve the keyword set containing many topics and concepts  Use the validated dataset of data-mining with the full search engine function to validate based on precision/recall measure  Improve the case that user may misuse spoken word (口語字) by other dictionary  Improve the mining performance by other algorithms like FP-growth
  • 40. Problem observations 深夜食堂: 關聯主題: 戲劇 漫畫 小說 安倍夜郎 FOOD eHownet同義字組合: 深夜 深更 更深 夜深 半夜 半夜三更 三 更半夜 食堂 飲食店 飯館 酒家 啤酒屋 8x 5 = 40 The possible query Google Search result occurs in first page (within 10 links) hit rate % 中文斷詞 深夜食堂 40/40 1 100% 深夜(Nd) 食堂 (Nc) 夜深食堂 40/40 1 100% 半夜食堂 34/40 1 85% 三更半夜食堂 28/40 1 70% 深夜飯館 26/40 1 65% 深更食堂 18/40 found in 10 1 45% 深夜啤酒屋 10/40 1 25% 更深食堂 10/40 1 25% 深夜酒家 2/40 0 5% 深夜飲食店 1/40 0 2% 半夜飯館 1/40 0 2.00% 深更飲食店 0/40 0 0.00% 深更啤酒屋 0/40 0 0.00%
  • 41. 更深酒家 0/40 0 0.00% 夜深酒家 0/40 0 0.00% 夜深啤酒屋 0/40 0 0.00% 三更半夜飲食店 0/40 0 0% 半夜三更酒家 0/40 0 0% 半夜飲食店 0/40 0 0% AVG 42% 28% The query over 60% hit (high covergae) 05:19 26.30% EXCEPTION 半夜居酒屋漫畫 6/40 15.00% Key index in the highest weight 食堂 5/40 Y 13% 深夜 16/40 Y 40% Problem observations - cont
  • 42. 隱形飛機 關聯主題: 隐形軍事科技, 隱形UFO, 无人驾驶, ehownet 同義字 組合: 隱秘 隱蔽 秘密 鬼祟 暗地 私密 鬼鬼祟祟 航空器 座機 噴氣機 航空器 飛行器 7 X 5 = 35 可能query word Google Search result (All) Google Search result 軍事 occurs in first page (within 10 links) hit rate % (All) hit rate % (specific) 中文斷詞 隱形飛機 40/40 40/40 1 100% 100%隱形(VH) 飛機(Na) 隱形飛行器 39/40 39/40 1 97.50% 97.50%隱形(VH) 飛行器(Na) 隱形噴氣機 35/40 35/40 1 87.50% 87.50%隱形(VH) 噴氣機(Na) 隱形噴氣機 34/40 34/40 1 85% 85%隱形(VH) 噴氣機(Na) 隱形座機 32/40 32/40 1 80% 80%隱形(VH) 座機(Na) 隱秘飛機 20/40 (重複一樣的 文章:17/20) 20/40 (重複一樣的 文章:17/20) 1 50.00% 50.00%隱秘(VH) 飛機(Na) 隱秘噴氣機 19/40 (重複一樣的 文章: 16/19) 19/40 (重複一樣的 文章: 16/19) 0 48% 48%秘密(VH) 噴氣機(Na) 暗地飛行器 7/40 1/40 0 17.50% 2.00%暗地(D) 飛行器(Na) 隱蔽飛機 6/40 4/40 0 15.00% 10.00%隱蔽(VH) 飛機(Na) 私密飛行器 5/40 2/40 0 12.50% 5.00%私密(VH) 飛行器(Na) 秘密飛機 6/40, 无人驾驶, 9 2/40 0 7.50% 5.00%秘密(VH) 飛機(Na) 暗地噴氣機 3/40 2/40 1 7.50% 5.00%暗地(D) 噴氣機(Na) 鬼祟飛機 0/40 0/40 0 0.00% 0.00%鬼祟(VH) 飛機(Na) Problem observations - cont
  • 43. Problem observations - cont 私密飛機 0/40 0/40 0 0.00% 0.00%私密(VH) 飛機(Na) 鬼鬼祟祟飛機 0/40 0/40 0 0.00% 0.00% 鬼鬼祟祟(VH) 飛機 (Na) 隱蔽座機 0/40 0/40 0 0.00% 0.00%隱蔽(VH) 座機(Na) 鬼祟噴氣機 0/40 0/40 0 0.00% 0.00%鬼祟(VH) 噴氣機(Na) Avg 41% 36% 34% The query over 60% hit (high covergae) 05:17 29.00% 29.00% EXCEPTION 看不見的飛機 3/40 7.50% 看(VC) 不見(VH) 的 (DE) 飛機(Na) Key index in the highest weight 隱形 1/20 Y 5% 飛機 0/40 0%
  • 44. Problem observations - cont [Chart: topic coverage % (0 to 1.2) and occurrence in first page (Yes/No) for each of the 19 深夜食堂 query variants, from 深夜食堂 to 半夜飲食店]

Editor's Notes

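  1. The survey results show that Chinese search still has room for improvement: users choose different search terms depending on how familiar they are with a topic, which can greatly affect the results, and most users judge the effectiveness of a search without looking at many results.
     Question 1 asked "Which of the following search engines do you use often?" (multiple answers allowed). As Figure 1 shows, Baidu was the most widely used among respondents (91.7%), followed by Google (52.2%); 中搜 and openfind were used little, each below 2%.
     Question 3 asked "What preparation do you usually do before looking for information with a search engine?" The option "identify search terms related to the topic (such as near-synonyms) for later use" was chosen most often (38.2%), followed by "no preparation at all" (35.1%); the least chosen was "learn some background knowledge, then choose search terms" (26.8%). Most users therefore do some preparation before searching, but a considerable share search with no preparation or only rough preparation.
     Question 4 asked "When the results returned by the search engine do not meet your needs, you usually:" to survey how respondents adjust their searches. The largest share (37.2%) chose "browse some of the results to gather relevant information, then choose new search terms"; next were "replace the search terms directly and search again" (27.8%) and "search again using synonyms or related words of the previous terms" (26.1%); the least chosen was "switch to another search engine" (8.9%). A large share of users thus treat searching as a learning process: when results are unsatisfactory they prefer to change their search terms and do not readily switch to another search engine.
     Question 5 concerned how respondents browse, that is, select from, search results. For "When viewing search results, you:", the option "only look at the first few pages" was chosen most often (58.6%), followed by "look at a few pages at random" (23.1%); the least chosen was "read page by page to the end" (16.5%).
     In the first task, the search topic was "the principle of firefly luminescence". As Figure 3 shows, users' first choice was the original natural-language phrase “萤火虫发光的原理” (44%); next was the keyword combination obtained by splitting the concepts, “萤火虫发光原理” (30.2%); 16.9% of respondents chose “萤火虫发光”; the least chosen was “萤火虫发光为什么” (7%). This indicates that users are not familiar with near-synonyms and similar devices.
     Search engines generally use keyword retrieval, but in many cases users cannot accurately express the information they really need with keywords or keyword combinations, and this difficulty of expression makes retrieval difficult. To obtain information more conveniently and of higher quality, users would need to master certain search rules rather than only combining keywords. What users want is a "foolproof" retrieval system that lets them search without mastering a mass of complicated rules. (Source: 2009 search engine market survey report feature, 2009年搜索引擎市场调查报告专题.)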
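  2. Based on the above, the problem can be reduced to two cases, represented in the 3D vector space of Figure 4.
     Case 1: the user does not give the search engine enough keywords. Query = T1 + T2; ExpectToSee = T1 + T2 + T3. The search engine returns every page that satisfies T1 + T2, but the user expects results for T1 + T2 + T3. The region of the vector space covered by the query is too large, so precision and recall are both very low.
     Case 2: the user gives Chinese keywords that are not precise enough. Query = T1 + T2 + T3; ExpectToSee = T3 + T4 + T5. The search engine returns pages that satisfy T1 + T2, but the user expects T3 + T4. The vector distance between the query and the relevant documents the user wants is too large, so precision and recall are again very low. Causes: vocabulary gaps, the diversity and vastness of the web, and lack of domain knowledge.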
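Both cases in the note above come down to low precision and recall. As a hedged illustration, the standard set-based definitions can be computed as below; the helper name `precision_recall` and the page ids are hypothetical, chosen so that 2 of 40 retrieved pages are relevant, matching the 2/40 figure reported for the “私密飛行器” query in the slides.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical page ids: the engine returns 40 pages, of which 2 are relevant;
# 2 further relevant pages (ids 100, 101) were never retrieved.
retrieved_pages = range(40)
relevant_pages = {0, 1, 100, 101}

precision, recall = precision_recall(retrieved_pages, relevant_pages)
print(precision, recall)
```

The size of the relevant set is an assumption made only to make recall computable; the slides report precision figures only.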
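  3. Query Understanding: "Gets to the deeper meaning of the words you type", with features such as query understanding, spelling correction, and synonyms. In Google's case, even though its search algorithm advertises synonym handling, users who lack background knowledge or use Chinese vocabulary imprecisely may misuse Chinese synonyms, and they often get incorrect results or only a few results that match what they were looking for (http://www.google.com/intl/zh-Hant/insidesearch/howsearchworks/algorithms.html).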
  5. Relevance feedback and query expansion. In most collections, the same concept may be referred to using different words. This issue, known as synonymy, has an impact on the recall of most information retrieval systems. For example, you would want a search for aircraft to match plane (but only for references to an airplane, not a woodworking plane), and for a search on thermodynamics to match references to heat in appropriate discussions. Users often attempt to address this problem themselves by manually refining a query, as was discussed in Section 1.4; in this chapter we discuss ways in which a system can help with query refinement, either fully automatically or with the user in the loop.
     The methods for tackling this problem split into two major classes: global methods and local methods. Global methods are techniques for expanding or reformulating query terms independent of the query and results returned from it, so that changes in the query wording will cause the new query to match other semantically similar terms. Global methods include: query expansion/reformulation with a thesaurus or WordNet (Section 9.2.2); query expansion via automatic thesaurus generation (Section 9.2.3); techniques like spelling correction (discussed in Chapter 3).
     Local methods adjust a query relative to the documents that initially appear to match the query. The basic methods here are: relevance feedback (Section 9.1); pseudo relevance feedback, also known as blind relevance feedback (Section 9.1.6); (global) indirect relevance feedback (Section 9.1.7).
     Query refinement techniques such as query expansion, query suggestion, and relevance feedback improve ranking.
     A thesaurus provides information on synonyms and semantically related words and phrases. Example: physician; syn: croaker, doc, doctor, MD, medical, mediciner, medico, sawbones; rel: medic, general practitioner, surgeon,
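  6. In addition, Google orders its search results according to the PageRank concept developed by Larry Page, ranking pages by their number of incoming links, so the more popular a relevant page is, the higher it ranks. In actual tests of the imprecise-Chinese-keyword problem, when the query concerns a popular, frequently clicked topic, relevant pages may still appear in the first few result pages. For example, a user who wants “我們發財了” (a Taiwanese idol drama produced by 三立電視) but searches for “咱們發財了” may still find it. The system is therefore designed as a Google extension: when users are dissatisfied with Google's results because their keywords were imprecise, they can use the system to expand the query and improve overall efficiency.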
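  7. Limitation: if the user's query terms are not covered by the eHowNet or WordNet database, for example when colloquial words are used instead of written forms, such as “看不見” for “隱形”, “工夫” for “時間”, “嚇唬” for “恐嚇”, or “哪一天” for “甚麼日子”, an additional dictionary is needed to support them; this thesis does not discuss that case.
     Overall: ad-hoc search (find relevant documents for an arbitrary text query); coverage and freshness; query refinement techniques such as query expansion, query suggestion, and relevance feedback improve ranking.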
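  8. The Extended-HowNet Ontology (廣義知網知識本體) was built by the CKIP group (詞庫小組) of the Institute of Information Science, Academia Sinica, by revising the HowNet (知網) ontology of sememes and semantic roles. Its purpose is to build a lexical knowledge base that expresses the relations between concepts and between the attributes that concepts have; it is the foundational database of Extended-HowNet. The revisions to HowNet are as follows.
     Restructuring the ontology. The subtree structure of the HowNet ontology was modified, refined, and reorganized into a single complete ontology. The Extended-HowNet ontology has {all|全} as its root, with two subtrees below it: {entity|事物} and {relation|關係}. The three original HowNet subtrees {event|事件}, {entity|實體}, and {Attribute Value|屬性值} are mapped to {event|事件}, {object|物體}, and {Attribute Value|屬性值} under {entity|事物}. HowNet's {Secondary Feature|次要特徵} and {Proper Noun|專有名詞} are broken up and placed at suitable positions under {object|物體}; proper names that are country names, for example, go under {country|國家}. These are the substantive concepts. Because the core meaning of a word is composed of substantive concepts and relational concepts, a separate {relation|關係} concept tree stands alongside {entity|事物}; it contains {SemanticRole|語意角色}, which expresses relations between concepts, and {function|函數}, which expresses complex relations. HowNet's original {Attribute|屬性} is broken up and placed under the semantic roles.
     Distinguishing features from feature values. In the HowNet ontology some concepts are features and some are feature values. For example, {animate|生物} divides into {AnimalHuman|動物} and {plant|植物}; {domain|領域} divides into {industrial|工}, {agricultural|農}, and so on. Animals and plants are hyponyms of living things and are themselves features, so they can be subdivided further into hyponyms such as humans, beasts, flowers, and grass. By contrast, scholar, farmer, artisan, and merchant are feature values of domain and are not subdivided. The Extended-HowNet ontology prefixes every hyponym that is a feature value with a dollar sign ($) to mark the difference.
     Adding and revising sememes. To express meaning precisely, Extended-HowNet adds and revises the following sememes.
     Position sememes: HowNet expressed position by personification, using {head|頭} as the head of "mountain top" and {hand|手} for "handle". Extended-HowNet adds ten position sememes, {TopPart|頂端}, {CentrePart|中心}, {BodyPart|軀幹}, {skeleton|骨架}, {BasePart|底部}, {branch|分枝}, {grip|柄}, {passage|通道}, {EndPart|尾部}, and {surface|表層}, attached under the concept {PartPosition|部件位置}.
     Determiner sememes: to distinguish general from restricted uses of a concept, Extended-HowNet adds six determiner sememes, {nonreferential|無指}, {referential|有指}, {generic|通指}, {individual|專指}, {definite|定指}, and {indefinite|不定指}, attached under the "definiteness value" (定指值) of the attribute values. All proper names are annotated {definite|定指}; for example, "Canada" is defined as: def: {country|國家:name="Canada|加拿大",place={NorthAmerica|北美},quantifier={definite|定指}}.
     Time sememes: to express the relation between time and events in a concept, four sememes are added, {SpeakingTime|說話時間}, {ReferenceTime|相對時間}, {TimeNear|時間近}, and {TimeFar|時間遠}, attached under {TimeSection|時段} and {TimingValue|時間特性值} respectively. For example, "ancient books" (古籍) can be defined as: def: {publications|書刊:TimePoint=TimeBefore(SpeakingTime|說話時間),TimeFeature={TimeFar|時間遠}}.
     Pronoun sememes: HowNet expressed person with {1stPerson|我}, {2stPerson|你}, and {3stPerson|他}; Extended-HowNet replaces them with {speaker|說話者}, {listener|聽者}, and {3stPerson|他人}, which are more flexible in use. For example, "your father" (令尊) can be expressed as: def:{human|人=father(listener|聽者)}.
     Function-word sememes: HowNet used the sememe {FuncWord|功能詞} as the head of all function words and distinguished the meanings of function words, including modal words, conjunctions, and prepositions, with only a few roles such as comment and concession. This fails to show how function words link concepts and oversimplifies their meaning. Extended-HowNet expresses the relational meaning of function words; for example, the preposition 被 can be defined as def: Agent={}. For this purpose it also adds modal sememes such as {necessity|一定} and {possible|可能}, and roles with evaluative and connective functions such as epistemic, deontic, hypothesis, and avoidance. See {Extended Modality} under the event roles and {conjunction} under the functions.
     Adding to and revising the classification of semantic roles. Extended-HowNet reorganizes HowNet's semantic roles as shown in the figure. Roles shown with strikethrough are HowNet roles that were deleted; roles in blue are HowNet roles that were kept; everything else was added by Extended-HowNet. In addition, because a few semantic roles accept only a few sememes as values, those value sememes are marked with rectangular frames, while text in oval frames denotes roles.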
  9. The original query “computer”, for instance, may be expanded to include “client, server, website, etc.” In other words, with those expanded words together, the system could raise both the query precision and recall.
     In recent years, huge amount of information is posted on the Web and it continues to increase with an explosive speed. But we cannot access to the information or use it efficiently and effectively unless it is well organized and indexed. Many search engines have been created for this need in current years. Web users, however, usually submit only one single word as their queries on the Web [5], especially for Web image queries. It is even worse that the users’ query words may be quite different to the ones used in the documents in describing the same semantics. That means a gap exists between user’s query space and document representation space. This problem results in lower precisions and recalls of queries. The user may get an overwhelming but large percent of irrelevant documents in the result set. In fact, this is a tough problem in Web information retrieval. An effective method for solving the above problems is query expansion. In this paper, we provide a novel query expansion method based on the combination of WordNet [2], an online lexical system, and TSN, a term semantic network extracted from the collection. Our method has been employed in our Web image search system [4].
     3.1 Keyword Expansion. The query keyword used by users is the most significant but not always sufficient in the query phase. For example, if a user queries with “computer”, he only can get the object indexed by “computer”. We use WordNet and TSN to expand the query. With WordNet, we expand the query along three dimensions including hypernym, hyponymy and synonym relation [2]. The original query “computer”, for instance, may be expanded to include “client, server, website, etc.” In other words, with those expanded words together, the system could raise both the query precision and recall.
     To extract TSN from the collection, we use a popular association mining algorithm, Apriori [12], to mine out the association rules between words. Here, we only consider one-to-one term relationship. Two functions, confidence and support, are used in describing word relations. We define confidence (conf) and support (sup) of term association $t_i \rightarrow t_j$ as follows. Let
     $$D(t_i, t_j) = D(t_i) \cap D(t_j) \qquad (1)$$
     where $D(t_i)$ and $D(t_j)$ stand for the documents including term $t_i$ and $t_j$ respectively. Therefore, $D(t_i) \cap D(t_j)$ is the set of documents that include both $t_i$ and $t_j$. We define
     $$\mathrm{Conf}_{t_i \rightarrow t_j} = \frac{\| D(t_i, t_j) \|}{\| D(t_i) \|} \qquad (2)$$
     where $\| D(t_i, t_j) \|$ stands for the total number of documents that include both term $t_i$ and $t_j$, and $\| D(t_i) \|$ stands for the total number of documents that include $t_i$;
     $$\mathrm{Sup}_{t_i \rightarrow t_j} = \frac{\| D(t_i, t_j) \|}{D} \qquad (3)$$
     where $D$ stands for the number of documents in the database.
     Those relationships are extracted and represented with two matrixes; we could use them to expand the query keywords. For example, the keyword “computer” has the highest confidence and support with the words “desktop, series, price, driver…etc” which are not described in WordNet but can be used to expand the original query.
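The confidence and support definitions quoted in this note (numerator: documents containing both terms; denominators: documents containing the first term, and all documents, respectively) can be illustrated with a small Python sketch. The function names and the four-document toy collection are assumptions made for illustration; they are not taken from the cited paper or from the author's system.

```python
def doc_set(term, docs):
    """Indices of the documents whose word set contains `term`."""
    return {i for i, words in enumerate(docs) if term in words}


def confidence(ti, tj, docs):
    """Documents containing both ti and tj, divided by documents containing ti."""
    d_i = doc_set(ti, docs)
    if not d_i:
        return 0.0
    return len(d_i & doc_set(tj, docs)) / len(d_i)


def support(ti, tj, docs):
    """Documents containing both ti and tj, divided by the collection size."""
    if not docs:
        return 0.0
    return len(doc_set(ti, docs) & doc_set(tj, docs)) / len(docs)


# Toy collection: each document is reduced to a set of terms (hypothetical data).
docs = [
    {"computer", "desktop", "price"},
    {"computer", "driver"},
    {"computer", "desktop"},
    {"desktop", "series"},
]

print(confidence("computer", "desktop", docs))  # 2 of the 3 "computer" docs
print(support("computer", "desktop", docs))     # 2 of the 4 docs overall
```

Terms whose confidence and support with the query keyword exceed chosen thresholds would be candidates for expansion; the thresholds themselves are not given in the note.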
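  11. The main purpose of association rules is to find product items in transactions that are likely to be related. The most representative association rule method is the Apriori algorithm proposed by Agrawal et al. in 1994 (Agrawal et al. 1994). Apriori works in two steps: the first finds all frequent itemsets that satisfy the minimum support; the second finds the rules that satisfy the minimum confidence, that is, it derives all association rules from the frequent itemsets found in the first step. Apriori scans the database repeatedly to find all frequent itemsets until no new candidate itemsets can be generated.
     黃仁鵬 and 林廷鴻 (2011) proposed a modified Apriori algorithm for web mining that uses two kinds of support, "keyword" and "website", to determine which websites are supported by the key information of a frequent itemset and which key information is supported by the websites of a frequent itemset; it is an Apriori algorithm with both keyword support and web page support. Figure 2 shows an example of the modified Apriori algorithm. In this way, pages the searcher does not need can be filtered out, the relatedness of each page's content can be analyzed, and pages likely to contain the information the searcher needs can be found.
     Example segmented snippets: “隱形 戰機 维基百科 自由 百科全书 维基百科”; “隱形 飛機 是 降低 飛機 電 光 聲 可 探測 特徵 使 雷達 探測器”; “美 部署 隱形 戰機 對抗 中國”.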
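The two Apriori steps described in this note can be sketched in Python as follows. This is a minimal sketch of the classic algorithm only, not the modified two-support variant of 黃仁鵬 and 林廷鴻 and not the author's implementation; the function names and the thresholds (0.6 and 0.9) are assumptions. The three transactions are the segmented snippets quoted in the note, each treated as a set of words.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Step 1: return {itemset: support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    current = {frozenset([item]) for t in transactions for item in t}
    frequent, k = {}, 1
    while current:
        # One scan of the database per level to count candidate occurrences.
        level = {}
        for cand in current:
            sup = sum(1 for t in transactions if cand <= t) / n
            if sup >= min_support:
                level[cand] = sup
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets, then prune candidates that have an
        # infrequent (k-1)-subset.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        current = {c for c in joined
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent


def association_rules(frequent, min_confidence):
    """Step 2: derive rules lhs -> rhs from the frequent itemsets."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[lhs]
                if conf >= min_confidence:
                    rules.append((lhs, itemset - lhs, conf))
    return rules


snippets = [
    "隱形 戰機 维基百科 自由 百科全书 维基百科".split(),
    "隱形 飛機 是 降低 飛機 電 光 聲 可 探測 特徵 使 雷達 探測器".split(),
    "美 部署 隱形 戰機 對抗 中國".split(),
]
frequent = apriori(snippets, min_support=0.6)
rules = association_rules(frequent, min_confidence=0.9)
for lhs, rhs, conf in rules:
    print(set(lhs), "->", set(rhs), conf)
```

With these thresholds only 隱形, 戰機, and the pair of the two survive as frequent itemsets, and the single rule 戰機 → 隱形 passes the confidence cut. Words that never reach the minimum support are the kind of noise words a filter of this type would discard.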
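  12. The experiments show that, for the possibly misused query terms above, the system's final suggestions all included the topic the user intended to search for.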
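  13. The experiments also showed, however, that when the expanded synonym combinations themselves cover many valid topics, as with the keyword 威力彩 (a Taiwanese lottery), noise-word filtering becomes less effective, and the final retrieval precision may differ little from that of the unexpanded query.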