These are the slides from our 3-hour tutorial at VLDB 2014.
Yunyao Li, Ziyang Liu, Huaiyu Zhu: Enterprise Search in the Big Data Era: Recent Developments and Open Challenges. PVLDB 7(13): 1717-1718 (2014)
Abstract:
Enterprise search allows users in an enterprise to retrieve desired information through a simple search interface. It is widely viewed as an important productivity tool within an enterprise. While Internet search engines have been highly successful, enterprise search remains notoriously challenging due to a variety of unique challenges, and is being made more so by the increasing heterogeneity and volume of enterprise data. On the other hand, enterprise search also presents opportunities to succeed in ways beyond current Internet search capabilities. This tutorial presents an organized overview of these challenges and opportunities, and reviews the state-of-the-art techniques for building a reliable and high-quality enterprise search engine, in the context of the rise of big data.
Enterprise Search in the Big Data Era: Recent Developments and Open Challenges
1. Enterprise Search in the
Big Data Era
Yunyao Li (IBM Research – Almaden)
Ziyang Liu (NEC Labs)
Huaiyu Zhu (IBM Research – Almaden)
2. 1
Enterprise Search
• Providing intuitive access to an organization's various digital content
• IDC report [IDC 05]:
– $5k/person/year in salary wasted due to poor search
– 9–10 hr/person/week spent searching
– unsuccessful 1/3–1/2 of the time
• Butler Group [Edwards 06]:
– 10% of salary cost wasted through ineffective search
• Accenture survey [Accenture 07]:
– middle managers spend 2 hr/day searching
– >50% of what they found had no value
• Hawking, Enterprise Search, http://david-hawking.net/pubs/ModernIR2_Hawking_chapter.pdf
• [IDC 05] "The enterprise workplace: How it will change the way we work". IDC Report 32919
• [Edwards 06] www.butlergroup.com/pdf/PressReleases/ESRReportPressRelease.pdf
• [Accenture 07] http://newsroom.accenture.com/article_display.cfm?article_id=4484
3. 2
Search from User's Point of View
(figure: the user types a query, "magic" happens, and a ranked list of results comes back)
INTRODUCTION SEARCH
4. 3
What Happens Behind the Scenes
Backend
Collect data
Analyze data
Index data
Frontend
Serve user queries
Return results
Index
Data
Source
INTRODUCTION SEARCH
5. 4
How Does a Query Match a Document?
(figure: backend — analyze document, build index; frontend — analyze query, search index, present results)
INTRODUCTION SEARCH
6. 5
Search Is More Than Keyword Match
• Specific features in documents are important
– Title, URL, person name, product, actions, …
• Features combine to form higher-level concepts
– In document: home page + person → personal homepage
– Cross-document: URL link analysis, …
• The string representation in a document may not match that in the user query
– Person name: Bill Clinton → William Jefferson Clinton
• User queries may be ambiguous
– Multiple interpretations
• Presenting the results to the user
– Ranking, grouping, interactive refinement
INTRODUCTION SEARCH
7. 6
Internet vs Enterprise – Web data [Fagin WWW2003]

Creation of content
• Internet: democratic; appealing to the reader; links ≈ approval
• Enterprise: bureaucratic; conforms to mandate; links ≈ internal structure

Relevant query results
• Internet: large number; overlapping information; a reasonable subset suffices; ranking is more universal
• Enterprise: small number; specific function; specific pages required; ranking is relative to the query

Spamming
• Internet: spam-infested; ranking can only be based on external authority
• Enterprise: mostly spam-free; ranking based on content or metadata is reliable

Search engine friendliness
• Internet: web pages designed to be search results; web page = document
• Enterprise: documents not designed to be search results; need special treatment
INTRODUCTION ENTERPRISE VS INTERNET
8. 7
Internet vs Enterprise – Big Data

Content being searched
• Internet: sources: web crawl; formats: html, xml, pdf, …
• Enterprise: variety of sources; variety of formats: email, database, application-specific access and formats

Search queries / expected results
• Internet: target: web pages, office documents; expect a list of documents; expect little personalization; return results directly
• Enterprise: target: rows, figures, experts, …; expect customized results; personalization required: geography, access, …; customize results

Related information
• Internet: link ≈ approval; small number of domain-specific knowledge sources; generic analysis
• Enterprise: link ≈ organization structure; large number of dynamic domain-specific knowledge sources; highly specialized analysis

Skill set of search admins
• Internet: large number of admins; search experts; facilitate updates of search algorithms
• Enterprise: small number of admins; domain experts; facilitate use of domain knowledge
INTRODUCTION ENTERPRISE VS INTERNET
9. 8
Search Engine Components
Backend
Collect data
Analyze data
Store and index data
Admin
System performance
Search quality control/improvement
Frontend
Interpret user query
Search index
Present results
Interact with user
index
Data
source
INTRODUCTION TUTORIAL OVERVIEW COMPONENTS
10. 9
Search Engine Architecture
Backend
Collect data
Analyze data
Store and index data
Admin
System performance
Search quality control
Frontend
Interpret user query
Search index
Present results
Interact with user
index
Data
source
11. 10
Main Backend Functions
Analysis (Understand)
Information extraction
Analyze and transform data
Indexing (Prepare for search)
Generate terms suitable for match queries
Index search terms
index
Document Ingestion (Collect)
Collect all the data to be searched
Transform and store as documents
Local Analysis
(in-document analysis)
Global Analysis
(cross-document analysis)
13. 12
Typical analytics pipeline
(figure: documents → Data ingestion (DI) → Local analysis (LA, per-document feature sets S1={f11, f12, …}, S2={f21, f22, …}, S3={f31, f32, …}) → Global analysis (GA, global groups G1={g1, …}, G2={g2, g3, …}) → Indexing (Idx) → index)

Data ingestion
• Collect data
• Transform to uniform document format
• Store in document store

Local analysis
• Information extraction from each document

Global analysis
• Cross-document analysis
• Rank, group, merge, and filter documents

Indexing
• Generate search terms
• Index documents by search terms
BACKEND OVERVIEW
14. 13
Digression: Classical IR
(same pipeline, instantiated for classical IR)

Data ingestion
• Given set of files

Local analysis
• Tokenize
• Stop-word removal
• Stemming
• Form n-grams

Global analysis
• Calculate statistics of terms in documents

Indexing
• Generate search terms
• Index by terms, with statistics
BACKEND OVERVIEW
15. 14
Digression: Classical Web search
(same pipeline, instantiated for classical Web search)

Data ingestion
• Crawl web pages

Local analysis
• Extract outgoing links

Global analysis
• Calculate eigenvalues of the connection matrix

Indexing
• Generate search terms
• Index documents by search terms, with PageRank
BACKEND OVERVIEW
16. 15
Demands of Enterprise Search
(same pipeline: what enterprise search demands at each stage)

Data ingestion
• Handle a variety of sources
• Handle a variety of formats
• Deal with access policy
• Deal with update policy

Local analysis
• Incorporate domain knowledge
• Extract a rich set of semantics
• Categorize documents

Global analysis
• Cross-document analysis
• Rank, group, merge, and filter documents

Indexing
• Generate search terms
• Index documents by search terms
BACKEND OVERVIEW
17. 16
• Efficient incremental updates
– Fast turnaround time for updates
• System performance and reliability
– Scaling with data size and available resources
– Fault tolerance
• Ease of administration and quality improvement
– Allow search admins to customize domain-specific configurations
BACKEND OVERVIEW CHALLENGES / OPPORTUNITIES
Desiderata of backend
19. 18
Data Ingestion
BACKEND DATA INGESTION
(figure: web / DB / app sources → crawl or push → convert to document → convert to text → document store. Example: an email with a PDF attachment becomes Docid 0001 — fields From, To, Date, Attch: file.pdf — plus Docid 0002 for the attachment, with fields Title and Date. Must handle a variety of sources and support update & retention policies.)
20. 19
Document-centric View
• Data as a collection of documents
– Document as the unit of storage and search result
– Three major components:
• Unique document identifier across the whole system
• Metadata fields: URL, date, language, …
• Content field: text to be searched
• Representation of data of different structures
– Web pages → each page is a document
– Relational data → each row is a document
– Hierarchical data → each node is a document
BACKEND DATA INGESTION
21. 20
Push vs Pull
Definition
• Pull: the search engine initiates the transfer of data (web crawler)
• Push: the content owner initiates the transfer of data (apps with push notification)

Advantage
• Pull: operated by the search engine; uses standard crawlers
• Push: can handle special access methods; easy to adjust refresh rate; easy to handle special formats

Disadvantage
• Pull: difficult to access special data sources; difficult to adjust domain-specific treatment
• Push: needs synchronization with the content owner

Applicability
• Pull: prevalent for Internet; also useful for enterprise
• Push: rare for Internet; very important for enterprise
BACKEND DATA INGESTION
22. 21
Transform the Data
• Format conversion
– Convert content to text: pdf, doc, …
– Keep as much structure as possible
• Metadata conversion
– Obtain and transform metadata: HTTP headers, DB table metadata, …
• Merge/split documents
– One-to-many: zip file, email thread, attachments
– Many-to-one: social tags merged into the original doc
BACKEND DATA INGESTION
23. 22
Storage options
• SQL database
– Pro: traditional RDBMS strengths; supports insert, update, delete, fielded query
– Con: too much system overhead
• Indexing engine (e.g., Lucene)
– Pro: closer to the document-centric view; supports insert, delete, fielded query
– Con: no direct in-document update; needs special treatment for distributed processing
• NoSQL databases
– Pro: lightweight; sufficient for simple use
– Con: may lack features in the future; transactions?
BACKEND DATA INGESTION
Issues to consider
• In-document update
• Access/retention policy
• Parallel processing
25. 24
Local Analysis
• Annotating pages
– Extract structured elements: title, header, …
– Extract features for people, projects, communities, …
– Extract features for cross-document analysis
• Categorizing pages
– Label by standard categories
• Language, geography, date, …
– Label pages by custom categories
• IBM examples: HR, person, IT help, ISSI, sales information, marketing, corporate standards, legal & IP law, …
Local analysis is essentially information extraction.
BACKEND LOCAL ANALYSIS
26. 25
Rule-based vs. Learning-based IE

Rule-based IE
• Pro: declarative; easy to comprehend; easy to maintain; easy to incorporate domain knowledge; easy to debug
• Con: heuristic; requires tedious manual labor

ML-based IE
• Pro: trainable; adaptable; reduces manual effort
• Con: requires labeled data; requires retraining for domain adaptation; requires ML expertise to use or maintain; opaque (not transparent)

BACKEND LOCAL ANALYSIS INFORMATION EXTRACTION
27. 26
Landscape of Entity Extraction Implementations

NLP papers (2003–2012): 75% machine-learning-based, 21% hybrid, 3.5% rule-based.
Commercial vendors (2013): large vendors — 45% rule-based, 22% hybrid, 33% ML-based; all vendors — 67% rule-based, 17% hybrid, 17% ML-based.

Example industrial systems:
• GATE Information Extraction
• IBM InfoSphere BigInsights
• Microsoft FAST
• SAP HANA
• SAS Text Analytics
• HP Autonomy
• Attensity
• Clarabridge

Source: [CLR2013] Rule-based Information Extraction is Dead! Long Live Rule-based Information Extraction Systems!, EMNLP 2013
BACKEND LOCAL ANALYSIS INFORMATION EXTRACTION
28. 27
Local analysis for different features [Zhu et al., WWW'07]

(figure: an intranet page feeds several extractors)
• NavPanel extraction: self-link identification → NavPanels
• Title extraction: matching title patterns → titles (e.g., "IBM Global Services Security Home" → "IBM Global Services Security")
• Person name in title: title extraction + dictionary match against a person-name dictionary (= employee directory) → title name (e.g., "G J Chaitin Home Page" → "G J Chaitin")
• URL extraction: matching URL patterns → URL names (e.g., http://w3-03.ibm.com/marketing/ → marketing; http://w3-03.ibm.com/isc/index.html → isc; http://chis.at.ibm.com/ → chis)

BACKEND LOCAL ANALYSIS EXAMPLES
29. 28
Consolidation
– Example: document language consolidation
• HTTP header: Accept-Language: en-us,en;q=0.5
• Meta tags: <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
• Document text encoding
• URL: http://enterprise.com/hr/benefits/us/ca/
BACKEND LOCAL ANALYSIS TRANSFORMATIONS
31. 30
Global Analysis
• Deduplication
– Save resources, reduce result clutter
• Identify roots of URL hierarchies
– Used for result grouping and ranking
• Anchor text analysis
– Assign external labels to documents
• Social tagging analysis
– Assign tags and their weights to documents
• Identify different versions of the same document
– Due to variations in date, language, …
• Enterprise-specific global analysis
– When certain documents co-exist, do this …
• …
BACKEND GLOBAL ANALYSIS
32. 31
Shingle-based deduplication (Leskovec, http://www.mmds.org/)

(figure: documents → shingle sets S1={s1, s2, …}, S2={s1, s3, …}, S3={s2, s3, …} → minhash signatures {h1(Si), h2(Si), …})

Shingles:
• Character or token n-grams
• Possibly stemmed
• Possibly related to stop words

Minhash:
• Maps sets to integers
• Based on permutations of the universal set

Jaccard similarity: |A∩B| / |A∪B|

Theorem: the probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets.

Works for a more diverse set of documents; more precise. (A small sketch follows below.)
BACKEND GLOBAL ANALYSIS DEDUPLICATION
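To make the shingle-and-minhash pipeline concrete, here is a minimal Python sketch (not from the tutorial; the shingle length, number of hash functions, and 0.8 near-duplicate threshold are illustrative assumptions):

import hashlib
import random

def shingles(text, n=4):
    # Character n-grams as the shingle set of a document.
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(shingle_set, num_hashes=64, seed=42):
    # One minimum per salted hash function approximates one random
    # permutation of the universal set of shingles.
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(int.from_bytes(hashlib.md5((str(salt) + s).encode())
                               .digest()[:8], "big")
                for s in shingle_set)
            for salt in salts]

def estimated_jaccard(sig_a, sig_b):
    # Fraction of agreeing minhash values estimates |A∩B| / |A∪B|.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = "reimbursement of travel expenses policy"
doc2 = "reimbursement of travel expense policies"
s1 = minhash_signature(shingles(doc1))
s2 = minhash_signature(shingles(doc2))
if estimated_jaccard(s1, s2) > 0.8:   # threshold is an assumption
    print("near-duplicates")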
33. 32
Metadata-based deduplication (IBM Gumshoe search engine)

(figure: documents → metadata signatures S1=[h11, h12, …], S2=[h21, h22, …], S3=[h31, h32, …] → candidate groups G1={S1, …}, G2={S2, S3, …})

Significant metadata:
• Document title
• Section headers
• Signatures from the URL
Ensure that all similar candidates have the same signature.

Group by signature, then run the in-group similarity analysis: detailed analysis of documents within each candidate group. (A small sketch follows below.)

More customizable for intranets; lower cost.
BACKEND GLOBAL ANALYSIS DEDUPLICATION
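A minimal sketch of the signature-then-group step, assuming document title and section headers are the significant metadata (the field choice and normalization are illustrative, not Gumshoe's actual logic):

from collections import defaultdict

def signature(doc):
    # Signature from "significant metadata": all near-duplicates must
    # end up with the same signature, so normalize aggressively.
    return (doc.get("title", "").lower().strip(),
            tuple(h.lower() for h in doc.get("headers", [])))

def candidate_groups(docs):
    groups = defaultdict(list)
    for d in docs:
        groups[signature(d)].append(d)
    # Only groups with more than one member need the (expensive)
    # in-group similarity analysis.
    return [g for g in groups.values() if len(g) > 1]

docs = [{"title": "Per Diem", "headers": ["Rates"]},
        {"title": "per diem ", "headers": ["rates"]},
        {"title": "Travel", "headers": []}]
print(len(candidate_groups(docs)))   # 1 candidate group of 2 docs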
34. 33
URL Root Analysis (Zhu et al., WWW'07)

(figure: a URL forest; host1/b/a and host1/b/c are roots, with descendants such as host1/b/a/~user1/, host1/b/a/~user1/pub, host1/b/a/x_index.htm, host1/b/c/d, host1/b/c/home.html, and host1/b/c/d/e/index.html with its ?a=us / ?a=uk variants)

• Given a set of documents, all with the same value V of feature X
– E.g., at one time all webpages from the IBM Tucson site had the same title
• Find the roots of the URL forest; these are the preferred results for the query X=V (a small sketch follows below)
– E.g., when searching for "Tucson home page", only the IBM Tucson homepage will match
BACKEND GLOBAL ANALYSIS ROOT ANALYSIS
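A small sketch of the root-finding step (illustrative assumptions: query strings and index/home pages are normalized away, and a root is a URL that no other URL in the group strictly path-prefixes):

def url_roots(urls):
    def norm(u):
        # Drop query strings and trailing index pages / slashes.
        u = u.split("?")[0]
        for suffix in ("/index.html", "/home.html", "/"):
            if u.endswith(suffix):
                u = u[:-len(suffix)]
        return u

    paths = {norm(u) for u in urls}
    # A root is a path that no other path in the set strictly contains.
    return {p for p in paths
            if not any(p != q and p.startswith(q + "/") for q in paths)}

group = ["host1/b/a/~user1/pub", "host1/b/a", "host1/b/c/d",
         "host1/b/c", "host1/b/c/d/e/index.html?a=us"]
print(sorted(url_roots(group)))   # ['host1/b/a', 'host1/b/c']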
35. 34
Label Assignment (Zhu et al., WWW'07)
BACKEND GLOBAL ANALYSIS LABEL ASSIGNMENT

(figure: documents A1 and A2 contain anchor text "X home" linking to document B; bookmarks C1 "X home", C2 "X", and C3 "Y home" also point to B)

Anchor text global analysis: assign label "X" and/or "Y" to B based on frequency.
Social tagging global analysis: assign labels "X home", "X", and "Y home" based on frequency.
36. 35
Entity Integration using HIL [Hernández et al., EDBT'13]

(figure: unstructured data → information extraction (declarative IE, IBM SystemT [Chiticariu et al., ACL 2010]) → raw records → entity resolution → map / fuse / aggregate → unified entities)

• Entity population rules
– Create entities (from raw records, other entities, and links)
– Clean, normalize, aggregate, fuse
• Entity resolution rules
– Create links between raw records or entities
• HIL defines entity types (the logical data model of the integration flow) and (SQL-like) rules to specify the integration logic
• Optimizing compiler to a Big Data runtime (Jaql and Hadoop)
BACKEND GLOBAL ANALYSIS ENTITY INTEGRATION
38. 37
Indexing
• Generate and index search terms, to be matched by terms generated at runtime from user queries
• Challenges:
– Extracted terms do not match user query terms
• Morphological changes, synonyms, …
– The importance of a term depends on the query
• Need for bucketing of indexes, …
– Support for incremental indexing
BACKEND INDEXING
39. 38
Term normalization
• Example: date/time normalization (a small sketch follows below)
– Given any of these:
Wed Aug 27 10:06:11 PDT 2014
27 Aug 2014, 10:06:11
2014-08-27T10:06:11-07:00
27 Aug 2014
1409133971
– Normalize to 2014-08-27T10:06:11-07:00
– Other examples: person names, product names, …
BACKEND INDEXING TERM NORMALIZATION
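A sketch of such normalization in Python (the format list and the epoch-seconds heuristic are illustrative assumptions, not the tutorial's implementation):

from datetime import datetime, timezone, timedelta

FORMATS = [
    "%a %b %d %H:%M:%S PDT %Y",   # Wed Aug 27 10:06:11 PDT 2014
    "%d %b %Y, %H:%M:%S",         # 27 Aug 2014, 10:06:11
    "%Y-%m-%dT%H:%M:%S%z",        # 2014-08-27T10:06:11-07:00
    "%d %b %Y",                   # 27 Aug 2014
]
PDT = timezone(timedelta(hours=-7))

def normalize_datetime(raw):
    raw = raw.strip()
    if raw.isdigit():             # Unix epoch seconds, e.g. 1409133971
        return datetime.fromtimestamp(int(raw), PDT).isoformat()
    for fmt in FORMATS:
        try:
            dt = datetime.strptime(raw, fmt)
            if dt.tzinfo is None:  # assume PDT when unspecified
                dt = dt.replace(tzinfo=PDT)
            return dt.isoformat()
        except ValueError:
            pass
    return None                    # leave unrecognized terms alone

print(normalize_datetime("27 Aug 2014, 10:06:11"))
# -> 2014-08-27T10:06:11-07:00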
40. 39
Why Generate Variant Terms?
• Extracted feature string ≠ query string
– People names
• Document: John Doe; Search: Doe, John; Search: J Doe
– Acronym expansions
• gts → Global Technology Services
– N-gram variant generation
• Title: reimbursement of travel expenses
• Terms: reimbursement, travel expenses, reimbursement travel, reimbursement of travel, reimbursement expenses
• Normalization alone is not a sufficient solution
– People names
• Document: John Doe → J. Doe; Search: Jean Doe → J. Doe
• These are not supposed to match
• Solution:
– Generate variant terms with different levels of approximation
BACKEND INDEXING VARIANT TERM GENERATION
41. 40
Configurable Term Generation
• Configuration knobs determine the set of outputs (a toy sketch follows below)
• Given "Mr. John (Jack) M. Doe Jr."
– Configuration 1:
Initial: both, Dot: with, NickName: both, MiddleName: both, NameSuffix: without, Title: without, Comma: both
→ John M. Doe; Doe, John M.; John Doe; Doe, John; J. M. Doe; Doe, J. M.; J. Doe; Doe, J.; Jack M. Doe; Doe, Jack M.; Jack Doe; Doe, Jack
– Configuration 2 (normalization):
Initial: without, Dot: without, NickName: without, MiddleName: without, NameSuffix: without, Title: without, Comma: without
→ John Doe
BACKEND INDEXING VARIANT TERM GENERATION
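A toy Python version of knob-driven variant generation (the knobs modeled here — initials, nickname, middle name, comma form — are only a subset of the configuration above):

import itertools

def name_variants(first, last, middle=None, nickname=None,
                  initials=True, comma=True):
    firsts = {first}
    if nickname:
        firsts.add(nickname)                          # NickName: both
    if initials:
        firsts |= {f[0] + "." for f in set(firsts)}   # Initial: both
    middles = {middle, middle[0] + ".", None} if middle else {None}
    variants = set()
    for f, m in itertools.product(firsts, middles):
        given = f if m is None else f + " " + m
        variants.add(given + " " + last)          # John M. Doe
        if comma:
            variants.add(last + ", " + given)     # Doe, John M.
    return variants

for v in sorted(name_variants("John", "Doe", middle="M.",
                              nickname="Jack")):
    print(v)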
43. 42
Search Engine Architecture
Backend
Collect data
Analyze data
Store and index data
Admin
System performance
Search quality control
Frontend
Interpret user query
Search index
Present results
Interact with user
index
Data
source
44. Serving User Queries at Front End (52)
1. Ambiguity (29)
2. Ranking (3)
3. Representation (6)
4. Expert Search (6)
5. Privacy (8)
45. 44
1. Ambiguity
• Optimal keywords may not be used:
– Misspelled → query cleaning, query autocompletion
• "datbase"
– Under-specified → query refinement
• polysemy: "java"
• too general: "database papers"
– Over-specified → query rewriting
• synonyms, acronyms, abbreviations & alternative names: "green card" ≡ "permanent residency"
• too specific: "MS Office 2007 for Mac x64 edition"
– Non-quantitative → query rewriting
• "small laptop"
46. 45
Summary of Solutions
• query cleaning
– correct various types of spelling errors
• query autocompletion
– prevent spelling errors
• query refinement
– make queries more specific, returning fewer results
• query rewriting
– make queries more general / on-topic, returning more relevant results
• query forms
– enable users to specify precise queries
FRONTEND AMBIGUITY
47. 46
Graph-based Spelling Correction (Bao, ACL 11)
• Repartition the query
– Each partition (token) should be plausible: confidence(correcting it) > threshold
– Confidence: a linear combination of multiple scores, with parameters learned by an SVM
• Domain knowledge is often used in calculating confidence
• For each partition, generate candidate corrections with high scores
Example: "enterpricsea rch" → candidate partitions "enterpricse arch", "enterpric search", "enter pric search", etc.; for the token "pric", candidate corrections: price (0.8), prim (0.6), etc.
QUERY CLEANING UNSTRUCTURED DATA FRONTEND AMBIGUITY
48. 47
Graph-based Spelling Correction (Bao, ACL 11)
• Build a graph that connects candidate corrections
• Each full path is a candidate query
– Find the k top-weighted full paths
• Weights: 1. correction score (node weight); 2. merge penalty (node weight); 3. split penalty (edge weight)
Example for "enterpricsea rch": nodes include enterprise, enter, price, prim, sea, arc, rich, search; full paths include "enterprise → search" and "enter → price → sea → rich"
QUERY CLEANING UNSTRUCTURED DATA FRONTEND AMBIGUITY
49. 48
Graph-based Spelling Correction (Bao, ACL 11)
• Path weights alone don't consider term correlations
• So calculate a score for each path that includes term correlations (a small sketch follows below)
– This ensures the cleaned query has good-quality results
– Correlations are computed from co-occurrence counts
• Finally, return the paths with high scores
e.g., correlation("enterprise search") > correlation("enterprise arc"), so "enterprise search" wins over "enterprise arc"
QUERY CLEANING UNSTRUCTURED DATA FRONTEND AMBIGUITY
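A minimal sketch of this final scoring step, with a hand-built candidate graph and toy correlation numbers standing in for statistics learned from query logs and co-occurrence counts:

# Candidate paths for "enterpricsea rch"; the second element is the
# product of the per-node correction scores (toy numbers).
paths = [
    (["enterprise", "search"], 0.9 * 0.8),
    (["enterprise", "arc"], 0.9 * 0.5),
    (["enter", "price", "sea", "rich"], 0.7 * 0.8 * 0.6 * 0.6),
]

# Co-occurrence-based correlation of adjacent terms (toy numbers).
correlation = {("enterprise", "search"): 0.9,
               ("enterprise", "arc"): 0.1}

def path_score(terms, weight):
    corr = 1.0
    for a, b in zip(terms, terms[1:]):
        corr *= correlation.get((a, b), 0.2)  # default for unseen pairs
    return weight * corr

best = max(paths, key=lambda p: path_score(*p))
print(" ".join(best[0]))   # enterprise search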
50. 49
XClean (Lu, ICDE 11)
– Based on the noisy channel model: find the intended word given the user's input word
– Results on XML are subtrees rooted at entity nodes
• A result-quality score is calculated for each entity node and then aggregated
• E.g., if Johnny and Mike work in the same department, then "Johnn, Mike" → "Johnny, Mike" rather than "John, Mike"
– Processes each word individually, i.e., no merge or split
(figure: XML subtree — department → head: Johnny; department → employees → …)
Related: query cleaning on relational data — Pu, VLDB 08
QUERY CLEANING STRUCTURED DATA FRONTEND AMBIGUITY
51. 50
Summary of Solutions
• query cleaning
– correct various types of spelling errors
• query autocompletion
– prevent spelling errors
• query refinement
– make queries more specific, returning fewer results
• query rewriting
– make queries more general / on-topic, returning more relevant results
• query forms
– enable users to specify precise queries
FRONTEND AMBIGUITY
52. 51
Query Autocompletion
Problem Space Dimensions
showing keywords
vs.
showing results
single keyword
vs.
multiple keyword
exact matching
vs.
fuzzy matching
QUERY AUTOCOMPLETION FRONTEND AMBIGUITY
53. 52
Problem Space Dimensions
showing keywords
vs.
showing results
single keyword
vs.
multiple keyword
exact matching
vs.
fuzzy matching
Error-Tolerating Autocompletion (Chaudhuri, SIGMOD 09)
"desr" → desert, dessert, deserve
QUERY AUTOCOMPLETION FRONTEND AMBIGUITY
54. 53
Error-Tolerating Autocompletion (Chaudhuri, SIGMOD 09)
(figure: a trie over the data "search", "sand", and "text"; with max. edit distance 1, the set of active trie nodes is shown after each keystroke — no input, "s", "se", "sen")
Showing results instead of keywords can be achieved by associating inverted lists with trie nodes. (A simplified sketch follows below.)
QUERY AUTOCOMPLETION FRONTEND AMBIGUITY
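A simplified sketch of keystroke-by-keystroke fuzzy matching over a trie (it handles substitutions and skipped query characters only; a full implementation also tracks insertions into the data words):

class TrieNode:
    def __init__(self):
        self.children = {}
        self.words = []        # completions reachable from this node

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        node.words.append(w)
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
            node.words.append(w)
    return root

def fuzzy_complete(root, query, max_edits=1):
    # Active set of (trie node, edits spent), extended one keystroke
    # at a time, as in the figure above.
    states = [(root, 0)]
    for ch in query:
        next_states = []
        for node, e in states:
            for c, child in node.children.items():
                cost = 0 if c == ch else 1   # match or substitution
                if e + cost <= max_edits:
                    next_states.append((child, e + cost))
            if e + 1 <= max_edits:
                next_states.append((node, e + 1))   # skip this char
        states = next_states
    return {w for node, _ in states for w in node.words}

root = build_trie(["search", "sand", "text"])
print(sorted(fuzzy_complete(root, "sen")))   # ['sand', 'search']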
55. 54
Tastier (Li, VLDBJ 11)
Problem Space Dimensions
showing keywords
vs.
showing results
single keyword
vs.
multiple keyword
exact matching
vs.
fuzzy matching
"have a nni" → show results for "have a nice day"
QUERY AUTOCOMPLETION FRONTEND AMBIGUITY
56. 55
Tastier (Li, VLDBJ 11)
• Trie-based (similar to the previous paper)
– Trie leaf nodes are associated with inverted lists
• To handle multiple keywords:
– Each record/document is associated with a sorted list of the words in it (its forward list)
• A binary search can then determine whether a string appears in the record/document as a prefix (see the sketch below)
• Why not hash? Because we need to match prefixes, not whole words
• Inverted-list intersections are computed incrementally, using a cache, for improved efficiency
Example forward list: "have a nice day" → "a, day, have, nice"
QUERY AUTOCOMPLETION FRONTEND AMBIGUITY
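A sketch of the forward-list prefix check — the reason sorted lists beat hashing here (illustrative code, not Tastier's):

import bisect

# Forward list: the sorted words of one document (as in the example).
forward = ["a", "day", "have", "nice"]

def contains_prefix(forward_list, prefix):
    # If any word starts with `prefix`, the first such word sits at
    # the insertion point of `prefix` itself.
    i = bisect.bisect_left(forward_list, prefix)
    return i < len(forward_list) and forward_list[i].startswith(prefix)

print(contains_prefix(forward, "ni"))   # True  ("nice")
print(contains_prefix(forward, "no"))   # False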
57. 56
Phrase Prediction (Nandi, VLDB 07)
Problem Space Dimensions
showing keywords
vs.
showing results
single keyword
vs.
multiple keyword
exact matching
vs.
fuzzy matching
QUERY AUTOCOMPLETION FRONTEND AMBIGUITY
"a nice" → "have a nice day"
58. 57
Phrase Prediction (Nandi, VLDB 07)
• Suggest phrases given the user's input phrase
– Need to find a good length for a suggested phrase
• Too short: utility is small
• Too long: low chance of being accepted
• (Modified) suffix-tree based (a toy sketch follows below)
– Each node is a word, rather than a letter
– Why not a trie: phrases have no definitive starting point; a phrase may start in the middle of a sentence (i.e., at a suffix of the sentence) — hence a suffix tree
• Significant phrases (e.g., "laptop", "have a nice day")
QUERY AUTOCOMPLETION FRONTEND AMBIGUITY
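A toy word-level rendering of the idea (a real suffix tree shares structure instead of materializing every suffix, and ranks by a significance measure rather than raw frequency):

from collections import defaultdict

def build_phrase_index(sentences):
    # Index every word-level suffix: a phrase may start mid-sentence,
    # so every suffix is a potential completion start.
    index = defaultdict(list)
    for s in sentences:
        words = s.split()
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                index[" ".join(words[i:j])].append(" ".join(words[i:]))
    return index

def suggest(index, typed, max_len=6):
    # Cap completion length: too short has little utility, too long
    # has a low chance of being accepted.
    counts = defaultdict(int)
    for completion in index.get(typed, []):
        counts[" ".join(completion.split()[:max_len])] += 1
    return sorted(counts, key=counts.get, reverse=True)

idx = build_phrase_index(["have a nice day", "wish you a nice day"])
print(suggest(idx, "a nice"))   # ['a nice day']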
59. 58
Summary of Solutions
• query cleaning
– correct various types of spelling errors
• query autocompletion
– prevent spelling errors
• query refinement
– make queries more specific, returning fewer results
• query rewriting
– make queries more general / on-topic, returning more relevant results
• query forms
– enable users to specify precise queries
FRONTEND AMBIGUITY
60. 59
Query Refinement
• Motivation
– Some under-specified queries on a large data corpus have too many results
– Ranking cannot always be perfect
• Approaches
– Identifying important terms in results (structured/unstructured)
– Clustering results (structured/unstructured)
– Faceted search (structured)
FRONTEND AMBIGUITY QUERY REFINEMENT
61. 60
Using Clustered Results (liu pvldb 11)
All suggested queries are about
programming language.
It is desirable to refine an ambiguous query
by its distinct meanings.
âJavaâ
FRONTEND AMBIGUITY QUERY REFINEMENT
62. 61
Using Clustered Results (Liu, PVLDB 11)
• Input: clustered results
– the clustering method is irrelevant
– e.g., the results of "Java" may have 3 clusters corresponding to the Java language, Java island, and Java tea
• Output: one refined query for each cluster; each refined query
– maximally retrieves the results in its cluster (recall)
– minimally retrieves the results not in its cluster (precision)
FRONTEND AMBIGUITY QUERY REFINEMENT
63. 62
Using Important Terms in Results (Tao, EDBT 09)
• For relational data only
• Given a keyword query, output the top-k most frequent non-keyword terms in the results, without generating the results
– Avoiding result generation is possible because the terms are ranked only by frequency: a tradeoff of quality against efficiency
Related: Data Clouds (for structured data) — Koutrika, EDBT 09 (more sophisticated term ranking, but needs to generate the query results first)
FRONTEND AMBIGUITY QUERY REFINEMENT
64. 63
Faceted Search

(figure: navigation tree — all → location: Sunnyvale, CA / Phoenix, AZ / Amherst, MA → department: data management / machine learning → …)

Challenges:
1. How to select facets and facet conditions at each level, to minimize the user's expected navigation cost?
2. How to rank facets and facet conditions?

Chakrabarti SIGMOD 04; Kashyap CIKM 10
FRONTEND AMBIGUITY QUERY REFINEMENT
65. 64
Summary of Solutions
• query cleaning
– correct various types of spelling errors
• query autocompletion
– prevent spelling errors
• query refinement
– make queries more specific, returning fewer results
• query rewriting
– make queries more general / on-topic, returning more relevant results
• query forms
– enable users to specify precise queries
FRONTEND AMBIGUITY
66. 65
Query Rewriting
• Motivation
– Synonyms, alternative names: "green card" vs "permanent residency"
– Too specific: "MS Office 2007 for Mac x64 edition"
– Non-quantitative: "small laptop"
• Approaches
– Using query/click logs
– Finding rewriting rules from missing results
• e.g., replace "green card" with "permanent residency"
– Using "differential queries"
FRONTEND AMBIGUITY QUERY REWRITING
67. 66
Using Query and Click Logs (Cheng, ICDE 10)
The availability of query and click logs can be used to assess ground truth.
Idea: given a query Q, find and return historical queries whose "ground truth" (via the click log) significantly overlaps with the top-k results of Q. This yields synonyms, hypernyms, and hyponyms of Q:
• "query" → "search" (synonym)
• "MySQL" → "database" (hypernym)
• "database" → "MySQL" (hyponym)
FRONTEND AMBIGUITY QUERY REWRITING
68. 67
Automatic Suggestion of Rewriting Rules from Missing Results (Bao, SIGIR 12)
• Challenges for automatically generating rewriting rules:
– rules should be semantically natural
– a new rule designed for one query may eliminate good results of another query
Example: for the query "green card", a desirable result d is missing / should be ranked higher; d contains the phrase "permanent residency" → rewriting rule: green card → permanent residency
FRONTEND AMBIGUITY QUERY REWRITING
69. 68
Automatic Suggestion of Rewriting Rules from Missing Results (Bao, SIGIR 12)
• Input: query q, missed desirable results d; output: a selected set of rules
1. Generate candidate rules L → R
– L: n-grams in q; R: n-grams in high-quality fields of d
– e.g., green card → permanent residency; green card → federal government
2. Identify semantically natural rules by machine learning
3. Greedily select a subset of rules that maximizes overall query quality
FRONTEND AMBIGUITY QUERY REWRITING
70. 69
Keyword++ (Entity Databases) (Xin, PVLDB 10)

Query: "small IBM laptop" over an entity table:
ID | Product Name | Brand Name | Screen Size | Description
1 | ThinkPad E545 | Lenovo | 15 | The IBM laptop... small business…
2 | ThinkPad X240 | Lenovo | 12 | This notebook...

Idea: to "understand" a term, compare two queries that differ on that term, and analyze the differences in the attribute-value distributions of the results. E.g., to understand the term "IBM", compare the results of "IBM laptop" vs. "laptop".

FRONTEND AMBIGUITY QUERY REWRITING
71. 70
Suppose "IBM laptop" → 50 results, 30 having brand: Lenovo, while "laptop" → 500 results, only 50 having brand: Lenovo. The difference on brand: Lenovo is significant, reflecting the meaning of "IBM" (a small sketch follows below):
• IBM → brand: Lenovo
• likewise: small → order by size ASC (for "small laptop" vs. "laptop")
Offline: compute the best mapping for all terms in the query log.
Online: compute the best segmentation of the query (dynamic programming).
Keyword++ (Entity Databases) (Xin, PVLDB 10)
FRONTEND AMBIGUITY QUERY REWRITING
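A sketch of the differential comparison using the toy numbers above (the 0.3 significance threshold and the flat result records are assumptions for illustration):

from collections import Counter

def attribute_shift(results_with_term, results_without_term, attr):
    # Compare attribute-value distributions between the two result
    # sets; a large positive shift "explains" the extra term.
    with_t = Counter(r[attr] for r in results_with_term)
    without = Counter(r[attr] for r in results_without_term)
    n1, n2 = len(results_with_term), len(results_without_term)
    shifts = {v: with_t[v] / n1 - without[v] / n2 for v in with_t}
    best = max(shifts, key=shifts.get)
    return best, shifts[best]

ibm_laptop = [{"brand": "Lenovo"}] * 30 + [{"brand": "Other"}] * 20
laptop = [{"brand": "Lenovo"}] * 50 + [{"brand": "Other"}] * 450

value, shift = attribute_shift(ibm_laptop, laptop, "brand")
if shift > 0.3:   # significance threshold is an assumption
    print("IBM -> brand:", value)   # IBM -> brand: Lenovo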
72. 71
Summary of Solutions
• query cleaning
– correct various types of spelling errors
• query autocompletion
– prevent spelling errors
• query refinement
– make queries more specific, returning fewer results
• query rewriting
– make queries more general / on-topic, returning more relevant results
• query forms
– enable users to specify precise queries
FRONTEND AMBIGUITY
73. 72
Query Forms
Advantage: enable users to issue precise structured queries without mastering structured query languages.
Challenges:
• Offline: how many query forms, and which query forms, should be generated?
– too many → hard to find the relevant forms
– too few → limited query expressiveness
• Online: how to identify the query forms relevant to users' search needs?
Baid SIGMOD 09; Jayapandian PVLDB 08; Ramesh PVLDB 11; Tang TKDE 13
FRONTEND AMBIGUITY QUERY FORMS
74. Serving User Queries at Front End (52)
1. Ambiguity (29)
2. Ranking (3)
3. Representation (6)
4. Expert Search (6)
5. Privacy (8)
75. 74
2. Ranking
Ranking Method Categories: vector space model; proximity-based ranking; authority-based ranking; …

Vector space model
• Unstructured data: represent queries and documents as vectors; each component is a term whose value is its weight; ranking score = similarity(query vector, result vector)
• Structured data: a "document" → a node or a result (subgraph/subtree)
FRONTEND RANKING
76. 75
2. Ranking
Ranking Method Categories (continued)

Proximity-based ranking
• Unstructured data: proximity of the keyword matches in a document can boost its ranking
• Structured data: weighted tree/graph size, total distance from root to each leaf, semantic distance, etc.
FRONTEND RANKING
77. 76
2. Ranking
Ranking Method Categories (continued)

Authority-based ranking
• Unstructured data: nodes linked to by many other important nodes are important
• Structured data: authority may flow in both directions of an edge; different types of edges in the data (e.g., entity–entity edges, entity–attribute edges) may be treated differently
FRONTEND RANKING
78. Serving User Queries at Front End (52)
1. Ambiguity (29)
2. Ranking (3)
3. Representation (6)
4. Expert Search (6)
5. Privacy (8)
79. 78
3. Representation
• An enterprise corpus can be much more heterogeneous than a collection of documents or web pages
• Different searches may have different types: retrieving a document, a figure, a tuple, a subgraph, analytical keyword queries, etc.
Solutions: result diversification; result summarization; result differentiation
FRONTEND REPRESENTATION
80. 79
Result Diversification
• Result diversification is essentially the same problem as query refinement
– e.g., Java → Java language, Java tea, Java island
• The same techniques apply
FRONTEND REPRESENTATION DIVERSIFICATION
81. 80
Result Summarization
• Unstructured data: lots of work on text summarization in the machine learning, natural language processing, and IR communities
• Structured data:
– Size-l object summary (relational)
– Result snippet (XML)
Surveys: Das, CMU 07 (unpublished); Nenkova, Mining Text Data 12
FRONTEND REPRESENTATION SUMMARIZATION
82. 81
Size-l Object Summary (Fakas, PVLDB 11)
Query "Mike":
• Unstructured: show the first window of text around "…Mike…"
• Structured: which connected part of the result to show? (figure: Mike → papers, patents; paper → conference, co-author John; …)
FRONTEND REPRESENTATION SUMMARIZATION
83. 82
Size-l Object Summary (Fakas, PVLDB 11)
• Each tuple has:
– a static importance score (similar idea to PageRank)
– a run-time relevance score (distance to the result root; connectivity properties to the result root)
• Objective: find a connected snippet of the result that consists of l tuples and has the maximum score
• Dynamic-programming-based solution
Related: result snippets for XML — Liu, TODS 10
FRONTEND REPRESENTATION SUMMARIZATION
84. 83
Result Differentiation
Result 1 Result 2
event: year 2000 2012
paper: title OLAP
data
mining
cloud
scalability
search
âNEC Labs Open Houseâ
result 1: a large table with many
people / papers / posters
result 2: a large table with many
people / papers / posters
âŚ
results result differentiation
vs. comparing different credit cards on a bank website:
only with pre-defined features.
FRONTEND REPRESENTATION DIFFERENTIATION
85. 84
4. Expert Search

Goal: find an expert within an enterprise to solve a particular problem.

Ways of judging an expert:
• documents in which a candidate and a topic co-occur
• topics near a candidate in a document
• problem solving / ticket routing history
• user's knowledge of the topic
– the expert should be more knowledgeable than the user
• social relationship between expert and user
– problem solving is usually more effective if the expert has a close social relationship with the user
• external corpora
– many employees publish externally, e.g., papers and blogs
FRONTEND EXPERT SEARCH
86. 85
Classical Methods

Candidate model:
• Builds a feature vector for each expert using various evidence
• Ranks experts against the query, using traditional retrieval models

Document model:
• First finds documents related to the query, then locates experts in those documents
• Mimics the process a human would take

Survey: Balog, CIKM 08
FRONTEND EXPERT SEARCH
87. 86
User-Oriented Model (Smirnova, ECIR 11)
Users prefer experts who (e = expert, u = user, q = query):
• are more knowledgeable than themselves — knowledge gain: p(e|q) − p(u|q)
• have a close social relationship with themselves — time-to-contact: shortest path in the organization graph (e.g., department → head / employees)
FRONTEND EXPERT SEARCH
88. 87
Using a Web Search Engine (Santos, Inf. Process. Manage. 11)

(flow: query q → search the intranet corpus → intranet results; formulate a web query q' → internet results; combine)

Web query formulation:
• candidate's full name: "Jeff Smisek"
• organization's name: "IBM"
• terms in q: "data integration"
• excluding results from the organization: "-site:ibm.com"
FRONTEND EXPERT SEARCH
89. 88
Ticket Routing (Shao, KDD 08)
new ticket: DB2 login failure
transferred to group A
transferred to group B
transferred to group C
resolved
How to find the best group and
reduce problem solving time?
Markov chain model
Using only previous routing
history (not ticket content)
FRONTEND EXPERT SEARCH
90. 89
Ticket Routing (Shao, KDD 08)
Pr(g|S): the probability of routing a ticket to group g given the previous groups S. Pr(g|S) includes the probability that:
• g can solve the ticket
• g can correctly re-route the ticket
Train the Markov chain model from the ticket-routing history. (A small sketch follows below.)
FRONTEND EXPERT SEARCH
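A sketch of training such a model; for simplicity it conditions only on the current group (a first-order chain), whereas the paper conditions on the set S of previously visited groups:

from collections import defaultdict

def train_transitions(routing_histories):
    # Estimate Pr(next group | current group) from historical
    # ticket-routing sequences.
    counts = defaultdict(lambda: defaultdict(int))
    for seq in routing_histories:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return {cur: {nxt: c / sum(nxts.values())
                  for nxt, c in nxts.items()}
            for cur, nxts in counts.items()}

def best_next_group(model, current):
    dist = model.get(current, {})
    return max(dist, key=dist.get) if dist else None

histories = [["A", "B", "C"], ["A", "C"], ["B", "C"], ["A", "B", "C"]]
model = train_transitions(histories)
print(best_next_group(model, "A"))   # 'B' (2 of 3 transitions from A)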
91. Serving User Queries at Front End (52)
1. Ambiguity (29)
2. Ranking (3)
3. Representation (6)
4. Expert Search (6)
5. Privacy (8)
92. 91
5. Privacy
User privacy: it is sometimes desirable that the search engine doesn't know which documents a user wants to retrieve.
• For users: privacy
• For enterprises: avoiding liability

Data privacy: while a search engine answers individual keyword searches, there are methods that perform multiple searches and, from the answers, piece together aggregate information about the underlying corpus.
• Enterprises may not want to disclose such information to all users.
93. 92
User Privacy
It is sometimes desirable that the search engine doesn't know which documents a user wants to retrieve (for users: privacy; for enterprises: avoiding liability).

Solutions:
• Private Information Retrieval (PIR) — an old topic with a large body of theoretical work
• Modifying the search engine, e.g., forcing it to forget user activities, or embellishing queries with decoy terms (Pang, PVLDB 10)
• Using ghost queries to obfuscate user intention (Pang, ICDE 12) — no change to the search engine; lightweight
94. 93
Private Information Retrieval (PIR)
• Idea: retrieve more documents than needed
• Naïve: retrieve the entire corpus
• How to minimize the number of retrieved-but-unneeded documents?
• Many theoretical papers on different variations of the problem, e.g.,
– different computational power of the search engine
– different numbers of non-communicating corpus replicas
Survey: Gasarch, EATCS Bulletin 2004
95. 94
Ghost Queries (Pang, ICDE 12)
(flow: user query → generate ghost queries → submit all to the search engine → discard ghost-query results → results)
• Challenges
– Generate ghost queries on topics different from the user's topics of interest, and make it difficult for the search engine to infer the user's topics
– Ghost queries need to be meaningful/realistic, so that they cannot be easily identified
96. 95
Ghost Queries (Pang, ICDE 12)
• (e1, e2) privacy model
– Given a user query, if the probability of a topic increases by more than e1, it should be reduced to below e2 by the ghost queries
• Topics are predefined
• A ghost query must be coherent: all words in the ghost query should describe common or related topics
• Randomized-algorithm-based solution
97. 96
Data Privacy
While a search engine answers individual keyword searches, there are methods that perform multiple searches and, from the answers, piece together aggregate information about the underlying corpus. Enterprises may not want to disclose such information to all users.

Solutions:
• inserting dummy tuples OR randomly generating attribute values
– only applicable to structured data
• disallowing certain queries OR returning snippets
– search-quality loss
• altering a small number of results: adding dummy results, modifying results, hiding some results (Zhang, SIGMOD 12)
FRONTEND PRIVACY
98. 97
Aggregate Suppression (Zhang, SIGMOD 12)
• Example: consider corpora A and B
– A: n documents; B: 2n documents; A ⊂ B
• Goal: suppress COUNT(*), i.e., an adversary cannot tell which corpus is larger
• Naïve approach 1: deterministically remove n documents from B
– achieves the goal, but with search-utility loss: those n documents can never be retrieved
• Naïve approach 2: randomly drop half of the results at run time
– no search-utility loss, but fails to achieve the goal: a clever adversary can still get the information
FRONTEND PRIVACY
99. 98
Aggregate Suppression (Zhang, SIGMOD 12)
• Algorithm ideas
– carefully adjust query degree (the number of documents matched by a query) and document degree (the number of queries matching a document) by hiding documents at run time
– decline a query if its result can be covered by a small number of previous queries; return the previous query results instead
FRONTEND PRIVACY
100. 99
Backend
Collect data
Analyze data
Store and index data
Admin
System performance
Search quality control/improvement
Frontend
Interpret user query
Search index
Present results
Interact with user
index
Data
source
Tutorial Outline
101. 100
Enterprise Search Administrators
• Main responsibilities
– Care and feeding of an enterprise search solution
• Monitor intranet help inboxes and respond to requests
• Assist in troubleshooting intranet issues for content contributors
• Core skills required
– Understand general corporate business processes
– Experience in coordinating activities and managing relationships
• with employees, content administrators, stakeholders, IT teams, and external agencies
Key observation: search administrators ≠ IR experts
Search Admin
Admin Overview
102. 101
What Does a Search Administrator Need?

Enterprise users: "Bad results for query …"; "I'm missing the golden URL…"; "Result 22 should be ranked much higher!" Query logs: the query "global campus" seems unsatisfying.

• Understand overall search quality
– overall trend; year-over-year change; by segmentation
• Understand individual search results
– why a certain result is or isn't brought back, and its ranking
• Maintain search quality as the underlying data evolves
– terminology changes; policy/business-process changes; organization changes; hot topics

Search Admin
Admin Overview
105. 104
Admin Examples
115. 114
Experience at IBM Internal Search
• IBM deployed a commercially available search engine implementing standard IR techniques
• Search quality went down over time, to the point that search results were unacceptable!
– Success (≥ 1 relevant result): 14% at top-1, 23% at top-5, 34% at top-50! [Zhu et al., WWW'07]
• So, they implemented various solutions…
• To the administrators managing the engine, the exposed control knobs were insufficient
Case Study Background
116. 115
Attempts to Improve Search
• Enhanced link analysis by incorporating links to/from the external WWW
– Didn't help… quality went down!
• Creative hacks: added fake terms to documents & queries
– number of terms per document determined by "popularity": how much TF increase is required for the needed rank boost?
– Maintenance nightmare: the heuristic needs updating upon each nontrivial change in term statistics / ranking parameters
• Hard-coded custom results for the top 1200+ queries
– An even bigger nightmare! How to deal with continuously changing terminology?
Case Study Background
117. 116
Goals of Gumshoe
Continually changing terminology and domain-specific meaning:
• Product names change: "Network Station Manager" search → Thin Client Manager
• "Paula Summa" search → bring Paula Summa from the employee directories
• "popcorn" search → conference call!

Domain-specific repetitions:
• "per diem" search → Result 1: IBM Travel: Per Diem; Result 2: IBM Travel: Per Diem Rates; Result 3: IBM Travel: National perdiems; … Result 25: IBM Travel: Per Diem Policy

Gumshoe:
• Generic search solution, customizable & maintainable in many domains
– Simple customization with reasonable effort
– Ongoing search-quality management
• Philosophy: programmable search
Case Study Background
118. 117
Programmable Search: Main Idea
• Goals:
– Transparency
• Know "precisely" why every result item is being brought back
• Understand how changes in content/intents affect search
– Maintainability and "debuggability"
• Ranking logic is guided by explicit rules
• React properly to changes in content/intents
• Building blocks:
– Deep analytics on documents
– Domain-specific analysis of queries
– Transparent, customizable, rule-driven ranking
(figure: backend analytics → interpretations → runtime rules)
Case Study Background
119. 118
Implementation Architecture
• Backend: distributed analytics platform (IBM InfoSphere BigInsights) — crawling, information extraction, token generation (TG), indexing
• Frontend: search runtime — index, plus index and rule update services
(figure: backend analytics → interpretations → runtime rules → search runtime)
Case Study Background
120. 119
Backend Analytics: 3 Parts
Local Analysis
(per-page analysis)
Global Analysis
(cross-page analysis)
Token Generation
(TG)
index
Case Study Background
121. 120
Local Analysis
• Categorizing pages
– Label pages by custom categories
• IBM examples: HR, person, IT help, ISSI, sales information, marketing, corporate standards, legal & IP law, …
– Geo classification
• Associate documents with the relevant countries & regions
• Simply knowing where a page is physically hosted is not enough (example: the Czech Republic hosts all pages for IBM in Europe)
• Annotating pages
– Identify HomePage annotations for people, projects, communities, …
Case Study Backend Local Analysis
122. 121
• Declarative approach
– Define an operator for each basic operation
• Input: tuple of annotations
• Output: tuples of annotations
– Compose operators to build complex extractors
• Algebraic expression
• One document at a time → trivial parallelism
• Benefits of the declarative approach:
– Expressivity: richer, cleaner rule semantics
– Performance: better performance through optimization
Case Study Backend Local Analysis
123. 122
SystemT – Overview

(figure: AQL extractors — AQL rules + embedded machine-learning models → cost-based optimization → highly embeddable SystemT runtime; input documents → extracted objects; embedded in InfoSphere BigInsights, InfoSphere Streams, IBM engines, UIMA, …)

Sample AQL rules:

create view SentimentFeatures as
select ...
from ...;

create view Company as
select ...
from ...
where ...;

create view SentimentForCompany as
select T.entity, T.polarity
from classifyPolarity(SentimentFeatures) T;
Case Study Backend Local Analysis
124. 123
Homepage Identification [Zhu et al., WWW'07]

(figure: an intranet page feeds several extractors)
• Title extraction: matching title patterns → titles; a dictionary match against the employee directory turns "G J Chaitin Home Page" into Home Page for G J Chaitin
• URL extraction: matching URL patterns → homepage for: idp, isc, chis (from http://w3.ibm.com/hr/idp/, http://w3-03.ibm.com/isc/index.html, http://chis.at.ibm.com/)
• … many more …
Case Study Backend Local Analysis
125. 124
Among the 38 pages with the exact same title, which is the best for "Paula Summa"?
Role of Global Analysis
Case Study Backend Global Analysis
126. 125
Token Generation (TG) — annotated values → index content

• Person: "Ching-Tien T. (Howard) Ho"
– personNameTG → Ho Ching-Tien; Tien Ho; Ho, Tien; Howard Ho; Ching-Tien H.; …
– nGramTG → Howard Ho; Ching Tien; …
• Title: "Global Technology Services"
– acronymTG → gts; Global Technology Services
– spaceTG → GlobalTechnologyServices
– nGramTG → Global Technology; Technology Services; Global Technology Services; …
Case Study Backend Token Generation
127. 126
3 Phases of Runtime Flow
Search query →
• Phase 1: Query Semantics — rewrite rules; query interpretation
• Phase 2: Relevance Ranking — by relevance buckets + conventional IR
• Phase 3: Result Construction — grouping rules; re-ranking rules
Case Study Frontend
128. 127
Runtime Flow in More Detail
Phase 1 (Query Semantics): query search → rewrite rules → queries → query interpretation → partially ordered interpretations
Phase 2 (Relevance Ranking): interpretations execution → partially ordered results → result aggregation → ordered results
Phase 3 (Result Construction): grouping rules → ordered & grouped results → re-ranking rules → final results
Case Study Frontend
129. 128
Runtime Rules: Pattern-Action Language (Fagin 2012)

A rule is a query pattern (a pattern expression matched against the keyword query) plus an action performed when the pattern matches (a toy sketch follows below):
• EQUALS [r=ibm|information|info] [d=COUNTRY] — matches "ibm germany", "info india" → rewrite into "[country] hr" (e.g., germany hr)
• ENDS_WITH installation — matches "acrobat installation", "db2 on aix installation" → replace installation with ISSI (e.g., acrobat ISSI)
• CONTAINS directions to [d=SITE] — matches "driving directions to almaden", "directions to watson from jfk" → pages of the "siteserv" category should be ranked higher
• STARTS_WITH [d=PERSON] — matches "john kelly biography", "steve mills announcement" → group together pages that represent blog entries

• Similar to the query-template rules of Agarwal et al. [WWW 2010]
Query SemanticsCase Study Frontend
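A toy Python rendering of two of the rules above (the country dictionary and matching logic are illustrative, not Gumshoe's actual pattern-action engine):

import re

COUNTRIES = {"germany", "india"}   # stands in for [d=COUNTRY]

def rule_equals_country(query):
    # EQUALS [r=ibm|information|info] [d=COUNTRY] -> "[country] hr"
    m = re.fullmatch(r"(ibm|information|info)\s+(\w+)", query)
    if m and m.group(2) in COUNTRIES:
        return m.group(2) + " hr"

def rule_ends_with_installation(query):
    # ENDS_WITH installation -> replace installation with ISSI
    if query.endswith(" installation"):
        return query[:-len("installation")] + "ISSI"

RULES = [rule_equals_country, rule_ends_with_installation]

def apply_rules(query):
    # First matching rule wins; otherwise pass the query through.
    for rule in RULES:
        rewritten = rule(query)
        if rewritten is not None:
            return rewritten
    return query

print(apply_rules("ibm germany"))            # germany hr
print(apply_rules("acrobat installation"))   # acrobat ISSI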
131. 130
What's Best for Benefits?
The most important IBM page for benefits changes over time: currently it is netbenefits.
Query Semantics Case Study Frontend
136. 135
Complex Rules
Example: "java" search → interpretation: java jim and not in person category
(runtime flow: rewrite rules → queries → interpretations → partially ordered interpretations → interpretations execution → partially ordered results → result aggregation → ordered results → grouping rules → ordered & grouped results → re-ranking rules → final results)
Query Semantics Case Study Frontend
137. 136
Interpretations — scenario: an IBM employee wants to download Lotus Symphony 1.3.
Runtime interpretation: download symphony 1.3 → category=issi, software=symphony 1.3
Query Semantics Case Study Frontend
138. 137
Query Semantics Case Study Frontend
139. 138
3 Phases of Runtime Flow
Search query →
• Phase 1: Query Semantics — rewrite rules; query interpretation
• Phase 2: Relevance Ranking — by relevance buckets + conventional IR
• Phase 3: Result Construction — grouping rules; re-ranking rules
Relevance Ranking Case Study Frontend
140. 139
Recall: Token Generation (TG) — annotated values → index content

• Person: "Ching-Tien T. (Howard) Ho"
– Person + personNameTG → Ho Ching-Tien; Tien Ho; Ho, Tien; Howard Ho; Ching-Tien H.; …
– Person + nGramTG → Howard Ho; Ching Tien; …
• Title: "Global Technology Services"
– Title + acronymTG → gts; Global Technology Services
– Title + spaceTG → GlobalTechnologyServices
– Title + nGramTG → Global Technology; Technology Services; Global Technology; …
Relevance Ranking Case Study Frontend
141. 140
Annotation + TG → Relevance Bucket

(figure: a query search matches index entries produced by Person + personNameTG, Person + nGramTG, Title + acronymTG, Title + spaceTG, Title + nGramTG, … → relevance buckets)

• Buckets are ranked
– based on annotation type
– based on TG quality
• A page can belong to multiple buckets
• Within each bucket, ranking is by conventional IR
Relevance Ranking Case Study Frontend
142. 141
Ranking by Relevance Buckets
Example: "employment verification" search
(runtime flow as above: rewrite rules → interpretations → execution → partially ordered results → result aggregation → ordered results)
Relevance Ranking Case Study Frontend
143. 142
3 Phases of Runtime Flow
Search query →
• Phase 1: Query Semantics — rewrite rules; query interpretation
• Phase 2: Relevance Ranking — by relevance buckets + conventional IR
• Phase 3: Result Construction — grouping rules; re-ranking rules
Result Construction Case Study Frontend
144. 143
Grouping Rules
• Grouping rules define how search results should be grouped together
• Search administrators can improve the diversity of search results (on the first page)
– based on their familiarity with the data sources
Query pattern → group pages of the same category:
• "per diem" → travel, you-and-ibm
• ANY → ISSI, IT Help Central, Forum, Bluepedia, Media Library, …
Result Construction Case Study Frontend
145. 144
Flooding with Similar Pages: need first-page diversity
Result Construction Case Study Frontend
146. 145
Grouping Rule to the Rescue
Rule: per diem → travel, you-and-ibm
Result Construction Case Study Frontend
147. 146
Grouping Rule to the Rescue
Rule: per diem → travel, you-and-ibm; example: "per diem" search
(runtime flow: … → grouping rules → ordered & grouped results → re-ranking rules → final results)
Result Construction Case Study Frontend
148. 147
Re-ranking Rules
⢠Re-ranking rules adjust ranking of
search results based on categories
⢠Example: search administrator specifies the
important sources of âhot/current topicsâ
Hot topics Rank these categories higher
Bluepedia, News, About-IBM
smarter planet, cloud
computing, centennial, âŚ
Result ConstructionCase Study Frontend
149. 148
Re-ranking Rule for Hot Topics
Rule: hot topics (smarter planet, cloud computing, centennial, …) → rank the categories Bluepedia, News, About-IBM higher
(example results: Bluepedia, technical news, homepages of "About IBM")
Result Construction Case Study Frontend
150. 149
Re-ranking Rules for Person Queries
Rule: [d=PERSON] → rank the categories executive_corner, media_library, organization_chart, files higher
Result Construction Case Study Frontend
151. 150
Grouping Rule to the Rescue (as above): per diem → travel, you-and-ibm; example: "per diem" search
Result Construction Case Study Frontend
152. 151
3 Phases of Runtime Flow
Search query →
• Phase 1: Query Semantics — rewrite rules; query interpretation
• Phase 2: Relevance Ranking — by relevance buckets + conventional IR
• Phase 3: Result Construction — grouping rules; re-ranking rules
Case Study Frontend
153. 152
Recap: What Administrators Need…
• Search administrators have major problems with an opaque search engine
• Programmable search provides
– customization to the specific domain
– ongoing search-quality management
– allows building a search-quality toolkit
Case Study Admin
157. 156
Proof of Pudding is in the Eating
• Immediate positive impact within the first 3 months
– Improved natural clickthrough rate by 100%+
– Top 5 results: selected about 90% of the time
• Sustained search-quality improvements in the 4 years since going live
– Stable natural search clickthrough rate
(figure: natural clickthrough rate — old intranet search (Aug. 2010 – Aug. 2011) vs. Gumshoe (Aug. 2011 – Oct. 2011))
Case Study Results
158. 157
Summary
Programmable search: simple & flexible customization; search-quality management.
• Backend analytics: local analysis (per-page), global analysis (cross-page), token generation (TG)
• Phase 1 — Query Semantics: rewrite rules, query interpretation
• Phase 2 — Relevance Ranking: by relevance buckets + conventional IR
• Phase 3 — Result Construction: grouping rules, re-ranking rules
• Tooling: search provenance; rule suggestion; utilization of relevance buckets
References: [Fagin et al., PODS'10, PODS'11]; [Li et al., SIGIR'06]; [Zhu et al., WWW'07]; [Bao et al., ACL'2010, SIGIR'2012, CIKM'2012]
Case Study Summary
160. 159
Search Engine Components
Backend
Collect data
Analyze data
Store and index data
Admin
System performance
Search quality control/improvement
Frontend
Interpret user query
Search index
Present results
Interact with user
index
Data
source
161. 160
Future Directions
Data Heterogeneity
A rich variety of data types need to be searched in
enterprises.
⢠docs, databases, images, videos, social graphs, etc.
observations
How to automatically identify relevant data types, and
search and rank across different data types?
⢠e.g., for image search, should image recognition techniques
be incorporated in enterprise search engines? If so, how?
questions
162. 161
Future Directions
Data Freshness
New data is continuously collected and published in enterprises, and the rate can be very fast.
Web search engines are not required to index new websites quickly, but in enterprises, new content may need to be searchable as soon as possible.
observations
How to build efficient real-time indexes to ensure data
freshness in enterprise search?
questions
163. 162
Future Directions
Search Context
Enterprise search users have richer profiles than web users.
⢠activities, bio, position, projects, experiences, etc.
observations
How to utilize users' contexts to provide customized results?
Is it possible to predict the information a user may want, and
push it to the user?
questions
164. 163
Future Directions
User Preference
Different users in an enterprise have different expertise, and
may prefer different ways to express queries.
⢠e.g., some users prefer pure keyword search, while
others may want lightly-structured queries.
observations
How to effectively satisfy different needs for expressing
queries for different users?
questions
165. 164
Future Directions
Question Answering
The purpose of many enterprise searches is to find answers to questions.
⢠e.g., what is the previous name of a product, and when
did we change to the current name?
observations
Is it possible to effectively use natural language processing
techniques and domain knowledge to automatically answer
natural language questions?
questions
166. 165
Future Directions
Transactional Search
Over 1/3 of enterprise search queries are transactional. It would be desirable if enterprise search engines could recommend business processes to accomplish a certain task given a transactional search.
• E.g., given a customer's lengthy complaint letter, how to find the departments relevant to the complaints.
observations
How to better support transactional search? How to initiate
a business process based on the results of a search?
questions
167. 166
Future Directions
Big Data Analytics
Rich information and knowledge lies in big data. Many
employees (not just data analysts) may benefit from the
ability to perform analytics on the companyâs big data.
observations
How to build a low-cost, interactive platform that allows a
large number of employees to issue analytical queries?
How to give employees the capabilities to analyze big data,
if they have little knowledge of SQL or MapReduce
programming?
questions
168. 167
Future Directions
Tooling for Search Quality Maintenance
Most enterprise search engines have to be manually
evaluated and tuned by a search administrator with domain
knowledge, in an ad-hoc fashion.
observations
Can we automate this process, or at least minimize manual
involvement?
Can we fully utilize explicit user feedback?
• Explicit user feedback is easier to obtain in enterprise search, and there is less spam.
questions
169. Thanks.
Acknowledgement:
IBM Research: Sriram Raghavan, Fred Reiss, Shiv Vaithyanathan, Ron Fagin
IBM CIO's Office: Nicole Dri, Brian C. Meyer
LogicBlox: Benny Kimelfeld*
TripAdvisor: Adriano Crestani Campos*
Facebook: Zhuowei Bao*
NJIT: Yi Chen
UNSW: Wei Wang
* work done while at IBM