Enterprise Search in the
Big Data Era
Yunyao Li
Ziyang Liu
Huaiyu Zhu
IBM Research - Almaden
NEC Labs
IBM Research - Almaden
1
Enterprise Search
• Providing intuitive access to an organization’s
various digital content
1
Reports Find
• IDC report [IDC 05]
  – $5k/person/year in salary wasted due to poor search
  – 9-10 hr/person/week spent searching
  – unsuccessful 1/3-1/2 of the time
• Butler Group [Edwards 06]
  – 10% of salary cost wasted through ineffective search
• Accenture survey [Accenture 07]
  – Middle managers spend 2 hr/day searching
  – >50% of what they found had no value
• Hawking, Enterprise Search, http://david-hawking.net/pubs/ModernIR2_Hawking_chapter.pdf
• [IDC 05] “The enterprise workplace: How it will change the way we work”. IDC Report 32919
• [Edwards 06] www.butlergroup.com/pdf/PressReleases/ESRReportPressRelease.pdf
• [Accenture 07] http://newsroom.accenture.com/article_display.cfm?article_id=4484
2
Search from User’s Point of View
[Diagram: the user types a query, “magic” happens, and a ranked list of results (1, 2, 3, …) comes back]
INTRODUCTION SEARCH
3
What Happens Behind the Scene
Backend
Collect data
Analyze data
Index data
Frontend
Serve user queries
Return results
[Diagram: data source → backend → index → frontend]
INTRODUCTION SEARCH
4
How Does a Query Match a Document?
[Diagram: offline, documents are analyzed and an index is built; online, the query is analyzed, the index is searched, and results (Doc 1, Doc 2, …) are presented]
INTRODUCTION SEARCH
5
Search Is More Than Keyword Match
• Specific features in documents are important
– Title, url, person name, product, actions, …
• Features combine to form higher level concepts
– In document: home page + person → personal homepage
– Cross document: URL link analysis, …
• The string representation in document may not match that in
user query
– Person name: “Bill Clinton” vs. “William Jefferson Clinton”
• User queries may be ambiguous
– Multiple interpretations
• Presenting the results to user
– Ranking, grouping, interactive refinement
INTRODUCTION SEARCH
6
Internet vs Enterprise – Web data [Fagin WWW2003]
• Creation of content
  – Internet: democratic; appealing to the reader; links confer approval
  – Enterprise: bureaucratic; conforms to mandate; links reflect internal structure
• Relevant query results
  – Internet: large number; overlapping information; a reasonable subset suffices; ranking is more universal
  – Enterprise: small number; specific function; specific pages required; ranking is relative to the query
• Spamming
  – Internet: spam-infested; ranking can only be based on external authority
  – Enterprise: mostly spam-free; ranking based on content or metadata is reliable
• Search engine friendliness
  – Internet: web pages designed to be search results
  – Enterprise: documents not designed to be search results; need special treatment
INTRODUCTION ENTERPRISE VS INTERNET
7
Internet vs Enterprise – Big Data
• Content being searched
  – Internet: source is the web crawl; formats: html, xml, pdf, …
  – Enterprise: variety of sources; variety of formats: email, database, application-specific access and formats
• Search queries / expected results
  – Internet: targets web pages and office documents; expects a list of documents; expects little personalization; results returned directly
  – Enterprise: targets rows, figures, experts, …; expects customized results; personalization required (geography, access, …); results must be customized
• Related information
  – Internet: links confer approval; small amount of domain-specific knowledge; generic analysis
  – Enterprise: links reflect organization structure; large amount of dynamic domain-specific knowledge; highly specialized analysis
• Skill set of search admins
  – Internet: large number of admins; search experts; facilitate update of search algorithms
  – Enterprise: small number of admins; domain experts; facilitate use of domain knowledge
INTRODUCTION ENTERPRISE VS INTERNET
8
Search Engine Components
Backend
Collect data
Analyze data
Store and index data
Admin
System performance
Search quality control/improvement
Frontend
Interpret user query
Search index
Present results
Interact with user
[Diagram: data source → backend → index → frontend]
INTRODUCTION TUTORIAL OVERVIEW COMPONENTS
9
Search Engine Architecture
Backend
Collect data
Analyze data
Store and index data
Admin
System performance
Search quality control
Frontend
Interpret user query
Search index
Present results
Interact with user
[Diagram: data source → backend → index → frontend]
10
Main Backend Functions
Analysis (Understand)
Information extraction
Analyze and transform data
Indexing (Prepare for search)
Generate terms suitable for match queries
Index search terms
index
Document Ingestion (Collect)
Collect all the data to be searched
Transform and store as documents
Local Analysis
(in-document analysis)
Global Analysis
(cross-document analysis)
11
Backend Section Outline
• Overview
• Data Ingestion
• Local analysis
• Global analysis
• Indexing
12
Typical analytics pipeline
DI → LA → GA → Idx
• Data ingestion (DI)
  – Collect data
  – Transform to uniform document format
  – Store in document store
• Local analysis (LA)
  – Information extraction from each document, producing per-document feature sets S1={f11, f12, …}, S2={f21, f22, …}, S3={f31, f32, …}
• Global analysis (GA)
  – Cross-document analysis, producing groups G1={g1, …}, G2={g2, g3, …}
  – Rank, group, merge, and filter documents
• Indexing (Idx)
  – Generate search terms
  – Index documents by search terms
BACKEND OVERVIEW
13
Digression: Classical IR
• Data ingestion (DI): given a set of files
• Local analysis (LA): tokenize; remove stop words; stem; form n-grams
• Global analysis (GA): calculate statistics of terms in documents
• Indexing (Idx): generate search terms; index by terms, with statistics
BACKEND OVERVIEW
14
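The classical local-analysis chain above (tokenize, remove stop words, stem, form n-grams) can be sketched as follows; the tiny stop list and suffix-stripping stemmer are illustrative stand-ins for a real analyzer (e.g., a Porter stemmer), not the tutorial's implementation:

```python
import re

STOP_WORDS = {"the", "of", "a", "an", "to", "in"}  # illustrative stop list

def stem(token):
    # Crude suffix stripping, a stand-in for a real stemmer.
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text, n=2):
    # Tokenize, drop stop words, stem, and form word n-grams.
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    tokens = [stem(t) for t in tokens if t not in STOP_WORDS]
    ngrams = [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]
    return tokens, ngrams

tokens, bigrams = analyze("reimbursement of travel expenses")
print(tokens)   # ['reimbursement', 'travel', 'expens']
print(bigrams)  # ['reimbursement travel', 'travel expens']
```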
Digression: Classical Web search
• Data ingestion (DI): crawl web pages
• Local analysis (LA): extract outgoing links
• Global analysis (GA): compute the principal eigenvector of the link (connection) matrix, i.e., PageRank
• Indexing (Idx): generate search terms; index documents by search terms, together with PageRank
BACKEND OVERVIEW
15
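The global-analysis step above — the principal eigenvector of the connection matrix — is what power iteration computes for PageRank. A minimal sketch over a hypothetical three-page link graph:

```python
def pagerank(links, damping=0.85, iters=50):
    """Power iteration for PageRank.
    links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:  # dangling page: spread its rank uniformly
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Hypothetical 3-page intranet: a and c link to b, b links back to a.
r = pagerank({"a": ["b"], "b": ["a"], "c": ["b"]})
print(max(r, key=r.get))  # 'b' accumulates the most rank
```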
Demands of Enterprise Search
• Data ingestion (DI): handle a variety of sources and formats; deal with access policy; deal with update policy
• Local analysis (LA): incorporate domain knowledge; extract a rich set of semantics; categorize documents
• Global analysis (GA): cross-document analysis; rank, group, merge, and filter documents
• Indexing (Idx): generate search terms; index documents by search terms
BACKEND OVERVIEW
16
Desiderata of backend
• Efficient incremental updates
  – Fast turnaround time for updates
• System performance and reliability
  – Scaling with data size and available resources
  – Fault tolerance
• Ease of administration and quality improvement
  – Allow search admins to customize domain-specific configurations
BACKEND OVERVIEW CHALLENGES / OPPORTUNITIES
17
Backend Section Outline
• Overview
• Data Ingestion
• Local analysis
• Global analysis
• Indexing
18
Data Ingestion
BACKEND DATA INGESTION
[Diagram: crawl/push from a variety of sources (web, DB, apps); convert to text; convert to documents — e.g., an email plus its pdf attachment becomes two documents (Docid 0001, Docid 0002); store in the doc. store, supporting update & retention policies]
19
Document-centric View
• Data as a collection of documents
– Document as unit of storage and search result.
– Three major components
• Unique document identifier in the whole system
• Metadata fields: url, date, language, …
• Content field: text to be searched
• Representation of data of different structures
  – Web pages → each page is a document
  – Relational data → each row is a document
  – Hierarchical data → each node is a document
BACKEND DATA INGESTION
20
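The three-component document model above can be sketched as a uniform record; the field names and example ids are illustrative, not from any particular system:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    # Unique document identifier in the whole system.
    docid: str
    # Metadata fields: url, date, language, ...
    metadata: dict = field(default_factory=dict)
    # Content field: the text to be searched.
    content: str = ""

# A web page and a relational row mapped to the same uniform shape.
page = Document("web:0001",
                {"url": "http://enterprise.com/hr/", "lang": "en"},
                "HR benefits home page ...")
row = Document("db:emp:42", {"table": "employees"},
               "John Doe, Almaden, ...")
print(page.docid, row.docid)
```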
Push vs Pull
• Definition
  – Pull: the search engine initiates the transfer of data (e.g., a web crawler)
  – Push: the content owner initiates the transfer of data (e.g., apps with push notification)
• Advantages
  – Pull: operated by the search engine; uses standard crawlers
  – Push: can handle special access methods; easy to adjust refresh rate; easy to handle special formats
• Disadvantages
  – Pull: difficult to access special data sources; difficult to adjust domain-specific treatment
  – Push: needs synchronization with the content owner
• Applicability
  – Pull: prevalent on the Internet; also useful in the enterprise
  – Push: rare on the Internet; very important in the enterprise
BACKEND DATA INGESTION
21
Transform the Data
• Format conversion
– Convert content to text: pdf, doc, …
• Keep as much structure as possible
• Metadata conversion
– Obtain and transform metadata: HTTP headers,
DB table metadata, …
• Merge/split documents
– One-to-many: Zip file, email thread, attachments
– Many-to-one: social tags merge to original doc
BACKEND DATA INGESTION
22
Storage options
• SQL database
  – Pro: traditional RDBMS strengths; supports insert, update, delete, fielded query
  – Con: too much system overhead
• Indexing engine (e.g., Lucene)
  – Pro: closer to the document-centric view; supports insert, delete, fielded query
  – Con: no direct in-document update; needs special treatment for distributed processing
• NoSQL databases
  – Pro: lightweight; sufficient for simple use
  – Con: may lack features needed in the future (e.g., transactions)
Issues to consider: in-document update; access/retention policy; parallel processing
BACKEND DATA INGESTION
23
Backend Section Outline
• Overview
• Data Ingestion
• Local analysis
• Global analysis
• Indexing
24
Local Analysis
• Annotating pages
– Extract structured elements: title, header, …
– Extract features for people, projects,
communities, …
– Extract features for cross-document analysis.
• Categorizing pages
– Label by standard categories
• Language, geography, date, …
– Label pages by custom categories
• IBM examples: HR, person, IT help, ISSI, sales information,
marketing, corporate standards, legal & IP-law, …
Local analysis is essentially information extraction
BACKEND LOCAL ANALYSIS
25
Rule-based vs. Learning-based IE
• Rule-based IE
  – Pro: declarative; easy to comprehend; easy to maintain; easy to incorporate domain knowledge; easy to debug
  – Con: heuristic; requires tedious manual labor
• ML-based IE
  – Pro: trainable; adaptable; reduces manual effort
  – Con: requires labeled data; requires retraining for domain adaptation; requires ML expertise to use or maintain; opaque (not transparent)
BACKEND LOCAL ANALYSIS INFORMATION EXTRACTION
26
Landscape of Entity Extraction Implementations
• NLP papers (2003-2012): 75% machine-learning based, 21% hybrid, 3.5% rule-based
• Commercial vendors (2013): all vendors — 45% rule-based, 22% hybrid, 33% ML-based; large vendors — 67% rule-based, 17% hybrid, 17% ML-based
• Example industrial systems: GATE Information Extraction, IBM InfoSphere BigInsights, Microsoft FAST, SAP HANA, SAS Text Analytics, HP Autonomy, Attensity, Clarabridge
Source: [CLR2013] Rule-based Information Extraction is Dead! Long Live Rule-based Information Extraction Systems!, EMNLP 2013
BACKEND LOCAL ANALYSIS INFORMATION EXTRACTION
27
Local analysis for different features [Zhu et al., WWW’07]
• NavPanel extraction: identify navigation panels on an intranet page via self-link identification
• Title extraction: match title patterns
  – e.g., “IBM Global Services Security Home” → title “IBM Global Services Security”
• Person name in title: title extraction plus dictionary match against the person-name dictionary (= employee directory)
  – e.g., “G J Chaitin Home Page” → name “G J Chaitin”
• URL name extraction: match URL patterns
  – e.g., http://w3-03.ibm.com/marketing/ → “marketing”; http://w3-03.ibm.com/isc/index.html → “isc”; http://chis.at.ibm.com/ → “chis”
BACKEND LOCAL ANALYSIS EXAMPLES
28
Consolidation
– Example: document language consolidation, combining evidence from:
  • HTTP header: Accept-Language: en-us,en;q=0.5
  • Meta tags: <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  • Document text encoding
  • URL: http://enterprise.com/hr/benefits/us/ca/
BACKEND LOCAL ANALYSIS TRANSFORMATIONS
29
Backend Section Outline
• Overview
• Data Ingestion
• Local analysis
• Global analysis
• Indexing
30
Global Analysis
• Deduplication
– Save resources, reduce result clutter
• Identify root of URL hierarchy
– Used for result grouping and ranking
• Anchor text analysis
– Assign external labels to documents
• Social tagging analysis
– Assign tags and their weights to documents
• Identify different versions of the same document
– Due to variations in date, language, …
• Enterprise-specific global analysis
– When certain documents co-exist, do this …
• …
BACKEND GLOBAL ANALYSIS
31
Shingle based deduplication (Leskovec, http://www.mmds.org/)
• Shingles: character or token n-grams; possibly stemmed; possibly excluding stop words
  – Each document → a set of shingles: S1={s1, s2, …}, S2={s1, s3, …}, S3={s2, s3, …}
• Minhash: maps sets to integers; based on a permutation of the universal set
  – Each shingle set → a signature of minhash values: {h1(S1), h2(S1), …}
• Jaccard similarity: |A ∩ B| / |A ∪ B|
• Theorem: the probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets
Works for a more diverse set of documents. More precise.
BACKEND GLOBAL ANALYSIS DEDUPLICATION
32
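The shingle/minhash pipeline above can be sketched as follows; per a common implementation trick, random hash functions stand in for explicit permutations of the universal set, and the fraction of agreeing signature positions estimates Jaccard similarity. The sample texts are hypothetical:

```python
import random

def shingles(text, n=3):
    # Character n-grams (token n-grams work the same way).
    return {text[i : i + n] for i in range(len(text) - n + 1)}

def minhash_signature(shingle_set, num_hashes=64, seed=0):
    # Salted hash functions approximate random permutations; the minimum
    # hash value over the set is one minhash component.
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, s)) for s in shingle_set) for salt in salts]

def estimated_jaccard(sig_a, sig_b):
    # Fraction of agreeing minhash values estimates Jaccard similarity.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature(shingles("enterprise search in the big data era"))
b = minhash_signature(shingles("enterprise search in the big data age"))
c = minhash_signature(shingles("completely different document text"))
print(estimated_jaccard(a, b) > estimated_jaccard(a, c))  # True
```

Near-duplicate pairs can then be found by comparing short signatures instead of whole documents.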
Metadata-based deduplication (IBM Gumshoe search engine)
• Significant metadata: document title; section headers; signatures from the URL
  – Each document → a signature vector: S1=[h11, h12, …], S2=[h21, h22, …], …
  – Ensure that all similar candidates have the same signature
• Group by signature: G1={S1, …}, G2={S2, S3, …}
• In-group similarity analysis: perform detailed analysis of the documents within each candidate group
More customizable for intranets. Less cost.
BACKEND GLOBAL ANALYSIS DEDUPLICATION
33
URL Root Analysis (Zhu et al., WWW’07)
• Given a set of documents, all with the same value V of feature X.
  – E.g., at one time all webpages from the IBM Tucson site had the same title.
• Find the roots of the URL forest. These become the preferred results for the query X=V.
  – E.g., when searching for “Tucson home page”, only the IBM Tucson homepage will match.
[Diagram: URL forest, e.g., host1/b/a is the root of host1/b/a/~user1/pub and host1/b/a/x_index.htm; host1/b/c is the root of host1/b/c/d, host1/b/c/home.html, host1/b/c/d/e/index.html?a=us, host1/b/c/d/e/index.html?a=uk, …]
BACKEND GLOBAL ANALYSIS ROOT ANALYSIS
34
Label Assignment (Zhu et al., WWW’07)
• Anchor text global analysis
  – Documents A1, A2, … link to document B with anchor text such as “X home”
  – Assign label “X” and/or “Y” to B based on frequency
• Social tagging global analysis
  – Bookmarks C1 (“X home”), C2 (“X”), C3 (“Y home”) point to document A2
  – Assign labels “X home”, “X”, and “Y home” based on frequency
BACKEND GLOBAL ANALYSIS LABEL ASSIGNMENT
35
Entity Integration using HIL [Hernández et al, EDBT’13]
• Pipeline: various data sources → information extraction over unstructured data (declarative IE with IBM SystemT [Chiticariu et al, ACL 2010]) → raw records → entity integration (map, entity resolution, fuse, aggregate) → unified entities
• HIL defines entity types (the logical data model of the integration flow) and (SQL-like) rules to specify the integration logic
  – Entity population rules: create entities (from raw records, other entities, and links); clean, normalize, aggregate, fuse
  – Entity resolution rules: create links between raw records or entities
• Optimizing compiler to a Big Data runtime (Jaql and Hadoop)
BACKEND GLOBAL ANALYSIS ENTITY INTEGRATION
36
Backend Section Outline
• Overview
• Data Ingestion
• Local analysis
• Global analysis
• Indexing
37
Indexing
• Generate and index search terms, to be
matched by terms generated at runtime from
user queries.
• Challenges:
– Extracted terms do not match user query terms
• Morphological changes, synonyms, …
– Importance of a term depends on the query
• Need for bucketing of indexes, …
– Support of incremental indexing
BACKEND INDEXING
38
Term normalization
• Example: Date time normalization
– Given any of these
Wed Aug 27 10:06:11 PDT 2014
27 Aug 2014, 10:06:11
2014-08-27T10:06:11-07:00
27 Aug 2014
1409133971
– Normalize to 2014-08-27T10:06:11-07:00
– Other examples: Person names, product names,
…
BACKEND INDEXING TERM NORMALIZATION
39
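The date-time normalization above can be sketched with the standard library; the format list and the default -07:00 zone are illustrative assumptions (a real system would carry many more formats and proper zone rules):

```python
from datetime import datetime, timezone, timedelta

# Illustrative subset of input formats; a real system needs many more.
FORMATS = [
    "%d %b %Y, %H:%M:%S",        # 27 Aug 2014, 10:06:11
    "%Y-%m-%dT%H:%M:%S%z",       # 2014-08-27T10:06:11-07:00
    "%d %b %Y",                  # 27 Aug 2014
]

def normalize_date(raw, tz=timezone(timedelta(hours=-7))):
    """Map a date string (or epoch seconds) to one canonical ISO form."""
    if raw.isdigit():             # Unix timestamp, e.g. "1409133971"
        dt = datetime.fromtimestamp(int(raw), tz)
    else:
        for fmt in FORMATS:
            try:
                dt = datetime.strptime(raw, fmt)
                break
            except ValueError:
                continue
        else:
            return None           # unrecognized format
        if dt.tzinfo is None:     # assume the default zone if none given
            dt = dt.replace(tzinfo=tz)
    return dt.isoformat()

print(normalize_date("27 Aug 2014, 10:06:11"))
# 2014-08-27T10:06:11-07:00
```

Forms like "Wed Aug 27 10:06:11 PDT 2014" need extra care: `strptime`'s `%Z` only accepts a few zone names, so named zones are usually mapped to offsets separately.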
Why Generate Variant Terms?
• Extracted feature string ≠ query string
  – People names
    • Document: “John Doe”; Search: “Doe, John”; Search: “J Doe”
  – Acronym expansions
    • gts → Global Technology Services
  – N-gram variant generation
    • Title: reimbursement of travel expenses
    • Terms: reimbursement, travel expenses, reimbursement travel, reimbursement of travel, reimbursement expenses
• Normalization alone is not a sufficient solution
  – People names
    • Document: John Doe → J. Doe; Search: Jean Doe → J. Doe
    • These are not supposed to match
• Solution:
  – Generate variant terms with different levels of approximation.
BACKEND INDEXING VARIANT TERM GENERATION
40
Configurable Term Generation
• Configuration knobs determine the set of outputs
• Given “Mr. John (Jack) M. Doe Jr.”
  – Configuration 1:
    Initial: both, Dot: with, NickName: both, MiddleName: both, NameSuffix: without, Title: without, Comma: both
    → John M. Doe / Doe, John M.; John Doe / Doe, John; J. M. Doe / Doe, J. M.; J. Doe / Doe, J.; Jack M. Doe / Doe, Jack M.; Jack Doe / Doe, Jack
  – Configuration 2 (normalization):
    Initial: without, Dot: without, NickName: without, MiddleName: without, NameSuffix: without, Title: without, Comma: without
    → John Doe
BACKEND INDEXING VARIANT TERM GENERATION
41
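The knob-driven variant generation above can be sketched as follows, assuming the name has already been parsed into first/middle/last/nickname parts; only a subset of the slide's knobs (Initial, MiddleName, NickName, Comma) is modeled, and the knob handling is an illustrative guess at the mechanism, not the tutorial's implementation:

```python
def name_variants(first, middle, last, nickname,
                  initial="both", middle_name="both",
                  nick="both", comma="both"):
    """Generate indexable variants of a parsed person name.
    Each knob is 'with', 'without', or 'both', as on the slide."""
    def options(knob, with_v, without_v):
        return {"with": [with_v], "without": [without_v],
                "both": [with_v, without_v]}[knob]

    firsts = options(nick, nickname, first) if nickname else [first]
    variants = set()
    for f in firsts:
        for g in options(initial, f[0] + ".", f):          # J. vs John
            for m in options(middle_name, middle + ".", None):
                parts = [g] + ([m] if m else []) + [last]
                for c in options(comma, True, False):
                    if c:   # inverted "Doe, John M." form
                        variants.add(last + ", " + " ".join(parts[:-1]))
                    else:   # natural "John M. Doe" form
                        variants.add(" ".join(parts))
    return variants

v = name_variants("John", "M", "Doe", "Jack")
print("Doe, J." in v, "Jack M. Doe" in v)  # True True
```

With all knobs at "both" this yields the 12 distinct variants shown for Configuration 1 (the initial "J." coincides for John and Jack, so duplicates collapse).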
Enterprise Search Backend
DI → LA → GA → Idx
• Data ingestion (DI): access various sources; document transform; format transform
• Local analysis (LA): information extraction; configurable
• Global analysis (GA): deduplication; URL root analysis; label assignment; …
• Indexing (Idx): generate search terms
BACKEND RECAP
42
Search Engine Architecture
Backend
Collect data
Analyze data
Store and index data
Admin
System performance
Search quality control
Frontend
Interpret user query
Search index
Present results
Interact with user
[Diagram: data source → backend → index → frontend]
Serving User Queries at Front End (52)
1. Ambiguity (29)
2. Ranking (3)
3. Representation (6)
4. Expert Search (6)
5. Privacy (8)
44
1. Ambiguity
• Optimal keywords may not be used.
  – Misspelled (“datbase”) → query cleaning, query autocompletion
  – Under-specified → query refinement
    • polysemy: “java”
    • too general: “database papers”
  – Over-specified → query rewriting
    • synonyms, acronyms, abbreviations & alternative names: “green card” ≡ “permanent residency”
    • too specific: “MS Office 2007 for Mac x64 edition”
  – Non-quantitative (“small laptop”) → query rewriting
45
Summary of Solutions
• query cleaning
– correct various types of spelling errors
• query autocompletion
– prevent spelling errors.
• query refinement
– making queries more specific, returning fewer results.
• query rewriting
– making queries more general / on-topic, returning more
relevant results.
• query forms
– enabling users to specify precise queries
FRONTEND AMBIGUITY
46
Graph-based Spelling Correction
(bao acl 11)
• Repartition the query.
– Each partition (token) should be plausible: confidence
(correcting it) > threshold.
– confidence: linear combination of multiple scores, parameters
learned from SVM.
• Domain knowledge is often used in calculating confidence.
• For each partition, generate candidate corrections with
high scores.
– Example: input “enterpricsea rch” → repartitions “enterpricse arch”, “enterpric search”, “enter pric search”, etc.; for the token “pric”, candidate corrections: price (0.8), prim (0.6), etc.
FRONTEND AMBIGUITY QUERY CLEANING UNSTRUCTURED DATA
47
Graph-based Spelling Correction
(bao acl 11)
• Build a graph that connects candidate corrections.
  – Weights: correction score (node weight), merge penalty (node weight), split penalty (edge weight)
  [Diagram: graph over candidate nodes enterprise, enter, price, prim, arc, sea, rich, search for the input “enterpricsea rch”]
• Each full path is a candidate query.
  – Find the k top-weighted full paths
  – e.g., enterprise → search; enter → price → sea → rich
FRONTEND AMBIGUITY QUERY CLEANING UNSTRUCTURED DATA
48
Graph-based Spelling Correction
(bao acl 11)
• Path weight doesn’t consider term correlations.
• Calculate a score for each path that includes term correlations.
  – This ensures the cleaned query has good-quality results.
  – Correlations are computed from co-occurrence counts.
  – e.g., correlation(“enterprise search”) > correlation(“enterprise arc”)
• Finally, return the paths with high scores.
FRONTEND AMBIGUITY QUERY CLEANING UNSTRUCTURED DATA
49
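The final scoring step above can be sketched as follows; the candidate corrections, their correction scores, and the correlation table are hypothetical stand-ins for values learned from logs and co-occurrence counts, and summing adjacent-term correlations is a simplification of the paper's scoring:

```python
from itertools import product

# Hypothetical candidate corrections per query position, with correction
# scores (node weights), and pairwise correlation scores.
CANDIDATES = [[("enterprise", 0.9), ("enter", 0.5)],
              [("search", 0.9), ("arc", 0.4)]]
CORRELATION = {("enterprise", "search"): 0.8, ("enterprise", "arc"): 0.1,
               ("enter", "search"): 0.2, ("enter", "arc"): 0.1}

def top_paths(candidates, correlation, k=2):
    # Enumerate full paths; score = sum of correction scores plus
    # correlations of adjacent terms; return the k best queries.
    scored = []
    for path in product(*candidates):
        terms = [t for t, _ in path]
        score = sum(s for _, s in path)
        score += sum(correlation.get((x, y), 0.0)
                     for x, y in zip(terms, terms[1:]))
        scored.append((score, " ".join(terms)))
    scored.sort(reverse=True)
    return [q for _, q in scored[:k]]

print(top_paths(CANDIDATES, CORRELATION))
# ['enterprise search', 'enter search']
```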
XClean (lu icde 11)
– Based on the noisy channel model: find the intended word given the user’s input word.
– Results on XML are subtrees rooted at entity nodes.
  • A result quality score is calculated for each entity node in T, and then aggregated.
  • e.g., if Johnny and Mike work in the same department, then “Johnn, Mike” → “Johnny, Mike” rather than “John, Mike”.
    [Diagram: XML subtree — department → head → Johnny; employees → …]
– Processes each word individually, i.e., no merge or split.
Related: query cleaning on relational data (Pu VLDB 08)
FRONTEND AMBIGUITY QUERY CLEANING STRUCTURED DATA
50
Summary of Solutions
• query cleaning
– correct various types of spelling errors
• query autocompletion
– prevent spelling errors.
• query refinement
– making queries more specific, returning fewer results.
• query rewriting
– making queries more general / on-topic, returning more
relevant results.
• query forms
– enabling users to specify precise queries
FRONTEND AMBIGUITY
51
Query Autocompletion
Problem Space Dimensions
showing keywords
vs.
showing results
single keyword
vs.
multiple keyword
exact matching
vs.
fuzzy matching
QUERY AUTOCOMPLETIONFRONTEND AMBIGUITY
52
Error-Tolerating Autocompletion (chaudhuri sigmod 09)
• Problem-space dimensions addressed: showing keywords, single keyword, fuzzy matching
• e.g., “desr” → desert, dessert, deserve
FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
53
Error-Tolerating Autocompletion (chaudhuri sigmod 09)
• Data contains “search”, “sand” and “text”; max. edit distance = 1
[Diagram: trie over {search, sand, text}, showing the active (within-distance) trie nodes for the inputs “”, “s”, “se”, “sen”]
• Showing results instead of keywords can be achieved by associating inverted lists with trie nodes.
FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
54
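A brute-force sketch of the matching criterion above: a word completes the input if some prefix of the word is within the edit-distance budget. The trie shares these dynamic-programming rows across words with a common prefix; here, for clarity, each word is checked independently:

```python
def min_prefix_distance(query, word):
    # Minimum edit distance between `query` and any prefix of `word`
    # (classic Levenshtein dynamic program, taking the minimum over
    # all word-prefix lengths).
    prev = list(range(len(query) + 1))   # distances vs the empty prefix
    best = prev[-1]
    for j, w in enumerate(word, 1):
        cur = [j]
        for i, q in enumerate(query, 1):
            cur.append(min(prev[i] + 1,            # skip word char
                           cur[i - 1] + 1,         # skip query char
                           prev[i - 1] + (q != w)))  # match / substitute
        prev = cur
        best = min(best, cur[-1])
    return best

def complete(query, words, max_dist=1):
    return [w for w in words if min_prefix_distance(query, w) <= max_dist]

print(complete("desr", ["desert", "dessert", "deserve", "sand"]))
# ['desert', 'dessert', 'deserve']
```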
Tastier (li vldbj 11)
• Problem-space dimensions addressed: showing results, multiple keywords, fuzzy matching
• e.g., “have a nni” → show results for “have a nice day”
FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
55
Tastier (li vldbj 11)
• Trie-based (similar to the previous paper).
  – Trie leaf nodes are associated with inverted lists.
• To handle multiple keywords:
  – Each record/document is associated with a sorted list of the words in it (its forward list).
    • e.g., “have a nice day” → forward list “a, day, have, nice”
  – A binary search can then determine whether a string appears in a record/document as a prefix.
  – Why not hash? Because we need to match prefixes, not whole words.
• Inverted-list intersections are computed incrementally, using a cache, for improved efficiency.
FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
56
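The forward-list trick above — binary search over a record's sorted word list to test for a prefix match — can be sketched with the standard library:

```python
from bisect import bisect_left

def has_prefix(forward_list, prefix):
    """forward_list: sorted list of the words in one record.
    bisect_left finds where `prefix` would insert; since the list is
    sorted, the word at that position is the only candidate that can
    start with the prefix."""
    i = bisect_left(forward_list, prefix)
    return i < len(forward_list) and forward_list[i].startswith(prefix)

record = sorted("have a nice day".split())   # ['a', 'day', 'have', 'nice']
print(has_prefix(record, "ni"))   # True  ('nice')
print(has_prefix(record, "nn"))   # False
```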
Phrase Prediction (nandi vldb 07)
• Problem-space dimensions addressed: showing keywords, multiple keywords, exact matching
• e.g., “a nice” → “have a nice day”
FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
57
Phrase Prediction (nandi vldb 07)
• Suggest phrases given the user’s input phrase.
  – Need to find a good length for a suggested phrase.
    • Too short (e.g., “laptop”): utility is small.
    • Too long (e.g., “have a nice day”): low chance of being accepted.
  – Hence, suggest significant phrases.
• (Modified) suffix tree-based.
  – Each node is a word, rather than a letter.
  – Why not a trie: phrases have no definitive starting point. A phrase may start in the middle of a sentence (i.e., at a suffix of the sentence), hence a suffix tree.
FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
58
Summary of Solutions
• query cleaning
– correct various types of spelling errors
• query autocompletion
– prevent spelling errors.
• query refinement
– making queries more specific, returning fewer results.
• query rewriting
– making queries more general / on-topic, returning more
relevant results.
• query forms
– enabling users to specify precise queries
FRONTEND AMBIGUITY
59
Query Refinement
• Motivation
– Some under-specified queries on large data
corpus have too many results.
– Ranking cannot always be perfect.
• Approaches
– Identifying important terms in results
(structured/unstructured)
– Clustering results
(structured/unstructured)
– Faceted search
(structured)
FRONTEND AMBIGUITY QUERY REFINEMENT
60
Using Clustered Results (liu pvldb 11)
• e.g., for the query “Java”, all suggested refinements are about the programming language.
• It is desirable to refine an ambiguous query by its distinct meanings.
FRONTEND AMBIGUITY QUERY REFINEMENT
61
• → Input: clustered results
– clustering method is irrelevant.
– e.g., the result of “Java” may have 3 clusters
corresponding to Java language, Java island, and
Java tea.
• ← Output: one refined query for each cluster.
Each refined query:
– maximally retrieves the results in its cluster
(recall)
– minimally retrieves the results not in its cluster
(precision)
Using Clustered Results (liu pvldb 11)
FRONTEND AMBIGUITY QUERY REFINEMENT
62
Using Important Terms in Results
(tao edbt 09)
• For relational data only.
• Given a keyword query, it outputs top-k most
frequent non-keyword terms in the results,
without generating the results.
– Avoiding result generation is possible since the
terms are ranked only by frequency: tradeoff of
quality and efficiency.
Related — Data Clouds (for structured data): Koutrika EDBT 09
(more sophisticated term ranking, but needs to generate the query results first.)
FRONTEND AMBIGUITY QUERY REFINEMENT
63
Faceted Search
[Diagram: facet tree — all → location: Sunnyvale, CA / Phoenix, AZ / Amherst, MA → department: data management / machine learning → …]
Challenges:
1. How to select facets and facet conditions at each level, to minimize the user’s expected navigation cost?
2. How to rank facets and facet conditions?
(Chakrabarti SIGMOD 04; Kashyap CIKM 10)
FRONTEND AMBIGUITY QUERY REFINEMENT
64
Summary of Solutions
• query cleaning
– correct various types of spelling errors
• query autocompletion
– prevent spelling errors.
• query refinement
– making queries more specific, returning fewer results.
• query rewriting
– making queries more general / on-topic, returning more
relevant results.
• query forms
– enabling users to specify precise queries
FRONTEND AMBIGUITY
65
Query Rewriting
• Motivation
– Synonyms, alternative names: “green card” vs
“permanent residency”.
– Too specific: “MS Office 2007 for Mac x64 edition”
– Non-quantitative: “small laptop”
• Approaches
– Using query/click logs
– Finding rewriting rules from missing results
• e.g., replace “green card” with “permanent residency”.
– Using “differential queries”
FRONTEND AMBIGUITY QUERY REWRITING
66
Using Query and Click Logs (cheng icde 10)
• The availability of query and click logs can be used to assess ground truth.
• Idea: given a query Q, find and return historical queries whose “ground truth” (via the click log) significantly overlaps with the top-k results of Q.
• The returned queries can be synonyms, hypernyms, or hyponyms of Q:
  – synonym: “query” ↔ “search”
  – hypernym: “MySQL” → “database”
  – hyponym: “database” → “MySQL”
FRONTEND AMBIGUITY QUERY REWRITING
67
Automatic Suggestion of Rewriting Rules from Missing Results (bao sigir 12)
• Challenges for automatically generating rewriting rules:
  – rules should be semantically natural.
  – a new rule designed for one query may eliminate good results of another query.
• Example: for the query “green card”, result d is missing / should be ranked higher; d contains the phrase “permanent residency” → rewriting rule: green card → permanent residency
FRONTEND AMBIGUITY QUERY REWRITING
68
→ Input: query q, missed
desirable results d
← Output: selected
set of rules
Generate candidate
rules L → R.
• L: n-grams in q.
• R: n-grams in high-
quality fields of d.
Identify semantically
natural rules by
machine learning.
Greedily select a
subset of rules that
maximizes the
overall query quality.
Automatic Suggestion of Rewriting
Rules from Missing Results (bao sigir 12)
FRONTEND AMBIGUITY QUERY CLEANING
green card → permanent residency
green card → federal government
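The greedy selection step can be sketched as follows; `quality` is a stand-in for the paper's overall query-quality objective evaluated on a validation workload (an assumption for illustration):

```python
def select_rules(candidates, quality, max_rules=5):
    """Greedily add the rewrite rule with the largest marginal gain in
    overall query quality; stop when no remaining rule helps."""
    chosen, best = [], quality([])
    for _ in range(max_rules):
        scored = [(quality(chosen + [r]), r)
                  for r in candidates if r not in chosen]
        if not scored:
            break
        new_best, rule = max(scored, key=lambda pair: pair[0])
        if new_best <= best:  # adding any rule would hurt (or not help)
            break
        chosen.append(rule)
        best = new_best
    return chosen
```

The greedy structure is what keeps a rule designed for one query from eliminating good results of another: a rule is only kept if it raises the aggregate quality.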
69
Keyword++ (Entity Databases)
(xin pvldb 10)
“small IBM laptop”
ID Product Name BrandName Screen Size Description
1 ThinkPad E545 Lenovo 15 The IBM laptop...small
business…
2 ThinkPad X240 Lenovo 12 This notebook...
To “understand” a term, compare two queries that
differ on this term, and analyze the differences of
attribute value distributions in the results.
idea
e.g., to understand term “IBM”, we can compare the results of
“IBM laptop” vs. “laptop”.
FRONTEND AMBIGUITY QUERY CLEANING
70
Suppose: “IBM laptop” → 50 results, 30 having “brand: Lenovo”
“laptop” → 500 results, only 50 having “brand: Lenovo”
The difference on “brand: Lenovo” is significant,
reflecting the meaning of “IBM”.
IBM brand: Lenovo
small order by size ASC
Offline: compute the best mapping for all terms in query log
Online: compute the best segmentation of the query (DP).
“laptop”
“small laptop”
likewise:
Keyword++ (Entity Databases)
(xin pvldb 10)
FRONTEND AMBIGUITY QUERY CLEANING
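A minimal sketch of this differential idea, assuming each result is a flat dict of attribute-value pairs (the schema is illustrative):

```python
from collections import Counter

def strongest_attribute(results_with_term, results_without_term):
    """Find the attribute-value pair whose relative frequency differs most
    between the two result sets; that difference 'explains' the extra term."""
    n1, n2 = len(results_with_term), len(results_without_term)
    f1 = Counter(pair for r in results_with_term for pair in r.items())
    f2 = Counter(pair for r in results_without_term for pair in r.items())
    return max(f1, key=lambda pair: f1[pair] / n1 - f2.get(pair, 0) / n2)
```

With the 50-vs-500 example above, ("brand", "Lenovo") shows the largest positive difference (0.6 vs 0.1), so it is chosen as the meaning of "IBM".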
71
Summary of Solutions
• query cleaning
– correct various types of spelling errors
• query autocompletion
– prevent spelling errors.
• query refinement
– making queries more specific, returning fewer results.
• query rewriting
– making queries more general / on-topic, returning more
relevant results.
• query forms
– enabling users to specify precise queries
FRONTEND AMBIGUITY
72
Offline: how many query forms, and which query
forms, should be generated?
• Too many – hard to find the relevant forms.
• Too few – limiting query expressiveness.
Online: how to identify query forms relevant to
users’ search needs?
Query Forms
Enabling users to issue precise structured queries
without mastering structured query languages.
advantage
challenges
Baid SIGMOD 09 Jayapandian PVLDB 08 Ramesh PVLDB 11 Tang TKDE 13
FRONTEND AMBIGUITY QUERY FORMS
Serving User Queries at Front End (52)
1. Ambiguity (29)
2. Ranking (3)
3. Representation (6)
4. Expert Search (6)
5. Privacy (8)
74
2. Ranking
Ranking Method Categories
Unstructured Data
• represents queries and documents using vectors
• each component is a term; the value is its weight
• ranking score = similarity (query vector, result vector)
Structured Data
• a document → a node or a result (subgraph/subtree)
vector space model
proximity based ranking
…
authority based ranking
…
FRONTEND RANKING
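A minimal vector space scorer, using raw term frequencies in place of the TF-IDF weights a real engine would use:

```python
import math
from collections import Counter

def cosine_score(query_terms, doc_terms):
    """Vector space model: cosine similarity between the query's and the
    document's term-weight vectors (raw term counts as weights here)."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0
```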
75
2. Ranking
Ranking Method Categories
Unstructured Data
• proximity of keyword matches in a document can
boost its ranking.
Structured Data
• weighted tree/graph size, total distance from root to
each leaf, semantic distance, etc.
vector space model
…
authority based ranking
…
proximity based ranking
FRONTEND RANKING
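One common proximity signal is the size of the smallest token window covering all query terms; a sketch (not any specific system's formula):

```python
def min_window(doc_tokens, query_terms):
    """Length of the smallest window of doc tokens containing every query
    term; a smaller window can boost the document's ranking."""
    needed = set(query_terms)
    last_pos, best = {}, None
    for i, tok in enumerate(doc_tokens):
        if tok in needed:
            last_pos[tok] = i
            if len(last_pos) == len(needed):
                window = i - min(last_pos.values()) + 1
                if best is None or window < best:
                    best = window
    return best  # None if some query term never occurs
```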
76
2. Ranking
Ranking Method Categories
vector space model
…
…
Unstructured Data
• nodes linked by many other important nodes are
important.
Structured Data
• authority may flow in both directions of an edge
• different types of edges in the data (e.g., entity-entity
edge, entity-attribute edge) may be treated differently.
proximity based ranking
authority based ranking
FRONTEND RANKING
Serving User Queries at Front End (52)
1. Ambiguity (29)
2. Ranking (3)
3. Representation (6)
4. Expert Search (6)
5. Privacy (8)
78
3. Representation
• Enterprise corpus can be much more
heterogeneous than a collection of documents or
web pages.
• Different searches may have different types:
retrieving a document, a figure, a tuple, a
subgraph, analytical keyword queries, etc.
Result diversification
Result summarization
Result differentiation
solutions
FRONTEND REPRESENTATION
79
Result Diversification
• Result diversification is essentially the same
problem as query refinement.
– e.g., Java → Java language, Java tea, Java island.
• Same techniques apply.
FRONTEND REPRESENTATION DIVERSIFICATION
80
Result Summarization
• Unstructured data: lots of work on text
summarization in machine learning, natural
language processing and IR communities.
• Structured data:
– Size-l object summary (Relational)
– Result snippet (XML)
Das, CMU 07 (unpublished)
Nenkova, Mining Text Data 12
surveys
FRONTEND REPRESENTATION SUMMARIZATION
81
Size-l Object Summary (fakas pvldb 11)
……Mike……
first
window
“Mike”
unstructured
Mike
paper paper patent patent…
conference John …
… … …
… …
?
structured
FRONTEND REPRESENTATION SUMMARIZATION
82
Size-l Object Summary (fakas pvldb 11)
• Each tuple has:
– a static importance score.
• similar idea as PageRank
– a run-time relevance score.
• distance to result root
• connectivity properties to result root
• Objective: find a connected snippet of the result,
which consists of l tuples and has the maximum
score.
• Dynamic programming based solution.
Result snippet for XML: Liu TODS 10
related
FRONTEND REPRESENTATION SUMMARIZATION
83
Result Differentiation
Result 1 Result 2
event: year 2000 2012
paper: title OLAP
data
mining
cloud
scalability
search
“NEC Labs Open House”
result 1: a large table with many
people / papers / posters
result 2: a large table with many
people / papers / posters
…
results result differentiation
vs. comparing different credit cards on a bank website:
only with pre-defined features.
FRONTEND REPRESENTATION DIFFERENTIATION
84
4. Expert Search
documents in which a candidate and a topic co-occur
topics near a candidate in a document
problem solving / ticket routing history
user’s knowledge on a topic
• expert should be more knowledgeable
social relationship between expert and user
• problem solving is usually more effective if expert has a close
social relationship with user
external corpus
• many employees publish content externally, e.g., papers, blogs.
ways for judging an expert
Find an expert within an enterprise to solve a particular problem.
goal
FRONTEND EXPERT SEARCH
85
Classical Methods
• Builds a feature vector for each expert using various
evidence
• Ranks experts based on query, using traditional
retrieval models
candidate model
• First finds documents related to query, then locates
experts in documents
• Mimics the process a human takes.
document model
Balog CIKM 08
survey
FRONTEND EXPERT SEARCH
86
User-Oriented Model (smirnova ecir 11)
Users prefer experts who:
are more knowledgeable
than themselves.
knowledge gain: p(e|q) – p(u|q)
have a close social relationship
with themselves.
time-to-contact: shortest path
department
head
John
employees
…
e = expert
u = user
FRONTEND EXPERT SEARCH
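A hedged sketch of how the two signals might be combined; the linear combination and `alpha` weight are illustrative assumptions, not the model from the paper:

```python
def expert_score(p_e, p_u, social_distance, alpha=0.5):
    """Combine knowledge gain p(e|q) - p(u|q) with time-to-contact
    (shortest path in the social graph) into one expert score."""
    gain = max(p_e - p_u, 0.0)          # expert should know more than user
    closeness = 1.0 / (1 + social_distance)  # shorter path -> easier contact
    return alpha * gain + (1 - alpha) * closeness
```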
87
Using Web Search Engine
(santos inf. process. manage. 11)
query q
result from intranet
web query q’ result from internetformulate
web query
search
intranet
corpus combine
candidate’s full name: “Jeff Smisek”
organization’s name: “IBM”
terms in q: “data integration”
excluding results from organization: “-site:ibm.com”
FRONTEND EXPERT SEARCH
88
Ticket Routing (shao kdd 08)
new ticket: DB2 login failure
transferred to group A
transferred to group B
transferred to group C
resolved
How to find the best group and
reduce problem solving time?
Markov chain model
Using only previous routing
history (not ticket content)
FRONTEND EXPERT SEARCH
89
Ticket Routing (shao kdd 08)
Pr(g|S)
probability of routing a ticket to
group g given previous groups S
Pr(g|S) includes the probability that:
• g can solve the ticket
• g can correctly re-route the ticket.
Train the Markov chain model from ticket routing history.
FRONTEND EXPERT SEARCH
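The training step can be sketched as a first-order Markov model estimated from historical routing sequences (group names below are hypothetical):

```python
from collections import defaultdict

def train_routing_model(routing_history):
    """Estimate Pr(next group | current group) from past routing
    sequences; ticket content is not used, only the routing history."""
    counts = defaultdict(lambda: defaultdict(int))
    for sequence in routing_history:
        for prev, nxt in zip(sequence, sequence[1:]):
            counts[prev][nxt] += 1
    return {g: {n: c / sum(nxts.values()) for n, c in nxts.items()}
            for g, nxts in counts.items()}

def next_group(model, current):
    """Route to the most likely next group given the current one."""
    return max(model[current], key=model[current].get)
```

A richer model would condition on the whole sequence S of previous groups rather than only the last one, as in the paper's formulation Pr(g|S).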
Serving User Queries at Front End (52)
1. Ambiguity (29)
2. Ranking (3)
3. Representation (6)
4. Expert Search (6)
5. Privacy (8)
91
5. Privacy
It is sometimes desirable that the search engine doesn’t
know which documents a user wants to retrieve.
• For users: privacy
• For enterprises: avoiding liability
user privacy
While a search engine answers individual keyword
searches, there are methods that perform multiple
searches and, from the answers, piece together
aggregate information about underlying corpus.
• Enterprises may not want to disclose such information to all
users.
data privacy
92
User Privacy
Private Information Retrieval (PIR)
• old topic, tons of theoretical papers
Modifying search engine. e.g.,
• forcing it to forget user activities
• embellishing queries with decoy terms (Pang PVLDB 10)
Using ghost queries to obfuscate user intention (Pang ICDE 12)
• no change to search engine
• light-weight
solutions
It is sometimes desirable that the search engine doesn’t
know which documents a user wants to retrieve.
• For users: privacy
• For enterprises: avoiding liability
user privacy
93
Private Information Retrieval (PIR)
• Idea: retrieve more documents than needed.
• Naïve: retrieve the entire corpus.
• How to minimize the number of retrieved &
unneeded documents?
• Tons of theoretical papers on different variations
of the problem, e.g.,
– different computation power of the search engine
– different number of non-communicating corpus
replica.
Gasarch EATCS Bulletin 2004
survey
94
Ghost Queries (pang icde 12)
• Challenges
– Generate ghost queries on topics different from user’s
topics of interest, and make it difficult for the search
engine to infer user’s topics.
– Ghost queries need to be meaningful/realistic, so that
they cannot be easily identified.
generate
ghost queries
ghost queries
discard ghost
query results
results
submit to
search engine
user query
95
Ghost Queries (pang icde 12)
• (e1, e2) privacy model
– Given a user query, if the probability of a topic
increases more than e1, it should be reduced to
below e2 by the ghost queries.
• Topics are predefined.
• A ghost query must be coherent: all words in
the ghost query should describe common or
related topics.
• Randomized algorithm based solution.
96
Data Privacy
While a search engine answers individual keyword searches, there
are methods that perform multiple searches and, from the answers,
piece together aggregate information about underlying corpus.
• Enterprises may not want to disclose such information to all users.
data privacy
inserting dummy tuples OR randomly generating attribute values
• only applicable to structured data
disallowing certain queries OR return snippets
• search quality loss
altering a small number of results: adding dummy results;
modifying results, hiding some results (Zhang SIGMOD 12)
solutions
FRONTEND PRIVACY
97
Aggregate Suppression (zhang sigmod 12)
• Example: consider corpus A and B.
– A: n documents
– B: 2n documents
– A ⊂ B
• Goal: suppress COUNT(*), i.e., adversary cannot tell which
corpus is larger.
• Naïve approach 1: deterministically remove n documents from B.
– achieves the goal, but with search utility loss: those n documents can
never be retrieved.
• Naïve approach 2: randomly drop half of the results at run time.
– no search utility loss, but fails to achieve the goal: a clever adversary
can still get the information.
FRONTEND PRIVACY
98
Aggregate Suppression (zhang sigmod 12)
• Algorithm ideas
– carefully adjusting query degree (number of
documents matched by a query) and document
degree (number of queries matching a
document) by document hiding at run-time.
– decline a query if its result can be covered by a
small number of previous queries. Return
previous query results instead.
FRONTEND PRIVACY
99
Backend
Collect data
Analyze data
Store and index data
Admin
System performance
Search quality control/improvement
Admin
System performance
Search quality control/improvement
Frontend
Interpret user query
Search index
Present results
Interact with user
index
Data
source
Tutorial Outline
100
Enterprise Search Administrators
• Main responsibilities
– Care and feeding of an enterprise search solution
• Monitor intranet help inboxes and respond to requests.
• Assist in troubleshooting intranet issues for content contributors
• Core skills required
– Understand general corporate business processes
– Experience in coordinating activities and managing
relationships
• with employees, content administrators, stakeholders, IT teams and
external agencies
Search Admin
Search administrators ≠ IR experts
Key Observation
Admin Overview
101
What Does a Search Administrator Need?
Bad results
for query …
I’m missing the
golden URL…
Result 22 should
be ranked much
higher!
Enterprise Users
Query Logs
Query “global
campus” seems
unsatisfying
• Understand overall search
quality
• Overall trend
• YOY change
• By segmentation
• Understand individual search
results
• Why certain result is or
isn’t brought back
• Its ranking
• Maintain search quality
• Underlying data evolves
• Terminology changes
• Policy/Business Process
changes
• Organization changes
• Hot topics
Search Admin
Admin Overview
102
Understand Search Quality
(Google Search analytics)
Admin Examples
103
Understand Search Quality (Google Search analytics)
Admin Examples
104
What Does a Search Administrator Need?
Bad results
for query …
I’m missing the
golden URL…
Result 22 should
be ranked much
higher!
Enterprise Users
Query Logs
Query “global
campus” seems
unsatisfying
• Understand overall search
quality
• Overall trend
• YOY change
• By segmentation
• Understand individual search
results
• Why certain result is or
isn’t brought back
• Its ranking
• Maintain search quality
• Underlying data evolves
• Terminology changes
• Policy/Business Process
changes
• Organization changes
• Hot topics
Search Admin
Admin Examples
105
Gumshoe Search Quality Toolkit
(bao cikm 12)
Admin Examples
106
Gumshoe Search Quality Toolkit
(bao cikm 12)
Understand
individual query
Admin Examples
107
Gumshoe Search Quality Toolkit
(bao cikm 12)
Examine
search results
Admin Examples
108
Gumshoe Search Quality Toolkit
(bao cikm 12)
Understand why a
result is returned
Admin Examples
109
Gumshoe Search Quality Toolkit
(bao cikm 12)
Understand the
ranking of the result
Admin Examples
110
Gumshoe Search Quality Toolkit
(bao cikm 12)
Investigate a
desired result
Admin Examples
111
Gumshoe Search Quality Toolkit
(bao cikm 12)
Suggest
rewrite rules
Admin Examples
112
Gumshoe Search Quality Toolkit
(bao cikm 12)
Edit runtime
rules
Admin Examples
Enterprise Search in the
Big Data Era
Case Study: IBM Intranet Search
114
Experience at IBM Internal Search
• IBM deployed a commercially available search engine
– Implementing standard IR techniques
• Search quality went down over time to the point that
search results were unacceptable!
Success (≥ 1 relevant result): 14% on top-1, 23% on
top-5, 34% on top-50! [Zhu et al., WWW’07]
So, they implemented various solutions…
To the administrators managing the engine, exposed
control knobs were insufficient
Case Study Background
115
Attempts to Improve Search
• Enhanced link analysis by
incorporating external links
to/from external WWW
• Creative hacks: added fake terms
to documents & queries
– # terms per document determined by
“popularity”: how much TF increase
required for the needed rank boost?
• Hard-coded custom results for the
top 1200+ queries
Didn’t help…
Quality went down!
Maintenance nightmare:
Heuristic needs to be updated
upon each nontrivial change in
term stats./ranking parameters
Even bigger nightmare!
How to deal with continuously
changing terminology?
Case Study Background
116
Goals of Gumshoe
Network Station Manager search
Thin Client Manager
Product names change:
Continually changing terminology
Domain-specific meaning
Paula Summa search
bring Paula Summa from
employee directories
per diem search
Domain-specific repetitions
popcorn search
conference call!
• Result 1: IBM Travel: Per Diem
• Result 2: IBM Travel: Per Diem Rates
• Result 3: IBM Travel: National perdiems
• Result 25: IBM Travel: Per Diem Policy
…
Gumshoe:
• Generic search solution, customizable & maintainable in many domains
– Simple customization with reasonable effort
– Ongoing search-quality management
• Philosophy: programmable search
Case Study Background
117
Programmable Search: Main Idea
• Goals:
– Transparency
• Know “precisely” why every result item is being brought back
• Understand how changes in content/intents affect search
– Maintainability and “Debugability”
• Ranking logic is guided by explicit rules
• Properly react to changes in content/intents
• Building blocks:
– Deep analytics on documents
– Domain-specific analysis of queries
– Transparent customizable rule-driven ranking
runtime rules
backend
analytics
interpretations
Case Study Background
118
Distributed Analytics Platform (IBM InfoSphere BigInsights)
Crawling, information extraction, token generation (TG), indexing
Search runtime
Index
Index and rule
update services
backend
analytics
runtime rules
interpretations
backend
frontend
Implementation Architecture
Case Study Background
119
Backend Analytics: 3 Parts
Local Analysis
(per-page analysis)
Global Analysis
(cross-page analysis)
Token Generation
(TG)
index
Case Study Background
120
Local Analysis
• Categorizing pages
– Label pages by custom categories
• IBM examples: HR, person, IT help, ISSI, sales information,
marketing, corporate standards, legal & IP-law, …
– Geo classification
• Associate documents with the relevant countries & regions
• Annotating pages
– Identify HomePage annotation for people, projects,
communities, …
Simply knowing where a page is physically hosted is not enough
(example: Czech Republic hosts all pages for IBM in Europe)
Case Study Backend Local Analysis
121
• Declarative approach
– Define an operator for each basic operation
• Input tuple of annotations
• Output tuples of annotations
– Compose operators to build complex extractors
• Algebraic expression
• One document at a time → trivial parallelism.
• Benefits of declarative approach:
– Expressivity: Richer, cleaner rule semantics
– Performance: Better performance through optimization
Declarative IE System
Case Study Backend Local Analysis
122
InfoSphere
Streams
Cost-based
optimization
...
SystemT – Overview
InfoSphere
BigInsights
SystemT Runtime
Input
Documents
Extracted
Objects
SystemT
IBM Engines
UIMA
SystemT
Highly embeddable runtime
AQL Extractors
Embedded machine
learning model
AQL Rules
create view SentimentForCompany as
select T.entity, T.polarity
from classifyPolarity (SentimentFeatures ) T;
create view Company as
select ...
from ...
where ...
create view SentimentFeatures as
select ...
from ...;
Case Study Backend Local Analysis
123
G J Chaitin Home Page
Homepage Identification
Title Extraction
Matching title patterns
Titles
Dictionary Match
Home Page for
G J Chaitin
• http://w3.ibm.com/hr/idp/
• http://w3-03.ibm.com/isc/index.html
• http://chis.at.ibm.com/
URL Extraction
URLs
Matching URL patterns
Homepage for: idp isc chis
Employee
directory
… many more …
Intranet
page
[Zhu et al., WWW’07]
Case Study Backend Local Analysis
124 IBM Confidential
Among the 38 pages with the exact same title,
which is the best for “Paula Summa”?
Role of Global Analysis
Case Study Backend Global Analysis
125
Person
Title
Token Generation (TG)
Annotated values Index content
Ching-Tien T. (Howard) Ho
Ho Ching-Tien Tien Ho Ho, Tien
Howard Ho Ching-Tien H. ...
Global Technology Services
TG
Howard Ho Ching Tien ...
gts Global Technology Services
Global Technology Technology
Services Global Technology ...
GlobalTechnologyServices
nGramTG
spaceTG
……
…
…
…
Case Study Backend Token Generation
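The token generators (TGs) in the slide above can be sketched as simple functions from an annotated value to index tokens (simplified; real TGs also handle punctuation, nicknames, middle initials, etc.):

```python
def space_tg(value):
    """spaceTG: the annotated value with internal whitespace removed."""
    return ["".join(value.split())]

def ngram_tg(value, n=2):
    """nGramTG: all word n-grams of the annotated value."""
    words = value.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def acronym_tg(value):
    """acronymTG: acronym built from the initials of the value."""
    return ["".join(w[0].lower() for w in value.split())]
```

For the title "Global Technology Services", these produce "GlobalTechnologyServices", the bigrams "Global Technology" / "Technology Services", and the acronym "gts", matching the index content shown above.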
126
3 Phases of Runtime Flow
Search Query
Phase 1:
Query
Semantics
• Rewrite rules
• Query interpretation
Phase 2:
Relevance
Ranking
By relevance buckets +
conventional IR
Phase 3:
Result
Construction
• Grouping rules
• Re-ranking rules
Case Study Frontend
127
Phase 3: Result Construction
Phase 2: Relevance Ranking
Phase 1: Query Semantics
query search rewrite rules
queries
interpretations
partially ordered interpretations
interpretations execution
partially ordered results
result aggregation
ordered results
grouping rules
ordered & grouped results final results
re-ranking rules
Runtime Flow in More Details
Case Study Frontend
128
Runtime Rules: Pattern-Action Language
(Fagin 2012)
Query Pattern Queries Matching Possible Action
EQUALS
[r=ibm|information|info]
[d=COUNTRY]
• ibm germany
• info india
Rewrite into “[country] hr”
(e.g., germany hr)
ENDS_WITH installation
• acrobat installation
• db2 on aix installation
Replace installation with ISSI
(e.g., acrobat ISSI)
CONTAINS directions to
[d=SITE]
• driving directions to almaden
• directions to watson from jfk
Pages of “siteserv” category
should be ranked higher
STARTS_WITH
[d=PERSON]
• john kelly biography
• steve mills announcement
Group together pages that
represent blog entries
Pattern expression, matched against the keyword query
Perform action when the pattern matches
Query pattern → Action
• Similar to the query-template rules of Agarwal et al. [WWW 2010]
Query Semantics Case Study Frontend
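A toy matcher for two of the pattern types in the table; alternation tokens like `[r=ibm|information|info]` are omitted for brevity, and the dictionary handling is an illustrative simplification:

```python
def match_rule(kind, pattern, query, dictionaries):
    """Match a query pattern against a tokenized keyword query.
    A pattern token like '[d=COUNTRY]' matches any entry of the named
    dictionary; other tokens must match literally."""
    tokens = query.lower().split()

    def tok_match(p, t):
        if p.startswith("[d=") and p.endswith("]"):
            return t in dictionaries[p[3:-1]]
        return t == p

    if kind == "EQUALS":
        return len(tokens) == len(pattern) and all(map(tok_match, pattern, tokens))
    if kind == "ENDS_WITH":
        return len(tokens) >= len(pattern) and all(
            map(tok_match, pattern, tokens[-len(pattern):]))
    return False
```

When a rule matches, its action fires: rewrite the query, boost a category, or group results, as in the table.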
129
What’s Best for Benefits?
Query Semantics Case Study Frontend
130
The most important IBM page for benefits
changes over time: currently it is netbenefits
What’s Best for Benefits?
Query Semantics Case Study Frontend
131
Rewrite Rules
benefits netbenefits
Query Semantics Case Study Frontend
132
Rewrite Rules
benefits netbenefits
interpretations
partially ordered interpretations
interpretations execution
partially ordered results
result aggregation
ordered results
grouping rules
ordered & grouped results final results
re-ranking rules
benefits, netbenefits
benefits netbenefits
rewrite rules
queries
benefits search
Query Semantics Case Study Frontend
133
People with
first name Jim
How can we avoid pages
from people category?
java jim
Complex Rules
Query Semantics Case Study Frontend
134
Complex Rules
java jim and not in person category
Query Semantics Case Study Frontend
135
Complex Rules
java jim and not in person category
interpretations execution
partially ordered results
result aggregation
ordered results
grouping rules
ordered & grouped results final results
re-ranking rules
interpretations
partially ordered interpretations
rewrite rules
queries
java search
Query Semantics Case Study Frontend
136
Interpretations
Scenario: An IBM employee wants
to download Lotus Symphony 1.3
Runtime interpretation:
download symphony 1.3 category=issi software=symphony 1.3
Query Semantics Case Study Frontend
137
Complex Rules
java jim and not in person category
interpretations execution
partially ordered results
result aggregation
ordered results
grouping rules
ordered & grouped results final results
re-ranking rules
interpretations
partially ordered interpretations
rewrite rules
queries
java search
Query Semantics Case Study Frontend
138
3 Phases of Runtime Flow
Search Query
Phase 1:
Query
Semantics
• Rewrite rules
• Query interpretation
Phase 2:
Relevance
Ranking
By relevance buckets +
conventional IR
Phase 3:
Result
Construction
• Grouping rules
• Re-ranking rules
Relevance Ranking Case Study Frontend
139
Person
Title
Recall: Token Generation (TG)
Annotated values Index content
Ching-Tien T. (Howard) Ho
Global Technology Services
TG
Howard Ho Ching Tien ...
gts Global Technology Services
Global Technology Technology
Services Global Technology ...
GlobalTechnologyServices
nGramTG
spaceTG
……
…
…
…
Ho Ching-Tien Tien Ho Ho, Tien
Howard Ho Ching-Tien H. ...Person + personNameTG
Person + nGramTG
Title + acronymTG
Title + spaceTG
Title + nGramTG
Relevance Ranking Case Study Frontend
140
Annotation + TG Relevance Bucket
Howard Ho Ching Tien ...
GlobalTechnologyServices
……
Person + personNameTG
Person + nGramTG
Title + acronymTG
Title + spaceTG
Title + nGramTG
query search Relevance buckets
•Buckets are ranked
– Based on annotation type
– Based on TG quality
•A page can belong to
multiple buckets
•Within each bucket,
ranking is by
conventional IR
……
Relevance Ranking Case Study Frontend
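Bucket-based ranking can be sketched as a sort by (best bucket rank, IR score); the bucket names below are illustrative:

```python
def rank_results(matches, bucket_order):
    """Order pages first by the best relevance bucket they matched
    (buckets ranked by annotation type + TG quality), then by
    conventional IR score within a bucket."""
    rank_of = {b: i for i, b in enumerate(bucket_order)}
    best = {}
    for doc, bucket, ir_score in matches:  # a page can be in several buckets
        key = (rank_of[bucket], -ir_score)
        if doc not in best or key < best[doc]:
            best[doc] = key
    return sorted(best, key=best.get)
```

A page in a higher-quality bucket outranks any page only in lower buckets, regardless of raw IR score.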
141
Ranking by Relevance Buckets
grouping rules
ordered & grouped results final results
re-ranking rules
interpretations
partially ordered interpretations
rewrite rules
queries
interpretations execution
partially ordered results
result aggregation
ordered results
employment verification search
Relevance Ranking Case Study Frontend
142
3 Phases of Runtime Flow
Search Query
Phase 1:
Query
Semantics
• Rewrite rules
• Query interpretation
Phase 2:
Relevance
Ranking
By relevance buckets +
conventional IR
Phase 3:
Result
Construction
• Grouping rules
• Re-ranking rules
Result Construction Case Study Frontend
143
Grouping Rules
• Grouping rules define how search results should be
grouped together
• Search administrators can improve the diversity of
search results (in 1st page)
– Based on their familiarity with the data sources
Group pages of the same category
per diem travel, you-and-ibm
ANY ISSI, IT Help Central, Forum,
Bluepedia, Media Library, …
Query pattern
Result Construction Case Study Frontend
144
Need first page diversity
Flooding with Similar Pages
Result Construction Case Study Frontend
145
per diem travel, you-and-ibm
Grouping Rule to the Rescue
Result Construction Case Study Frontend
146
per diem travel, you-and-ibm
final results
re-ranking rules
interpretations
partially ordered interpretations
rewrite rules
queries
interpretations execution
partially ordered results
result aggregation
ordered results
grouping rules
ordered & grouped results
per diem search
Grouping Rule to the Rescue
Result Construction Case Study Frontend
147
Re-ranking Rules
• Re-ranking rules adjust ranking of
search results based on categories
• Example: search administrator specifies the
important sources of “hot/current topics”
Hot topics Rank these categories higher
Bluepedia, News, About-IBM
smarter planet, cloud
computing, centennial, …
Result Construction Case Study Frontend
148
Bluepedia
Technical News
Homepages of
“About IBM”
Hot topics Rank these categories higher
Bluepedia, News, About-IBM
smarter planet, cloud
computing, centennial, …
Re-ranking Rule for Hot Topics
Result Construction Case Study Frontend
149
Re-ranking Rules for Person Queries
[d=PERSON]
executive_corner, media_library,
organization_chart, files
Result Construction Case Study Frontend
150
per diem travel, you-and-ibm
final results
re-ranking rules
interpretations
partially ordered interpretations
rewrite rules
queries
interpretations execution
partially ordered results
result aggregation
ordered results
grouping rules
ordered & grouped results
per diem search
Grouping Rule to the Rescue
Result Construction Case Study Frontend
151
3 Phases of Runtime Flow
Search Query
Phase 1:
Query
Semantics
• Rewrite rules
• Query interpretation
Phase 2:
Relevance
Ranking
By relevance buckets +
conventional IR
Phase 3:
Result
Construction
• Grouping rules
• Re-ranking rules
Case Study Frontend
152
What Administrators Need…
• Search administrators have major problems
with an opaque search engine
• Programmable search provides
– Customization to the specific domain
– Ongoing search-quality management
Allows the building of search quality toolkit.
Recap:
Case Study Admin
153
Gumshoe Search Quality Toolkit!
Case Study Admin
Demo
155
Demo
Case Study Admin
156
The Proof of the Pudding Is in the Eating
• Immediate positive impact within first 3 months
– Improved natural clickthrough rate by 100%+
– Top 5 results: selected about 90% of the time
• Sustained search quality improvements for 4 years since
going live
• Stable natural search clickthrough rate
Gumshoe (Aug. 2011– Oct. 2011)
Old Intranet Search (Aug. 2010– Aug. 2011)
Natural
clickthrough
rate
Case Study Results
157
Summary
Programmable search:
Simple & flexible customization
Search quality management
Backend Analytics
Local analysis
(per-page analysis)
Global Analysis
(cross-page analysis)
Token Generation
(TG)
[Fagin et al.,
PODS’10,
PODS’11]
Tooling
• Search provenance
• Rule suggestion
• Utilization of relevance buckets
[Li et al.,
SIGIR’06,
Zhu et al.,
WWW’07]
Phase 1:
Query Semantics
• Rewrite rules
• Query interpretation
Phase 2:
Relevance
Ranking
By relevance buckets +
conventional IR
Phase 3:
Result
Construction
• Grouping rules
• Re-ranking rules
[ Bao et al.,
ACL’2010,
SIGIR’2012
CIKM’2012]
Case Study Summary
Enterprise Search in the
Big Data Era
Future Directions
159
Search Engine Components
Backend
Collect data
Analyze data
Store and index data
Admin
System performance
Search quality control/improvement
Frontend
Interpret user query
Search index
Present results
Interact with user
index
Data
source
160
Future Directions
Data Heterogeneity
A rich variety of data types need to be searched in
enterprises.
• docs, databases, images, videos, social graphs, etc.
observations
How to automatically identify relevant data types, and
search and rank across different data types?
• e.g., for image search, should image recognition techniques
be incorporated in enterprise search engines? If so, how?
questions
161
Future Directions
Data Freshness
New data is continuously collected and published in
enterprises, often at a very high rate.
Web search engines are not required to index new websites
quickly, but in enterprises, new content may need to be
searchable as soon as possible.
observations
How to build efficient real-time indexes to ensure data
freshness in enterprise search?
questions
162
Future Directions
Search Context
Enterprise search users have richer profiles than web users.
• activities, bio, position, projects, experiences, etc.
observations
How to utilize users’ contexts to provide customized results?
Is it possible to predict the information a user may want, and
push it to the user?
questions
163
Future Directions
User Preference
Different users in an enterprise have different expertise, and
may prefer different ways to express queries.
• e.g., some users prefer pure keyword search, while
others may want lightly-structured queries.
observations
How to effectively satisfy different needs for expressing
queries for different users?
questions
164
Future Directions
Question Answering
The purpose of many enterprise searches is to find
answers to questions.
• e.g., what is the previous name of a product, and when
did we change to the current name?
observations
Is it possible to effectively use natural language processing
techniques and domain knowledge to automatically answer
natural language questions?
questions
165
Future Directions
Transactional Search
Over 1/3 of enterprise search queries are transactional. It would
be desirable for enterprise search engines to recommend
business processes to accomplish a given task in response to a
transactional search.
• E.g., given a customer’s lengthy complaint letter, how to find
out the departments relevant to the complaints.
observations
How to better support transactional search? How to initiate
a business process based on the results of a search?
questions
166
Future Directions
Big Data Analytics
Rich information and knowledge lies in big data. Many
employees (not just data analysts) may benefit from the
ability to perform analytics on the company’s big data.
observations
How to build a low-cost, interactive platform that allows a
large number of employees to issue analytical queries?
How to give employees the capabilities to analyze big data,
if they have little knowledge of SQL or MapReduce
programming?
questions
167
Future Directions
Tooling for Search Quality Maintenance
Most enterprise search engines have to be manually
evaluated and tuned by a search administrator with domain
knowledge, in an ad-hoc fashion.
observations
Can we automate this process, or at least minimize manual
involvement?
Can we fully utilize explicit user feedback?
• Explicit user feedback is easier to obtain in enterprise
search, and there is less spam.
questions
Thanks.
Acknowledgement:
IBM Research: Sriram Raghavan, Fred Reiss, Shiv Vaithyanathan, Ron Fagin
IBM CIO’s Office: Nicole Dri, Brian C. Meyer
LogicBlox: Benny Kimelfeld*
TripAdvisor: Adriano Crestani Campos*
Facebook: Zhuowei Bao*
NJIT: Yi Chen
UNSW: Wei Wang
* work done while at IBM
Going Beyond Rows and Columns with Graph AnalyticsCambridge Semantics
 
Applied Semantic Search 201306
Applied Semantic Search 201306Applied Semantic Search 201306
Applied Semantic Search 201306Mark Tabladillo
 
Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...
Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...
Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...semanticsconference
 
Scaling Box-Search: Gearing up for Petabyte Scale - Shubhro Roy & Anthony Urb...
Scaling Box-Search: Gearing up for Petabyte Scale - Shubhro Roy & Anthony Urb...Scaling Box-Search: Gearing up for Petabyte Scale - Shubhro Roy & Anthony Urb...
Scaling Box-Search: Gearing up for Petabyte Scale - Shubhro Roy & Anthony Urb...Lucidworks
 
Knowledge Graph Discussion: Foundational Capability for Data Fabric, Data Int...
Knowledge Graph Discussion: Foundational Capability for Data Fabric, Data Int...Knowledge Graph Discussion: Foundational Capability for Data Fabric, Data Int...
Knowledge Graph Discussion: Foundational Capability for Data Fabric, Data Int...Cambridge Semantics
 
Semantic Technology in Publishing & Finance
Semantic Technology in Publishing & FinanceSemantic Technology in Publishing & Finance
Semantic Technology in Publishing & FinanceVladimir Alexiev, PhD, PMP
 
Meetup SF - Amundsen
Meetup SF  -  AmundsenMeetup SF  -  Amundsen
Meetup SF - AmundsenPhilippe Mizrahi
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentationTao Feng
 
Kerstin Diwisch | Towards a holistic visualization management for knowledge g...
Kerstin Diwisch | Towards a holistic visualization management for knowledge g...Kerstin Diwisch | Towards a holistic visualization management for knowledge g...
Kerstin Diwisch | Towards a holistic visualization management for knowledge g...semanticsconference
 
Graph-Powered Machine Learning
Graph-Powered Machine LearningGraph-Powered Machine Learning
Graph-Powered Machine LearningDatabricks
 

Mais procurados (20)

Vital AI: Big Data Modeling
Vital AI: Big Data ModelingVital AI: Big Data Modeling
Vital AI: Big Data Modeling
 
Automatic suggestion of query-rewrite rules for enterprise search
Automatic suggestion of query-rewrite rules for enterprise searchAutomatic suggestion of query-rewrite rules for enterprise search
Automatic suggestion of query-rewrite rules for enterprise search
 
Ben Gardner | Delivering a Linked Data warehouse and integrating across the w...
Ben Gardner | Delivering a Linked Data warehouse and integrating across the w...Ben Gardner | Delivering a Linked Data warehouse and integrating across the w...
Ben Gardner | Delivering a Linked Data warehouse and integrating across the w...
 
Modern Data Discovery and Integration in Insurance
Modern Data Discovery and Integration in InsuranceModern Data Discovery and Integration in Insurance
Modern Data Discovery and Integration in Insurance
 
Applied Enterprise Semantic Search 201305
Applied Enterprise Semantic Search 201305Applied Enterprise Semantic Search 201305
Applied Enterprise Semantic Search 201305
 
Felix Burkhardt | ARCHITECTURE FOR A QUESTION ANSWERING MACHINE
Felix Burkhardt | ARCHITECTURE FOR A QUESTION ANSWERING MACHINEFelix Burkhardt | ARCHITECTURE FOR A QUESTION ANSWERING MACHINE
Felix Burkhardt | ARCHITECTURE FOR A QUESTION ANSWERING MACHINE
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
 
Investment Fund Analytics
Investment Fund AnalyticsInvestment Fund Analytics
Investment Fund Analytics
 
Improving Search in Workday Products using Natural Language Processing
Improving Search in Workday Products using Natural Language ProcessingImproving Search in Workday Products using Natural Language Processing
Improving Search in Workday Products using Natural Language Processing
 
Deep learning for e-commerce: current status and future prospects
Deep learning for e-commerce: current status and future prospectsDeep learning for e-commerce: current status and future prospects
Deep learning for e-commerce: current status and future prospects
 
Going Beyond Rows and Columns with Graph Analytics
Going Beyond Rows and Columns with Graph AnalyticsGoing Beyond Rows and Columns with Graph Analytics
Going Beyond Rows and Columns with Graph Analytics
 
Applied Semantic Search 201306
Applied Semantic Search 201306Applied Semantic Search 201306
Applied Semantic Search 201306
 
Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...
Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...
Stephen Buxton | Data Integration - a Multi-Model Approach - Documents and Tr...
 
Scaling Box-Search: Gearing up for Petabyte Scale - Shubhro Roy & Anthony Urb...
Scaling Box-Search: Gearing up for Petabyte Scale - Shubhro Roy & Anthony Urb...Scaling Box-Search: Gearing up for Petabyte Scale - Shubhro Roy & Anthony Urb...
Scaling Box-Search: Gearing up for Petabyte Scale - Shubhro Roy & Anthony Urb...
 
Knowledge Graph Discussion: Foundational Capability for Data Fabric, Data Int...
Knowledge Graph Discussion: Foundational Capability for Data Fabric, Data Int...Knowledge Graph Discussion: Foundational Capability for Data Fabric, Data Int...
Knowledge Graph Discussion: Foundational Capability for Data Fabric, Data Int...
 
Semantic Technology in Publishing & Finance
Semantic Technology in Publishing & FinanceSemantic Technology in Publishing & Finance
Semantic Technology in Publishing & Finance
 
Meetup SF - Amundsen
Meetup SF  -  AmundsenMeetup SF  -  Amundsen
Meetup SF - Amundsen
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
 
Kerstin Diwisch | Towards a holistic visualization management for knowledge g...
Kerstin Diwisch | Towards a holistic visualization management for knowledge g...Kerstin Diwisch | Towards a holistic visualization management for knowledge g...
Kerstin Diwisch | Towards a holistic visualization management for knowledge g...
 
Graph-Powered Machine Learning
Graph-Powered Machine LearningGraph-Powered Machine Learning
Graph-Powered Machine Learning
 

Destaque

Enterprise Search: How do we get there from here?
Enterprise Search: How do we get there from here?Enterprise Search: How do we get there from here?
Enterprise Search: How do we get there from here?Daniel Tunkelang
 
Social Data Analytics using IBM Big Data Technologies
Social Data Analytics using IBM Big Data TechnologiesSocial Data Analytics using IBM Big Data Technologies
Social Data Analytics using IBM Big Data TechnologiesNicolas Morales
 
Results from the Enterprise Search and Findability Survey 2012
Results from the Enterprise Search and Findability Survey 2012Results from the Enterprise Search and Findability Survey 2012
Results from the Enterprise Search and Findability Survey 2012Findwise
 
OK So Enterprise Search is "Janky" - Now What?
OK So Enterprise Search is "Janky" - Now What?OK So Enterprise Search is "Janky" - Now What?
OK So Enterprise Search is "Janky" - Now What?Earley Information Science
 
Enterprise Knowledge Graph
Enterprise Knowledge GraphEnterprise Knowledge Graph
Enterprise Knowledge GraphLukas Masuch
 
Groundbreaking and Game-changing Enterprise Search Webinar
Groundbreaking and Game-changing Enterprise Search WebinarGroundbreaking and Game-changing Enterprise Search Webinar
Groundbreaking and Game-changing Enterprise Search WebinarConcept Searching, Inc
 
An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.Jurriaan Persyn
 

Destaque (7)

Enterprise Search: How do we get there from here?
Enterprise Search: How do we get there from here?Enterprise Search: How do we get there from here?
Enterprise Search: How do we get there from here?
 
Social Data Analytics using IBM Big Data Technologies
Social Data Analytics using IBM Big Data TechnologiesSocial Data Analytics using IBM Big Data Technologies
Social Data Analytics using IBM Big Data Technologies
 
Results from the Enterprise Search and Findability Survey 2012
Results from the Enterprise Search and Findability Survey 2012Results from the Enterprise Search and Findability Survey 2012
Results from the Enterprise Search and Findability Survey 2012
 
OK So Enterprise Search is "Janky" - Now What?
OK So Enterprise Search is "Janky" - Now What?OK So Enterprise Search is "Janky" - Now What?
OK So Enterprise Search is "Janky" - Now What?
 
Enterprise Knowledge Graph
Enterprise Knowledge GraphEnterprise Knowledge Graph
Enterprise Knowledge Graph
 
Groundbreaking and Game-changing Enterprise Search Webinar
Groundbreaking and Game-changing Enterprise Search WebinarGroundbreaking and Game-changing Enterprise Search Webinar
Groundbreaking and Game-changing Enterprise Search Webinar
 
An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.
 

Semelhante a Enterprise Search in the Big Data Era: Recent Developments and Open Challenges

Take Cloud Hybrid Search to the Next Level
Take Cloud Hybrid Search to the Next LevelTake Cloud Hybrid Search to the Next Level
Take Cloud Hybrid Search to the Next LevelJeff Fried
 
SharePoint NYC search presentation
SharePoint NYC search presentationSharePoint NYC search presentation
SharePoint NYC search presentationjtbarrera
 
Session #2, tech session: Build realtime search by Sylvain Utard from Algolia
Session #2, tech session: Build realtime search by Sylvain Utard from AlgoliaSession #2, tech session: Build realtime search by Sylvain Utard from Algolia
Session #2, tech session: Build realtime search by Sylvain Utard from AlgoliaSaaS Is Beautiful
 
Data Model for Mainframe in Splunk: The Newest Feature of Ironstream
Data Model for Mainframe in Splunk: The Newest Feature of IronstreamData Model for Mainframe in Splunk: The Newest Feature of Ironstream
Data Model for Mainframe in Splunk: The Newest Feature of IronstreamPrecisely
 
Focused Crawling for Structured Data
Focused Crawling for Structured DataFocused Crawling for Structured Data
Focused Crawling for Structured DataRobert Meusel
 
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...Zaloni
 
Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)Petter Skodvin-Hvammen
 
SPConnections Amsterdam: Beyond the Search Center - Application or Solution? ...
SPConnections Amsterdam: Beyond the Search Center - Application or Solution? ...SPConnections Amsterdam: Beyond the Search Center - Application or Solution? ...
SPConnections Amsterdam: Beyond the Search Center - Application or Solution? ...Agnes Molnar
 
Solving the Disconnected Data Problem in Healthcare Using MongoDB
Solving the Disconnected Data Problem in Healthcare Using MongoDBSolving the Disconnected Data Problem in Healthcare Using MongoDB
Solving the Disconnected Data Problem in Healthcare Using MongoDBMongoDB
 
MongoDB Days Germany: Data Processing with MongoDB
MongoDB Days Germany: Data Processing with MongoDBMongoDB Days Germany: Data Processing with MongoDB
MongoDB Days Germany: Data Processing with MongoDBMongoDB
 
awari-ds-aula1.pdf
awari-ds-aula1.pdfawari-ds-aula1.pdf
awari-ds-aula1.pdfMarcos993896
 
How did it go? The first large enterprise search project in Europe using Shar...
How did it go? The first large enterprise search project in Europe using Shar...How did it go? The first large enterprise search project in Europe using Shar...
How did it go? The first large enterprise search project in Europe using Shar...Petter Skodvin-Hvammen
 
Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Pe...
Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Pe...Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Pe...
Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Pe...MongoDB
 
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...Flink Forward
 
CREATE SEARCH DRIVEN BUSINESS INTELLIGENCE APPLICATION USING FAST SEARCH FO...
CREATE SEARCH DRIVEN BUSINESS  INTELLIGENCE APPLICATION USING  FAST SEARCH FO...CREATE SEARCH DRIVEN BUSINESS  INTELLIGENCE APPLICATION USING  FAST SEARCH FO...
CREATE SEARCH DRIVEN BUSINESS INTELLIGENCE APPLICATION USING FAST SEARCH FO...Netwoven Inc.
 
The Enterprise Search Market in a Nutshell
The Enterprise Search Market in a NutshellThe Enterprise Search Market in a Nutshell
The Enterprise Search Market in a NutshellDr. Haxel Consult
 
Webinar: Scaling MongoDB
Webinar: Scaling MongoDBWebinar: Scaling MongoDB
Webinar: Scaling MongoDBMongoDB
 
Understanding and Applying Cloud Hybrid Search
Understanding and Applying Cloud Hybrid SearchUnderstanding and Applying Cloud Hybrid Search
Understanding and Applying Cloud Hybrid SearchJeff Fried
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discoverymarkgrover
 

Semelhante a Enterprise Search in the Big Data Era: Recent Developments and Open Challenges (20)

Take Cloud Hybrid Search to the Next Level
Take Cloud Hybrid Search to the Next LevelTake Cloud Hybrid Search to the Next Level
Take Cloud Hybrid Search to the Next Level
 
SharePoint NYC search presentation
SharePoint NYC search presentationSharePoint NYC search presentation
SharePoint NYC search presentation
 
Session #2, tech session: Build realtime search by Sylvain Utard from Algolia
Session #2, tech session: Build realtime search by Sylvain Utard from AlgoliaSession #2, tech session: Build realtime search by Sylvain Utard from Algolia
Session #2, tech session: Build realtime search by Sylvain Utard from Algolia
 
Data Model for Mainframe in Splunk: The Newest Feature of Ironstream
Data Model for Mainframe in Splunk: The Newest Feature of IronstreamData Model for Mainframe in Splunk: The Newest Feature of Ironstream
Data Model for Mainframe in Splunk: The Newest Feature of Ironstream
 
Focused Crawling for Structured Data
Focused Crawling for Structured DataFocused Crawling for Structured Data
Focused Crawling for Structured Data
 
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...
 
Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)Share point 2013 enterprise search (public)
Share point 2013 enterprise search (public)
 
SPConnections Amsterdam: Beyond the Search Center - Application or Solution? ...
SPConnections Amsterdam: Beyond the Search Center - Application or Solution? ...SPConnections Amsterdam: Beyond the Search Center - Application or Solution? ...
SPConnections Amsterdam: Beyond the Search Center - Application or Solution? ...
 
Solving the Disconnected Data Problem in Healthcare Using MongoDB
Solving the Disconnected Data Problem in Healthcare Using MongoDBSolving the Disconnected Data Problem in Healthcare Using MongoDB
Solving the Disconnected Data Problem in Healthcare Using MongoDB
 
MongoDB Days Germany: Data Processing with MongoDB
MongoDB Days Germany: Data Processing with MongoDBMongoDB Days Germany: Data Processing with MongoDB
MongoDB Days Germany: Data Processing with MongoDB
 
awari-ds-aula1.pdf
awari-ds-aula1.pdfawari-ds-aula1.pdf
awari-ds-aula1.pdf
 
How did it go? The first large enterprise search project in Europe using Shar...
How did it go? The first large enterprise search project in Europe using Shar...How did it go? The first large enterprise search project in Europe using Shar...
How did it go? The first large enterprise search project in Europe using Shar...
 
Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Pe...
Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Pe...Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Pe...
Big Data Analytics 2: Leveraging Customer Behavior to Enhance Relevancy in Pe...
 
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
 
SharePoint Custom Development
SharePoint Custom DevelopmentSharePoint Custom Development
SharePoint Custom Development
 
CREATE SEARCH DRIVEN BUSINESS INTELLIGENCE APPLICATION USING FAST SEARCH FO...
CREATE SEARCH DRIVEN BUSINESS  INTELLIGENCE APPLICATION USING  FAST SEARCH FO...CREATE SEARCH DRIVEN BUSINESS  INTELLIGENCE APPLICATION USING  FAST SEARCH FO...
CREATE SEARCH DRIVEN BUSINESS INTELLIGENCE APPLICATION USING FAST SEARCH FO...
 
The Enterprise Search Market in a Nutshell
The Enterprise Search Market in a NutshellThe Enterprise Search Market in a Nutshell
The Enterprise Search Market in a Nutshell
 
Webinar: Scaling MongoDB
Webinar: Scaling MongoDBWebinar: Scaling MongoDB
Webinar: Scaling MongoDB
 
Understanding and Applying Cloud Hybrid Search
Understanding and Applying Cloud Hybrid SearchUnderstanding and Applying Cloud Hybrid Search
Understanding and Applying Cloud Hybrid Search
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
 

Mais de Yunyao Li

The Role of Patterns in the Era of Large Language Models
The Role of Patterns in the Era of Large Language ModelsThe Role of Patterns in the Era of Large Language Models
The Role of Patterns in the Era of Large Language ModelsYunyao Li
 
Building, Growing and Serving Large Knowledge Graphs with Human-in-the-Loop
Building, Growing and Serving Large Knowledge Graphs with Human-in-the-LoopBuilding, Growing and Serving Large Knowledge Graphs with Human-in-the-Loop
Building, Growing and Serving Large Knowledge Graphs with Human-in-the-LoopYunyao Li
 
Meaning Representations for Natural Languages: Design, Models and Applications
Meaning Representations for Natural Languages:  Design, Models and ApplicationsMeaning Representations for Natural Languages:  Design, Models and Applications
Meaning Representations for Natural Languages: Design, Models and ApplicationsYunyao Li
 
Towards Deep Table Understanding
Towards Deep Table UnderstandingTowards Deep Table Understanding
Towards Deep Table UnderstandingYunyao Li
 
Explainability for Natural Language Processing
Explainability for Natural Language ProcessingExplainability for Natural Language Processing
Explainability for Natural Language ProcessingYunyao Li
 
Explainability for Natural Language Processing
Explainability for Natural Language ProcessingExplainability for Natural Language Processing
Explainability for Natural Language ProcessingYunyao Li
 
Towards Universal Language Understanding
Towards Universal Language UnderstandingTowards Universal Language Understanding
Towards Universal Language UnderstandingYunyao Li
 
Explainability for Natural Language Processing
Explainability for Natural Language ProcessingExplainability for Natural Language Processing
Explainability for Natural Language ProcessingYunyao Li
 
Towards Universal Language Understanding (2020 version)
Towards Universal Language Understanding (2020 version)Towards Universal Language Understanding (2020 version)
Towards Universal Language Understanding (2020 version)Yunyao Li
 
Towards Universal Semantic Understanding of Natural Languages
Towards Universal Semantic Understanding of Natural LanguagesTowards Universal Semantic Understanding of Natural Languages
Towards Universal Semantic Understanding of Natural LanguagesYunyao Li
 
An In-depth Analysis of the Effect of Text Normalization in Social Media
An In-depth Analysis of the Effect of Text Normalization in Social MediaAn In-depth Analysis of the Effect of Text Normalization in Social Media
An In-depth Analysis of the Effect of Text Normalization in Social MediaYunyao Li
 
Exploiting Structure in Representation of Named Entities using Active Learning
Exploiting Structure in Representation of Named Entities using Active LearningExploiting Structure in Representation of Named Entities using Active Learning
Exploiting Structure in Representation of Named Entities using Active LearningYunyao Li
 
K-SRL: Instance-based Learning for Semantic Role Labeling
K-SRL: Instance-based Learning for Semantic Role LabelingK-SRL: Instance-based Learning for Semantic Role Labeling
K-SRL: Instance-based Learning for Semantic Role LabelingYunyao Li
 
Coling poster
Coling posterColing poster
Coling posterYunyao Li
 
Coling demo
Coling demoColing demo
Coling demoYunyao Li
 
Natural Language Data Management and Interfaces: Recent Development and Open ...
Natural Language Data Management and Interfaces: Recent Development and Open ...Natural Language Data Management and Interfaces: Recent Development and Open ...
Natural Language Data Management and Interfaces: Recent Development and Open ...Yunyao Li
 
Polyglot: Multilingual Semantic Role Labeling with Unified Labels
Polyglot: Multilingual Semantic Role Labeling with Unified LabelsPolyglot: Multilingual Semantic Role Labeling with Unified Labels
Polyglot: Multilingual Semantic Role Labeling with Unified LabelsYunyao Li
 
Automatic Term Ambiguity Detection
Automatic Term Ambiguity DetectionAutomatic Term Ambiguity Detection
Automatic Term Ambiguity DetectionYunyao Li
 
Information Extraction --- An one hour summary
Information Extraction --- An one hour summaryInformation Extraction --- An one hour summary
Information Extraction --- An one hour summaryYunyao Li
 
Adaptive Parser-Centric Text Normalization
Adaptive Parser-Centric Text NormalizationAdaptive Parser-Centric Text Normalization
Adaptive Parser-Centric Text NormalizationYunyao Li
 

Mais de Yunyao Li (20)

The Role of Patterns in the Era of Large Language Models
The Role of Patterns in the Era of Large Language ModelsThe Role of Patterns in the Era of Large Language Models
The Role of Patterns in the Era of Large Language Models
 
Building, Growing and Serving Large Knowledge Graphs with Human-in-the-Loop
Building, Growing and Serving Large Knowledge Graphs with Human-in-the-LoopBuilding, Growing and Serving Large Knowledge Graphs with Human-in-the-Loop
Building, Growing and Serving Large Knowledge Graphs with Human-in-the-Loop
 
Meaning Representations for Natural Languages: Design, Models and Applications
Meaning Representations for Natural Languages:  Design, Models and ApplicationsMeaning Representations for Natural Languages:  Design, Models and Applications
Meaning Representations for Natural Languages: Design, Models and Applications
 
Towards Deep Table Understanding
Towards Deep Table UnderstandingTowards Deep Table Understanding
Towards Deep Table Understanding
 
Explainability for Natural Language Processing
Explainability for Natural Language ProcessingExplainability for Natural Language Processing
Explainability for Natural Language Processing
 
Explainability for Natural Language Processing
Explainability for Natural Language ProcessingExplainability for Natural Language Processing
Explainability for Natural Language Processing
 
Towards Universal Language Understanding
Towards Universal Language UnderstandingTowards Universal Language Understanding
Towards Universal Language Understanding
 
Explainability for Natural Language Processing
Explainability for Natural Language ProcessingExplainability for Natural Language Processing
Explainability for Natural Language Processing
 
Towards Universal Language Understanding (2020 version)
Towards Universal Language Understanding (2020 version)Towards Universal Language Understanding (2020 version)
Towards Universal Language Understanding (2020 version)
 
Towards Universal Semantic Understanding of Natural Languages
Towards Universal Semantic Understanding of Natural LanguagesTowards Universal Semantic Understanding of Natural Languages
Towards Universal Semantic Understanding of Natural Languages
 
An In-depth Analysis of the Effect of Text Normalization in Social Media
An In-depth Analysis of the Effect of Text Normalization in Social MediaAn In-depth Analysis of the Effect of Text Normalization in Social Media
An In-depth Analysis of the Effect of Text Normalization in Social Media
 
Exploiting Structure in Representation of Named Entities using Active Learning
Exploiting Structure in Representation of Named Entities using Active LearningExploiting Structure in Representation of Named Entities using Active Learning
Exploiting Structure in Representation of Named Entities using Active Learning
 
K-SRL: Instance-based Learning for Semantic Role Labeling
K-SRL: Instance-based Learning for Semantic Role LabelingK-SRL: Instance-based Learning for Semantic Role Labeling
K-SRL: Instance-based Learning for Semantic Role Labeling
 
Coling poster
Coling posterColing poster
Coling poster
 
Coling demo
Coling demoColing demo
Coling demo
 
Natural Language Data Management and Interfaces: Recent Development and Open ...
Natural Language Data Management and Interfaces: Recent Development and Open ...Natural Language Data Management and Interfaces: Recent Development and Open ...
Natural Language Data Management and Interfaces: Recent Development and Open ...
 
Polyglot: Multilingual Semantic Role Labeling with Unified Labels
Polyglot: Multilingual Semantic Role Labeling with Unified LabelsPolyglot: Multilingual Semantic Role Labeling with Unified Labels
Polyglot: Multilingual Semantic Role Labeling with Unified Labels
 
Automatic Term Ambiguity Detection
Automatic Term Ambiguity DetectionAutomatic Term Ambiguity Detection
Automatic Term Ambiguity Detection
 
Information Extraction --- An one hour summary
Information Extraction --- An one hour summaryInformation Extraction --- An one hour summary
Information Extraction --- An one hour summary
 
Adaptive Parser-Centric Text Normalization
Adaptive Parser-Centric Text NormalizationAdaptive Parser-Centric Text Normalization
Adaptive Parser-Centric Text Normalization
 

Último

CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Story boards and shot lists for my a level piece
Enterprise Search in the Big Data Era: Recent Developments and Open Challenges

  • 1. Enterprise Search in the Big Data Era Yunyao Li Ziyang Liu Huaiyu Zhu IBM Research - Almaden NEC Labs IBM Research - Almaden
  • 2. 1 Enterprise Search • Providing intuitive access to an organization’s various digital content 1 Report Find • IDC report [IDC 05] • $5k/person/year wasted salary due to poor search • 9-10hr/person/week doing search • unsuccessful 1/3-1/2 of the time • Butler Group [Edwards 06] • 10% of salary cost wasted through ineffective search • Accenture survey [Accenture 07] • Middle managers spend 2 hr/day searching • >50% of what they find has no value • Hawking, Enterprise Search, http://david-hawking.net/pubs/ModernIR2_Hawking_chapter.pdf • [IDC 05] “The enterprise workplace: How it will change the way we work”. IDC Report 32919 • [Edwards 06] www.butlergroup.com/pdf/PressReleases/ESRReportPressRelease.pdf • [Accenture 07] http://newsroom.accenture.com/article_display.cfm?article_id=4484
  • 3. 2 Magic Search from User’s Point of View Results 1 …………….. 2 ………. 3 …………….. 4 ………. ………… …………… INTRODUCTION SEARCH
  • 4. 3 What Happens Behind the Scene Backend Collect data Analyze data Index data Frontend Serve user queries Return results Index Data Source INTRODUCTION SEARCH
  • 5. 4 How Does a Query Match a Document? Index Document ……………….. ………… ………… … ….. …….. ………………… ………… Document ……………….. ………… ………… … ….. …….. ………………… ………… Results Doc 1 ……….. Doc 2 ……. Doc 3 ………….. Doc 4 ………. ………… …………… Analyze query Present results Analyze document Search index Build index INTRODUCTION SEARCH
  • 6. 5 Search Is More Than Keyword Match • Specific features in documents are important – Title, URL, person name, product, actions, … • Features combine to form higher-level concepts – In document: home page + person → personal homepage – Cross document: URL link analysis, … • The string representation in a document may not match that in the user query – Person name: Bill Clinton ↔ William Jefferson Clinton • User queries may be ambiguous – Multiple interpretations • Presenting the results to the user – Ranking, grouping, interactive refinement INTRODUCTION SEARCH
  • 7. 6 Internet vs Enterprise – Web data [Fagin WWW2003] Internet Enterprise Creation of content • Democratic • Appealing to reader • Links → approval • Bureaucratic • Conform to mandate • Links → internal structure Relevant query results • Large number • Overlapping information • Reasonable subset suffices • Ranking is more universal • Small number • Specific function • Specific pages required • Ranking is relative to query Spamming • Spam-infested • Ranking can only be based on external authority • Mostly spam-free • Ranking based on content or metadata is reliable Search engine friendliness • Web pages designed to be search results • Web page → document • Documents not designed to be search results • Special treatment INTRODUCTION ENTERPRISE VS INTERNET
  • 8. 7 Internet vs Enterprise – Big Data Internet Enterprise Content being searched • Sources: Web crawl • Formats: html, xml, pdf, … • Variety of sources • Variety of formats: • Email, database, application-specific access and formats Search queries / expected results • Target: web pages, office documents • Expect list of documents • Expect little personalization • Return result directly • Target: rows, figures, experts, ... • Expect customized results • Personalization required: geography, access, … • Customize results Related information • Links → approval • Small number of domain-specific knowledge sources • Generic analysis • Links → organization structure • Large number of dynamic domain-specific knowledge sources • Highly specialized analysis Skill set of search admins • Large number of admins • Search experts • Facilitate update of search algorithms • Small number of admins • Domain experts • Facilitate use of domain knowledge INTRODUCTION ENTERPRISE VS INTERNET
  • 9. 8 Search Engine Components Backend Collect data Analyze data Store and index data Admin System performance Search quality control/improvement Frontend Interpret user query Search index Present results Interact with user index Data source INTRODUCTION TUTORIAL OVERVIEW COMPONENTS
  • 10. 9 Search Engine Architecture Backend Collect data Analyze data Store and index data Admin System performance Search quality control Frontend Interpret user query Search index Present results Interact with user index Data source
  • 11. 10 Main Backend Functions Analysis (Understand) Information extraction Analyze and transform data Indexing (Prepare for search) Generate terms suitable for matching queries Index search terms index Document Ingestion (Collect) Collect all the data to be searched Transform and store as documents Local Analysis (in-document analysis) Global Analysis (cross-document analysis)
  • 12. 11 Backend Section Outline • Overview • Data Ingestion • Local analysis • Global analysis • Indexing
  • 13. 12 Typical analytics pipeline S1={f11, f12, …} S2={f21, f22, …} S3={f31, f32, …} G1 = {g1, …} G2 = {g2, g3, …} LA GA Idx DI Data ingestion • Collect data • Transform to uniform document format • Store in document store Local analysis • Information extraction from each document Global analysis • Cross-document analysis • Rank, group, merge, and filter documents Indexing • Generate search terms • Index documents by search terms index BACKEND OVERVIEW
  • 14. 13 Digression: Classical IR S1={f11, f12, …} S2={f21, f22, …} S3={f31, f32, …} G1 = {g1, …} G2 = {g2, g3, …} LA GA Idx DI Data ingestion • Given set of files Local analysis • Tokenize • Stop-word removal • Stemming • Form n-grams Global analysis • Calculate statistics of terms in documents Indexing • Generate search terms • Index by terms with statistics index BACKEND OVERVIEW
  • 15. 14 Digression: Classical Web search S1={f11, f12, …} S2={f21, f22, …} S3={f31, f32, …} G1 = {g1, …} G2 = {g2, g3, …} LA GA Idx DI Data ingestion • Crawl web pages Local analysis • Extract out-links Global analysis • Calculate eigenvalues of connection matrix Indexing • Generate search terms • Index documents by search terms, with PageRank index BACKEND OVERVIEW
  • 16. 15 Demands of Enterprise Search S1={f11, f12, …} S2={f21, f22, …} S3={f31, f32, …} G1 = {g1, …} G2 = {g2, g3, …} LA GA Idx DI Data ingestion • Handle variety of sources • Handle variety of formats • Deal with access policy • Deal with update policy Local analysis • Incorporate domain knowledge • Extract rich set of semantics • Categorize documents Global analysis • Cross-document analysis • Rank, group, merge, and filter documents Indexing • Generate search terms • Index documents by search terms index BACKEND OVERVIEW
  • 17. 16 • Efficient incremental updates – Fast turnaround time for updates • System performance and reliability – Scaling with data size and resources available – Fault tolerance • Ease of administration and quality improvement – Allow search admins to customize domain-specific configurations BACKEND OVERVIEW CHALLENGES / OPPORTUNITIES Desiderata of backend
  • 18. 17 Backend Section Outline • Overview • Data Ingestion • Local analysis • Global analysis • Indexing
  • 19. 18 Data Ingestion BACKEND DATA INGESTION Crawl/push from a variety of sources (Web, DB, App, email + attachments, PDF files) Convert to document / convert to text Store in Doc. Store Support update & retention policy
  • 20. 19 Document-centric View • Data as a collection of documents – Document as unit of storage and search result. – Three major components • Unique document identifier in the whole system • Metadata fields: url, date, language, … • Content field: text to be searched • Representation of data of different structures – Web pages → Each page is a document – Relational data → Each row is a document – Hierarchical data → Each node is a document BACKEND DATA INGESTION
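The document-centric view maps naturally onto a small record type. A minimal sketch in Python (the `Document` class, field names, and example docids are our own illustration, not part of the tutorial):

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Unit of storage and of search results."""
    docid: str                                    # unique across the whole system
    metadata: dict = field(default_factory=dict)  # url, date, language, ...
    content: str = ""                             # text to be searched

# A relational row and a web page reduce to the same shape:
row_doc = Document(docid="db:orders:42",
                   metadata={"table": "orders", "date": "2014-08-27"},
                   content="customer ACME ordered 12 units")
page_doc = Document(docid="web:0001",
                    metadata={"url": "http://enterprise.com/hr/", "language": "en"},
                    content="HR benefits home page")
```

Because every source is flattened into this one shape, the same analysis and indexing pipeline can process rows, pages, and hierarchy nodes uniformly.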
  • 21. 20 Push vs Pull Pull Push Definition • Search engine initiates transfer of data • (Web crawler) • Content owner initiates transfer of data • (Apps with push notice) Advantage • Operated by search engine • Use standard crawlers • Can handle special access methods • Easy to adjust refresh rate • Easy to handle special formats Disadvantage • Difficult to access special data sources • Difficult to adjust domain-specific treatment • Need synchronization with content owner Applicability • Prevalent for Internet • Also useful for enterprise • Rare for Internet • Very important for enterprise BACKEND DATA INGESTION
  • 22. 21 Transform the Data • Format conversion – Convert content to text: pdf, doc, … • Keep as much structure as possible • Metadata conversion – Obtain and transform metadata: HTTP headers, DB table metadata, … • Merge /split documents – One-to-many: Zip file, email thread, attachments – Many-to-one: social tags merge to original doc BACKEND DATA INGESTION
  • 23. 22 Storage options Options Pro Con SQL database • Traditional RDBMS strengths • Support insert, update, delete, and fielded queries • Too much system overhead Indexing engine (Lucene) • Closer to document-centric view • Supports insert, delete, fielded queries • No direct in-document update • Needs special treatment for distributed processing NoSQL databases • Lightweight • Sufficient for simple use • May lack features in the future • Transactions? BACKEND DATA INGESTION Issues to consider • In-document update • Access/Retention policy • Parallel processing
  • 24. 23 Backend Section Outline • Overview • Data Ingestion • Local analysis • Global analysis • Indexing
  • 25. 24 Local Analysis • Annotating pages – Extract structured elements: title, header, … – Extract features for people, projects, communities, … – Extract features for cross-document analysis. • Categorizing pages – Label by standard categories • Language, geography, date, … – Label pages by custom categories • IBM examples: HR, person, IT help, ISSI, sales information, marketing, corporate standards, legal & IP-law, … Local analysis is essentially information extraction BACKEND LOCAL ANALYSIS
  • 26. 25 Rule-based IE ML-based IE PRO • Declarative • Easy to comprehend • Easy to maintain • Easy to incorporate domain knowledge • Easy to debug • Trainable • Adaptable • Reduces manual effort CON • Heuristic • Requires tedious manual labor • Requires labeled data • Requires retraining for domain adaptation • Requires ML expertise to use or maintain • Opaque (not transparent) BACKEND LOCAL ANALYSIS INFORMATION EXTRACTION Rule-based vs. Learning-based IE
  • 27. 26 Commercial Vendors (2013) NLP Papers (2003-2012) 100% 50% 0% 3.5% 21% 75% Rule- Based Hybrid Machine Learning Based 45% 22% 33% Large Vendors 67% 17% 17% All Vendors • GATE Information Extraction • IBM InfoSphere BigInsights • Microsoft FAST • SAP HANA • SAS Text Analytics • HP Autonomy • Attensity • Clarabridge Example Industrial Systems Source: [CLR2013] Rule-based Information Extraction is Dead! Long Live Rule-based Information Extraction Systems!, EMNLP 2013 BACKEND LOCAL ANALYSIS INFORMATION EXTRACTION Landscape of Entity Extraction Implementations
  • 28. 27 Intranet page NavPanel Extraction NavPanels Self link identification Title Extraction Matching title patterns Titles Dictionary Match Person name dictionary Person name in title Title Extraction Matching title patterns Titles Title Name URL Extraction URLs Matching URL patterns URL Name Person name dictionary = employee directory IBM Global Services Security Home IBM Global Services Security G J Chaitin Home Page G J Chaitin 1. http://w3-03.ibm.com/marketing/ 2. http://w3-03.ibm.com/isc/index.html 3. http://chis.at.ibm.com/ 1. marketing 2. isc 3. chis BACKEND LOCAL ANALYSIS EXAMPLES [Zhu et al., WWW’07] Local analysis for different features
  • 28. 27 Consolidation – Example: Document language consolidation • HTTP header Accept-Language: en-us,en;q=0.5 • Meta tags <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> • Document text encoding • URL http://enterprise.com/hr/benefits/us/ca/ BACKEND LOCAL ANALYSIS TRANSFORMATIONS
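Consolidation combines several noisy per-document signals into one language value. A toy sketch, assuming a fixed priority order among the signals (the order, function name, and "unknown" fallback are our own assumptions for illustration):

```python
def consolidate_language(meta_tag=None, http_header=None,
                         url_path=None, text_guess=None):
    """Pick a document language from several signals in priority order:
    meta tag, then HTTP header, then URL hints, then guess from text."""
    for signal in (meta_tag, http_header, url_path, text_guess):
        if signal:
            return signal
    return "unknown"

# Accept-Language "en-us,en;q=0.5" -> primary tag "en-us"
header_lang = "en-us,en;q=0.5".split(",")[0]
print(consolidate_language(http_header=header_lang, url_path="en"))
```

A production consolidator would weigh conflicting signals rather than taking the first non-empty one, but the first-match rule is enough to show the idea.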
  • 30. 29 Backend Section Outline • Overview • Data Ingestion • Local analysis • Global analysis • Indexing
  • 31. 30 Global Analysis • Deduplication – Save resources, reduce result clutter • Identify root of URL hierarchy – Used for result grouping and ranking • Anchor text analysis – Assign external labels to documents • Social tagging analysis – Assign tags and their weights to documents • Identify different versions of the same document – Due to variations in date, language, … • Enterprise-specific global analysis – When certain documents co-exist, do this … • … BACKEND GLOBAL ANALYSIS
  • 32. 31 Shingle-based deduplication (Leskovec, http://www.mmds.org/) S1={s1, s2, …} S2={s1, s3, …} S3={s2, s3, …} {h1(S1), h2(S1), …} {h1(S2), h2(S2), …} {h1(S3), h2(S3), …} Shingles: • Character or token n-grams • Possibly stemmed • Possibly related to stop words Minhash: • Maps sets to integers • Based on a permutation of the universal set Jaccard similarity: | A∩B | / | A∪B | Theorem: The probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets More diverse set of documents. More precise. BACKEND GLOBAL ANALYSIS DEDUPLICATION
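The shingle/minhash pipeline can be sketched in a few lines. This is an illustrative implementation only: seeded MD5 hashing stands in for the random permutations of the theorem, and real systems use cheaper hash families and many more hash functions.

```python
import hashlib

def shingles(text, n=3):
    """Token n-grams of a document (here, word 3-shingles)."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    """One minimum per seeded hash approximates one random permutation."""
    return [min(int(hashlib.md5((str(seed) + s).encode()).hexdigest(), 16)
                for s in shingle_set)
            for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing minhashes estimates |A ∩ B| / |A ∪ B|."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = shingles("the quick brown fox jumps over the lazy dog")
b = shingles("the quick brown fox jumps over the lazy cat")
print(estimated_jaccard(minhash_signature(a), minhash_signature(b)))
```

Near-duplicate pairs are then those whose estimated similarity exceeds a threshold; with 64 hashes the estimate for the two sentences above should be close to their true Jaccard similarity of 0.75.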
  • 33. 32 Metadata-based deduplication (IBM Gumshoe search engine) S1=[h11, h12, …] S2=[h21, h22, …] S3=[h31, h32, …] G1 = {S1, …} G2 = {S2, S3, …} Significant metadata: • Document title • Section headers • Signatures from URL Ensure that all similar candidates have the same signature Group by signature In-group similarity analysis: • Analyze documents within candidate groups More customizable for intranet. Less cost. BACKEND GLOBAL ANALYSIS DEDUPLICATION
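The group-by-signature step can be sketched as follows; the metadata fields chosen for the signature (`title`, `url_stem`) are hypothetical, not Gumshoe's actual choice:

```python
from collections import defaultdict

def signature(doc):
    """Cheap signature from significant metadata. All truly similar
    documents must map to the same signature (no false negatives)."""
    return (doc.get("title", "").strip().lower(), doc.get("url_stem", ""))

def candidate_groups(docs):
    """Bucket documents by signature; only multi-document buckets need
    the expensive in-group similarity analysis."""
    groups = defaultdict(list)
    for d in docs:
        groups[signature(d)].append(d)
    return [g for g in groups.values() if len(g) > 1]

docs = [{"title": "Travel Policy", "url_stem": "hr/travel"},
        {"title": "travel policy ", "url_stem": "hr/travel"},
        {"title": "IT Help", "url_stem": "it/help"}]
print(candidate_groups(docs))  # one group: the two travel-policy docs
```

Compared with shingling every document, this only pays for detailed comparison inside small candidate buckets, which is the "less cost" trade-off on the slide.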
  • 34. 33 URL Root Analysis (Zhu et al., WWW’07) host1/b/a/~user1/pub host1/b/a host1/b/a/~user1/ host1/b/c host1/b/a/x_index.htm/ host1/b/c/d host1/b/c/home.html host1/b/c/d/e/index.html?a=us host1/b/c/d/e/index.html?a=uk host1/b/c/d/e/index.html • Given a set of documents all with the same value V of feature X. • E.g., at one time all webpages from the IBM Tucson site had the same title • Find the roots of the URL forest. These will be the preferred results for query X=V. • E.g., when searching for “Tucson home page”, only the IBM Tucson homepage will match. BACKEND GLOBAL ANALYSIS ROOT ANALYSIS
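Finding the roots of the URL forest amounts to keeping every URL that has no proper path ancestor in the set. A sketch using a simplified string-prefix test (a production system would parse URLs and strip query strings properly):

```python
def url_roots(urls):
    """Roots of the URL forest: URLs with no proper path ancestor in the set."""
    norm = {u.rstrip("/") for u in urls}   # treat ".../x/" and ".../x" alike
    return [u for u in sorted(norm)
            if not any(u != v and u.startswith(v + "/") for v in norm)]

urls = ["host1/b/a", "host1/b/a/~user1", "host1/b/a/~user1/pub",
        "host1/b/c", "host1/b/c/d", "host1/b/c/home.html"]
print(url_roots(urls))  # ['host1/b/a', 'host1/b/c']
```

For the query X=V, only these root pages would be offered, which is how "Tucson home page" can return one homepage instead of every page on the site.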
  • 35. 34 Label Assignment (Zhu et al., WWW’07) BACKEND GLOBAL ANALYSIS LABEL ASSIGNMENT Document B ……………….. … …… ………… … … …….. ………… ……… …… …… Document A1 ……………….. … X home … ………… … … …….. ………… ……… …… ……Document A2 ……………….. … X home … ………… … … …….. ………… ……… …… …… Bookmark C1 X home Anchor text global analysis: • Assign label “X” and / or “Y” based on frequency Bookmark C2 X Bookmark C3 Y home Document A2 ……………….. … X home … ………… … … …….. ………… ……… …… …… Social tagging global analysis: • Assign label “X home”, “X”, and “Y home” based on frequency
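Frequency-based label assignment from anchor texts and social tags can be sketched with a counter; the 50% threshold is an assumption for illustration, not a value from the paper:

```python
from collections import Counter

def assign_labels(incoming_texts, min_fraction=0.5):
    """Assign a document the labels (anchor texts / bookmark tags) that
    account for at least min_fraction of all labels pointing at it."""
    counts = Counter(incoming_texts)
    total = sum(counts.values())
    return [label for label, c in counts.most_common()
            if c / total >= min_fraction]

# Anchor texts and bookmarks pointing at one document:
incoming = ["X home", "X home", "X", "Y home"]
print(assign_labels(incoming))  # ['X home']
```

The threshold suppresses idiosyncratic labels ("Y home" above) while keeping the consensus label, which then becomes an external search term for the document.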
  • 36. 35 Entity Integration using HIL Entity Population Rules • Create entities (from raw records, other entities, and links) • Clean, normalize, aggregate, fuse Various data sources Information Extraction Entity Resolution Fuse Aggregate Entity Integration Entity Resolution Rules • Create links between raw records or entities Map Unstructured Data Unified entities Defines entity types (the logical data model of the integration flow) (SQL-like) rules to specify the integration logic Raw Records HIL [Hernández et al, EDBT’13] Declarative IE (IBM SystemT) [Chiticariu et al, ACL 2010] Optimizing compiler to Big Data runtime (Jaql and Hadoop) BACKEND GLOBAL ANALYSIS ENTITY INTEGRATION
  • 37. 36 Backend Section Outline • Overview • Data Ingestion • Local analysis • Global analysis • Indexing
  • 38. 37 Indexing • Generate and index search terms, to be matched by terms generated at runtime from user queries. • Challenges: – Extracted terms do not match user query terms • Morphological changes, synonyms, … – Importance of term depends on query • Needs for bucketing of indexes, … – Support of incremental indexing BACKEND INDEXING
  • 39. 38 Term normalization • Example: Date time normalization – Given any of these Wed Aug 27 10:06:11 PDT 2014 27 Aug 2014, 10:06:11 2014-08-27T10:06:11-07:00 27 Aug 2014 1409133971 – Normalize to 2014-08-27T10:06:11-07:00 – Other examples: Person names, product names, … BACKEND INDEXING TERM NORMALIZATION
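One way to implement this normalization is to try a list of known input formats and emit a canonical ISO 8601 string. A sketch (the format list and the fallback time zone are our own assumptions; the slide's epoch-seconds and `PDT`-abbreviated variants would need extra handling not shown here):

```python
from datetime import datetime, timezone, timedelta

FORMATS = ["%d %b %Y, %H:%M:%S",    # 27 Aug 2014, 10:06:11
           "%Y-%m-%dT%H:%M:%S%z",   # 2014-08-27T10:06:11-07:00
           "%d %b %Y"]              # 27 Aug 2014

def normalize(raw, default_tz=timezone(timedelta(hours=-7))):
    """Try each known format; index every variant under one canonical form."""
    for fmt in FORMATS:
        try:
            dt = datetime.strptime(raw, fmt)
        except ValueError:
            continue
        if dt.tzinfo is None:               # naive -> assume a default zone
            dt = dt.replace(tzinfo=default_tz)
        return dt.isoformat()
    raise ValueError("unrecognized date: " + raw)

print(normalize("27 Aug 2014, 10:06:11"))  # 2014-08-27T10:06:11-07:00
```

Because indexing and query analysis run the same normalizer, any of the surface forms matches any other at search time.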
  • 40. 39 Why Generate Variant Terms? • Extracted feature string ≠ query string – People names • Document: John Doe Search: Doe, John Search: J Doe – Acronym expansions • gts → Global Technology Services – N-gram variant generation • Title: reimbursement of travel expenses • Terms: reimbursement, travel expenses, reimbursement travel, reimbursement of travel, reimbursement expenses • Normalization is not a sufficient solution – People names • Document: John Doe → J. Doe Search: Jean Doe → J. Doe • These are not supposed to match • Solution: – Generate variant terms with different levels of approximation. BACKEND INDEXING VARIANT TERM GENERATION
  • 41. 40 Configurable Term Generation • Configuration knobs determine the set of outputs • Given “Mr. John (Jack) M. Doe Jr.” – Configuration1: Initial=both, Dot: with, NickName: both, MiddleName: both, NameSuffix: without, Title: without, Comma:both John M. Doe Doe, John M. John Doe Doe, John J. M. Doe Doe, J. M. J. Doe Doe, J. Jack M. Doe Doe, Jack M. Jack Doe Doe, Jack – Configuration2 (normalization): Initial=without, Dot: without, NickName: without, MiddleName: without, NameSuffix:without, Title: without, Comma: without John Doe BACKEND INDEXING VARIANT TERM GENERATION
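One fixed configuration of the variant generator can be sketched as follows. The function and its behavior are our own simplification, covering only the first-name/nickname/initial, middle-initial, and comma knobs of the slide's Configuration 1:

```python
def name_variants(first, middle_initial, last, nickname):
    """Emit indexable variants of a person name: first name, nickname, or
    initial; with or without middle initial; direct or comma-inverted."""
    variants = set()
    for fname in {first, nickname, first[0] + "."}:
        for mid in {middle_initial + ".", ""}:
            given = " ".join(p for p in (fname, mid) if p)
            variants.add(given + " " + last)          # e.g., "John M. Doe"
            variants.add(last + ", " + given)         # e.g., "Doe, John M."
    return sorted(variants)

print(name_variants("John", "M", "Doe", "Jack"))
```

Each configuration yields a different approximation level: the full set above indexes loose matches, while a normalization-style configuration would emit only "John Doe".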
  • 42. 41 Enterprise Search Backend S1={f11, f12, …} S2={f21, f22, …} S3={f31, f32, …} G1 = {g1, …} G2 = {g2, g3, …} LA GA Idx DI Data ingestion • Access various sources • Document transform • Format transform Local analysis • Information extraction • Configurable Global analysis • Deduplication • URL root analysis • Label assignment • … Indexing • Generate search terms index BACKEND RECAP
  • 43. 42 Search Engine Architecture Backend Collect data Analyze data Store and index data Admin System performance Search quality control Frontend Interpret user query Search index Present results Interact with user index Data source
  • 44. Serving User Queries at Front End (52) 1. Ambiguity (29) 2. Ranking (3) 3. Representation (6) 4. Expert Search (6) 5. Privacy (8)
  • 45. 44 1. Ambiguity • Optimal keywords may not be used. – Misspelled • “datbase” – Under-specified • polysemy: “java” • too general: “database papers” – Over-specified: • synonyms, acronyms, abbreviations & alternative names: “green card” ≡ “permanent residency” • too specific: “MS Office 2007 for Mac x64 edition” – Non-quantitative: • “small laptop” query cleaning query autocompletion query refinement query rewriting query rewriting
  • 46. 45 Summary of Solutions • query cleaning – correct various types of spelling errors • query autocompletion – prevent spelling errors. • query refinement – making queries more specific, returning fewer results. • query rewriting – making queries more general / on-topic, returning more relevant results. • query forms – enabling users to specify precise queries FRONTEND AMBIGUITY
  • 47. 46 Graph-based Spelling Correction (bao acl 11) • Repartition the query. – Each partition (token) should be plausible: confidence (correcting it) > threshold. – confidence: linear combination of multiple scores, parameters learned from SVM. • Domain knowledge is often used in calculating confidence. • For each partition, generate candidate corrections with high scores. “enterpricsea rch” “enterpricse arch” “enterpric search” “enter pric search” etc. price: 0.8 prim: 0.6 etc. pric QUERY CLEANING UNSTRUCTURED DATAFRONTEND AMBIGUITY “enterpricsea rch”
  • 48. 47 Graph-based Spelling Correction (bao acl 11) • Build a graph that connects candidate corrections. • Each full path is a candidate query. – Find k top-weighted full paths enterprise enter price prim arc sea rich search 1. correction score (node weight) 2. merge penalty (node weight) 3. split penalty (edge weight) enterprise → search enter → price → sea → rich e.g., weights QUERY CLEANING UNSTRUCTURED DATAFRONTEND AMBIGUITY price: 0.8 prim: 0.6 etc. pric “enterpricsea rch”
  • 49. 48 Graph-based Spelling Correction (bao acl 11) • Weight doesn’t consider term correlations. • Calculate a score for each path – Score includes term correlations. • This ensures the cleaned query has good quality results. • Correlations are computed based on number of co- occurrences. • Finally returns paths with high scores. e.g., correlation(“enterprise search”) > correlation (“enterprise arc”) QUERY CLEANING UNSTRUCTURED DATAFRONTEND AMBIGUITY e.g., “enterprise search” vs. “enterprise arc”
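The path-scoring step (node confidences plus term correlations) can be sketched by enumerating paths through a tiny candidate graph. The candidate corrections, scores, and correlation values below are made-up illustrations, and a real system searches top-k weighted paths instead of enumerating all of them:

```python
from itertools import product

# Hypothetical per-token candidate corrections with confidence scores.
candidates = {"enterpric": [("enterprise", 0.9), ("enter", 0.4)],
              "search":    [("search", 1.0), ("sea", 0.3)]}

# Hypothetical co-occurrence-based correlation between adjacent terms.
correlation = {("enterprise", "search"): 0.8, ("enter", "sea"): 0.1}

def best_correction(tokens):
    """Score every path through the candidate graph: sum of per-token
    confidences plus correlations of adjacent terms; return the best."""
    best, best_score = None, float("-inf")
    for path in product(*(candidates[t] for t in tokens)):
        words = [w for w, _ in path]
        score = sum(s for _, s in path)
        score += sum(correlation.get(pair, 0.0)
                     for pair in zip(words, words[1:]))
        if score > best_score:
            best, best_score = words, score
    return " ".join(best)

print(best_correction(["enterpric", "search"]))  # enterprise search
```

The correlation term is what rules out fluent-looking but incoherent paths such as "enterprise arc", mirroring the paper's use of co-occurrence statistics.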
  • 50. 49 XClean (lu icde 11) – based on the noisy channel model that finds the intended word given the user’s input word. – results on XML are subtrees rooted at entity nodes. • A result quality score is calculated for each entity node in T, and then aggregated. • e.g., if Johnny and Mike work in the same department, then “Johnn, Mike” → “Johnny, Mike” rather than “John, Mike”. – processes each word individually, i.e., no merge or split. Query Cleaning on Relational Data: Pu VLDB 08 related department head Johnny employees … QUERY CLEANING STRUCTURED DATA FRONTEND AMBIGUITY
  • 51. 50 Summary of Solutions • query cleaning – correct various types of spelling errors • query autocompletion – prevent spelling errors. • query refinement – making queries more specific, returning fewer results. • query rewriting – making queries more general / on-topic, returning more relevant results. • query forms – enabling users to specify precise queries FRONTEND AMBIGUITY
  • 52. 51 Query Autocompletion Problem Space Dimensions showing keywords vs. showing results single keyword vs. multiple keyword exact matching vs. fuzzy matching QUERY AUTOCOMPLETIONFRONTEND AMBIGUITY
  • 53. 52 Problem Space Dimensions showing keywords vs. showing results single keyword vs. multiple keyword exact matching vs. fuzzy matching Error-Tolerating Autocompletion (chaudhuri sigmod 09) desr desert dessert deserve QUERY AUTOCOMPLETIONFRONTEND AMBIGUITY
• 54. 53 Error-Tolerating Autocompletion (chaudhuri sigmod 09) data contains “search”, “sand” and “text”; max. edit distance = 1 [trie figure: the three words in a shared-prefix trie, with matching nodes highlighted for no input, input: s, input: se, input: sen] Showing results instead of keywords can be achieved by associating inverted lists to trie nodes. trie QUERY AUTOCOMPLETION FRONTEND AMBIGUITY
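The behavior in the figure can be reproduced with a flat sketch: accept a word when some prefix of it is within the edit-distance bound of the typed query. This is only the matching condition; the actual system walks the trie incrementally, maintaining "active nodes" per keystroke, rather than rescanning the dictionary.

```python
def fuzzy_complete(words, query, max_dist=1):
    """Words for which some prefix is within `max_dist` edits of the
    typed query -- a flat sketch of the trie-based idea from the paper;
    a real system keeps per-keystroke active trie nodes instead."""
    out = []
    for w in words:
        # DP row: edit distance between `query` and each prefix of w
        prev = list(range(len(query) + 1))
        for i, c in enumerate(w, 1):
            cur = [i]
            for j, q in enumerate(query, 1):
                cost = 0 if c == q else 1
                cur.append(min(cur[-1] + 1, prev[j] + 1, prev[j - 1] + cost))
            prev = cur
            if prev[-1] <= max_dist:  # prefix w[:i] is close enough
                out.append(w)
                break
    return out
```

On the slide's data, typing "sen" still reaches "search" and "sand" (one edit each) but not "text".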
  • 55. 54 Tastier(li vldbj 11) Problem Space Dimensions showing keywords vs. showing results single keyword vs. multiple keyword exact matching vs. fuzzy matching “have a nni” show results for “have a nice day” QUERY AUTOCOMPLETIONFRONTEND AMBIGUITY
• 56. 55 Tastier(li vldbj 11) • Trie-based (similar to the previous paper). – Trie leaf nodes are associated with inverted lists. • To handle multiple keywords: – Each record/document is associated with a sorted list of the words in it (forward lists). • so that a binary search can determine whether a string appears in a record/document as a prefix. • why not hash? Because we need to match prefixes, not whole words. • Inverted list intersections are computed incrementally using cache for improved efficiency. “have a nice day” “a, day, have, nice” example forward list QUERY AUTOCOMPLETION FRONTEND AMBIGUITY
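The "why not hash" point above is exactly why the forward list is kept sorted: a binary search can answer prefix membership, which a hash set cannot. A minimal sketch:

```python
import bisect

def has_prefix(forward_list, prefix):
    """True if any word in a record's sorted forward list starts with
    `prefix`. Binary search lands on the first word >= prefix; if that
    word doesn't start with the prefix, nothing does."""
    i = bisect.bisect_left(forward_list, prefix)
    return i < len(forward_list) and forward_list[i].startswith(prefix)
```

For the slide's record "have a nice day" with forward list `["a", "day", "have", "nice"]`, the typed prefix "ni" matches via "nice".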
  • 57. 56 Phrase Prediction(nandi vldb 07) Problem Space Dimensions showing keywords vs. showing results single keyword vs. multiple keyword exact matching vs. fuzzy matching QUERY AUTOCOMPLETIONFRONTEND AMBIGUITY a nice have a nice day
  • 58. 57 Phrase Prediction(nandi vldb 07) • Suggest phrases given the user input phrase. – Need to find a good length of a suggested phrase. • Too short: utility is small. • Too long: low chance of being accepted. • (modified) suffix tree-based. – Each node is a word, rather than a letter. – Why not use trie: phrases have no definitive starting point. A phrase may start in the middle of a sentence (i.e., start at a suffix of the sentence), hence suffix tree. • Significant phrases. laptop have a nice day QUERY AUTOCOMPLETIONFRONTEND AMBIGUITY
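The suffix-tree motivation above (phrases may start mid-sentence) can be illustrated with a flat counting version: index word n-grams starting at every sentence position, then suggest the most frequent continuations of the typed words. The paper's actual structure is a modified suffix tree with a significance measure for choosing phrase length; this sketch uses raw frequency only.

```python
from collections import Counter

def build_phrase_index(sentences, max_len=4):
    """Count word n-grams starting at every position (i.e., at every
    sentence suffix) -- the reason a suffix structure is needed."""
    counts = Counter()
    for s in sentences:
        words = s.split()
        for i in range(len(words)):
            for j in range(i + 1, min(i + 1 + max_len, len(words) + 1)):
                counts[tuple(words[i:j])] += 1
    return counts

def suggest(counts, prefix_words, k=2):
    """Most frequent indexed phrases extending the typed words."""
    p = tuple(prefix_words)
    cands = [(c, g) for g, c in counts.items()
             if len(g) > len(p) and g[:len(p)] == p]
    cands.sort(key=lambda x: (-x[0], x[1]))
    return [" ".join(g) for _, g in cands[:k]]
```

Typing "a nice" against a small corpus surfaces "a nice day" even though the phrase never starts a sentence.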
  • 59. 58 Summary of Solutions • query cleaning – correct various types of spelling errors • query autocompletion – prevent spelling errors. • query refinement – making queries more specific, returning fewer results. • query rewriting – making queries more general / on-topic, returning more relevant results. • query forms – enabling users to specify precise queries FRONTEND AMBIGUITY
  • 60. 59 Query Refinement • Motivation – Some under-specified queries on large data corpus have too many results. – Ranking cannot always be perfect. • Approaches – Identifying important terms in results (structured/unstructured) – Clustering results (structured/unstructured) – Faceted search (structured) FRONTEND AMBIGUITY QUERY REFINEMENT
  • 61. 60 Using Clustered Results (liu pvldb 11) All suggested queries are about programming language. It is desirable to refine an ambiguous query by its distinct meanings. “Java” FRONTEND AMBIGUITY QUERY REFINEMENT
  • 62. 61 • → Input: clustered results – clustering method is irrelevant. – e.g., the result of “Java” may have 3 clusters corresponding to Java language, Java island, and Java tea. • ← Output: one refined query for each cluster. Each refined query: – maximally retrieves the results in its cluster (recall) – minimally retrieves the results not in its cluster (precision) Using Clustered Results (liu pvldb 11) FRONTEND AMBIGUITY QUERY REFINEMENT
  • 63. 62 Using Important Terms in Results (tao edbt 09) • For relational data only. • Given a keyword query, it outputs top-k most frequent non-keyword terms in the results, without generating the results. – Avoiding result generation is possible since the terms are ranked only by frequency: tradeoff of quality and efficiency. Data Clouds (for structured data): Koutrika EDBT 09 (more sophisticated term ranking, but needs to generate query results first.) related FRONTEND AMBIGUITY QUERY REFINEMENT
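The frequency-only ranking that enables the paper's efficiency tradeoff is simple to state over materialized results (the paper's contribution is computing the same answer without generating them):

```python
from collections import Counter

def frequent_terms(results, query_terms, k=3):
    """Top-k most frequent non-query terms across result texts.
    A sketch of the ranking criterion only; the paper avoids
    materializing `results` in the first place."""
    q = set(query_terms)
    counts = Counter(w for text in results for w in text.lower().split()
                     if w not in q)
    return [t for t, _ in counts.most_common(k)]
```

For a query "java" whose results mention the island twice, "island" would be offered as a refinement term.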
• 64. 63 Faceted Search all location: Sunnyvale, CA location: Phoenix, AZ location: Amherst, MA department: data management department: machine learning 1. How to select facets and facet conditions at each level, to minimize the user’s expected navigation cost? 2. How to rank facets and facet conditions? challenges Chakrabarti SIGMOD 04 Kashyap CIKM 10 FRONTEND AMBIGUITY QUERY REFINEMENT
  • 65. 64 Summary of Solutions • query cleaning – correct various types of spelling errors • query autocompletion – prevent spelling errors. • query refinement – making queries more specific, returning fewer results. • query rewriting – making queries more general / on-topic, returning more relevant results. • query forms – enabling users to specify precise queries FRONTEND AMBIGUITY
  • 66. 65 Query Rewriting • Motivation – Synonyms, alternative names: “green card” vs “permanent residency”. – Too specific: “MS Office 2007 for Mac x64 edition” – Non-quantitative: “small laptop” • Approaches – Using query/click logs – Finding rewriting rules from missing results • e.g., replace “green card” with “permanent residency”. – Using “differential queries” FRONTEND AMBIGUITY QUERY CLEANING
  • 67. 66 Using Query and Click Logs (cheng icde 10) The availability of query and click logs can be used to assess ground truth. query Q query log click log synonyms hypernyms hyponyms of Q “query” “search” synonym “MySQL” “database” hypernym “database” “MySQL” hyponym find and return historical queries whose “ground truth” (via click log) significantly overlaps with top-k results of Q. idea FRONTEND AMBIGUITY QUERY CLEANING
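The core matching step on this slide can be sketched as a set-overlap test between the current query's top-k results and each historical query's clicked documents. The overlap measure and threshold below are illustrative, not the paper's exact scoring.

```python
def related_queries(top_results, click_log, threshold=0.5):
    """Historical queries whose clicked documents ('ground truth' from
    the click log) significantly overlap the current query's top
    results, ranked by overlap."""
    top = set(top_results)
    hits = []
    for past_q, clicked in click_log.items():
        overlap = len(top & set(clicked)) / max(len(clicked), 1)
        if overlap >= threshold:
            hits.append((past_q, overlap))
    return sorted(hits, key=lambda x: -x[1])
```

A further classification step (not shown) decides whether each related query is a synonym, hypernym, or hyponym.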
  • 68. 67 Automatic Suggestion of Rewriting Rules from Missing Results (bao sigir 12) • Challenges for automatically generating rewriting rules: – rules should be semantically natural. – a new rule designed for one query may eliminate good results of another query. FRONTEND AMBIGUITY QUERY CLEANING “green card” result d is missing / should be ranked higher result d contains phrase “permanent residency” rewriting rule: green card → permanent residency
  • 69. 68 → Input: query q, missed desirable results d ← Output: selected set of rules Generate candidate rules L → R. • L: n-grams in q. • R: n-grams in high- quality fields of d. Identify semantically natural rules by machine learning. Greedily select a subset of rules that maximizes the overall query quality. Automatic Suggestion of Rewriting Rules from Missing Results (bao sigir 12) FRONTEND AMBIGUITY QUERY CLEANING green card → permanent residency green card → federal government
  • 70. 69 Keyword++ (Entity Databases) (xin pvldb 10) “small IBM laptop” ID Product Name BrandName Screen Size Description 1 ThinkPad E545 Lenovo 15 The IBM laptop...small business… 2 ThinkPad X240 Lenovo 12 This notebook... To “understand” a term, compare two queries that differ on this term, and analyze the differences of attribute value distributions in the results. idea e.g., to understand term “IBM”, we can compare the results of “IBM laptop” vs. “laptop”. FRONTEND AMBIGUITY QUERY CLEANING
  • 71. 70 Suppose: “IBM laptop” → 50 results, 30 having “brand: Lenovo” “laptop” → 500 results, only 50 having “brand: Lenovo” The difference on “brand: Lenovo” is significant, reflecting the meaning of “IBM”. IBM brand: Lenovo small order by size ASC Offline: compute the best mapping for all terms in query log Online: compute the best segmentation of the query (DP). “laptop” “small laptop” likewise: Keyword++ (Entity Databases) (xin pvldb 10) FRONTEND AMBIGUITY QUERY CLEANING
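The differential-query comparison above reduces to comparing attribute-value distributions between the two result sets and picking the value whose share rises most. A simplified sketch (the paper uses a proper statistical score rather than a raw difference of proportions):

```python
from collections import Counter

def term_mapping(results_with_term, results_without, attr):
    """Map an opaque query term to an attribute value by comparing the
    value's share of results with vs. without the term -- e.g.,
    'IBM laptop' vs. 'laptop' for attribute 'brand'."""
    def dist(results):
        c = Counter(r[attr] for r in results)
        n = len(results) or 1
        return {v: c[v] / n for v in c}
    with_t = dist(results_with_term)
    without_t = dist(results_without)
    return max(with_t, key=lambda v: with_t[v] - without_t.get(v, 0.0))
```

In the slide's numbers, "brand: Lenovo" jumps from 10% to 60% of results when "IBM" is added, so the term maps to that attribute value.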
  • 72. 71 Summary of Solutions • query cleaning – correct various types of spelling errors • query autocompletion – prevent spelling errors. • query refinement – making queries more specific, returning fewer results. • query rewriting – making queries more general / on-topic, returning more relevant results. • query forms – enabling users to specify precise queries FRONTEND AMBIGUITY
  • 73. 72 Offline: how many query forms, and which query forms, should be generated? • Too many – hard to find the relevant forms. • Too few – limiting query expressiveness. Online: how to identify query forms relevant to users’ search needs? Query Forms Enabling users to issue precise structured queries without mastering structured query languages. advantage challenges Baid SIGMOD 09 Jayapandian PVLDB 08 Ramesh PVLDB 11 Tang TKDE 13 FRONTEND AMBIGUITY QUERY FORMS
  • 74. Serving User Queries at Front End (52) 1. Ambiguity (29) 2. Ranking (3) 3. Representation (6) 4. Expert Search (6) 5. Privacy (8)
  • 75. 74 2. Ranking Ranking Method Categories Unstructured Data • represents queries and documents using vectors • each component is a term; the value is its weight • ranking score = similarity (query vector, result vector) Structured Data • a document → a node or a result (subgraph/subtree) vector space model proximity based ranking … authority based ranking … FRONTEND RANKING
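The vector space model above, in its rawest form: term-frequency vectors scored by cosine similarity. Real engines weight components with TF-IDF or BM25 rather than plain counts; this is the bare mechanism only.

```python
import math
from collections import Counter

def cosine(query, doc):
    """Cosine similarity between raw term-frequency vectors of a query
    and a document (no TF-IDF weighting, for clarity)."""
    q = Counter(query.lower().split())
    d = Counter(doc.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    nq = math.sqrt(sum(v * v for v in q.values()))
    nd = math.sqrt(sum(v * v for v in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0
```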
  • 76. 75 2. Ranking Ranking Method Categories Unstructured Data • proximity of keyword matches in a document can boost its ranking. Structured Data • weighted tree/graph size, total distance from root to each leaf, semantic distance, etc. vector space model … authority based ranking … proximity based ranking FRONTEND RANKING
  • 77. 76 2. Ranking Ranking Method Categories vector space model … … Unstructured Data • nodes linked by many other important nodes are important. Structured Data • authority may flow in both directions of an edge • different types of edges in the data (e.g., entity-entity edge, entity-attribute edge) may be treated differently. proximity based ranking authority based ranking FRONTEND RANKING
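The authority idea on this slide is a plain PageRank-style iteration; the structured-data variants mentioned above additionally let authority flow both ways along an edge and weight edge types differently. A minimal sketch of the basic iteration:

```python
def authority_scores(links, iters=50, d=0.85):
    """PageRank-style iteration: each node splits its score among its
    out-links, so nodes linked by many important nodes become
    important. Assumes every node has at least one out-link."""
    nodes = set(links) | {v for vs in links.values() for v in vs}
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for u, outs in links.items():
            if outs:
                share = d * score[u] / len(outs)
                for v in outs:
                    new[v] += share
        score = new
    return score
```

In a tiny graph where both a and b link to c, c ends up with the highest authority, and a (linked by c) outranks the unlinked b.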
  • 78. Serving User Queries at Front End (52) 1. Ambiguity (29) 2. Ranking (3) 3. Representation (6) 4. Expert Search (6) 5. Privacy (8)
  • 79. 78 3. Representation • Enterprise corpus can be much more heterogeneous than a collection of documents or web pages. • Different searches may have different types: retrieving a document, a figure, a tuple, a subgraph, analytical keyword queries, etc. Result diversification Result summarization Result differentiation solutions FRONTEND REPRESENTATION
  • 80. 79 Result Diversification • Result diversification is essentially the same problem as query refinement. – e.g., Java → Java language, Java tea, Java island. • Same techniques apply. FRONTEND REPRESENTATION DIVERSIFICATION
  • 81. 80 Result Summarization • Unstructured data: lots of work on text summarization in machine learning, natural language processing and IR communities. • Structured data: – Size-l object summary (Relational) – Result snippet (XML) Das, CMU 07 (unpublished) Nenkova, Mining Text Data 12 surveys FRONTEND REPRESENTATION SUMMARIZATION
  • 82. 81 Size-l Object Summary (fakas pvldb 11) ……Mike…… first window “Mike” unstructured Mike paper paper patent patent… conference John … … … … … … ? structured FRONTEND REPRESENTATION SUMMARIZATION
  • 83. 82 Size-l Object Summary (fakas pvldb 11) • Each tuple has: – a static importance score. • similar idea as PageRank – a run-time relevance score. • distance to result root • connectivity properties to result root • Objective: find a connected snippet of the result, which consists of l tuples and has the maximum score. • Dynamic programming based solution. Result snippet for XML: Liu TODS 10 related FRONTEND REPRESENTATION SUMMARIZATION
  • 84. 83 Result Differentiation Result 1 Result 2 event: year 2000 2012 paper: title OLAP data mining cloud scalability search “NEC Labs Open House” result 1: a large table with many people / papers / posters result 2: a large table with many people / papers / posters … results result differentiation vs. comparing different credit cards on a bank website: only with pre-defined features. FRONTEND REPRESENTATION DIFFERENTIATION
• 85. 84 4. Expert Search documents in which a candidate and a topic co-occur topics near a candidate in a document problem solving / ticket routing history user’s knowledge on a topic • expert should be more knowledgeable social relationship between expert and user • problem solving is usually more effective if the expert has a close social relationship with the user external corpus • many employees publish externally, e.g., papers, blogs. ways for judging an expert Find an expert within an enterprise to solve a particular problem. goal FRONTEND EXPERT SEARCH
  • 86. 85 Classical Methods • Builds a feature vector for each expert using various evidence • Ranks experts based on query, using traditional retrieval models candidate model • First finds documents related to query, then locates experts in documents • Mimics the process a human takes. document model Balog CIKM 08 survey FRONTEND EXPERT SEARCH
  • 87. 86 User-Oriented Model (smirnova ecir 11) Users prefer experts who: are more knowledgeable than themselves. knowledge gain: p(e|q) – p(u|q) have a close social relationship with themselves. time-to-contact: shortest path department head John employees … e = expert u = user FRONTEND EXPERT SEARCH
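The two criteria on this slide, knowledge gain and time-to-contact, can be combined into a single score. The sketch below uses a BFS shortest path for time-to-contact and a hypothetical linear combination; the actual model's probabilities and weighting differ.

```python
from collections import deque

def rank_experts(expertise, user, topic, graph):
    """Rank experts by knowledge gain p(e|topic) - p(u|topic), penalized
    by social distance to the user (shortest path in `graph`). The 0.1
    distance weight is an illustrative choice, not from the paper."""
    def dist(a, b):
        seen, frontier = {a}, deque([(a, 0)])
        while frontier:
            n, d = frontier.popleft()
            if n == b:
                return d
            for m in graph.get(n, []):
                if m not in seen:
                    seen.add(m)
                    frontier.append((m, d + 1))
        return float("inf")
    scores = {}
    for e in expertise:
        if e == user:
            continue
        gain = expertise[e].get(topic, 0.0) - expertise[user].get(topic, 0.0)
        scores[e] = gain - 0.1 * dist(user, e)
    return sorted(scores, key=scores.get, reverse=True)
```

A slightly less knowledgeable expert who is socially closer can outrank the most knowledgeable one, which is the model's point.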
  • 88. 87 Using Web Search Engine (santos inf. process. manage. 11) query q result from intranet web query q’ result from internetformulate web query search intranet corpus combine candidate’s full name: “Jeff Smisek” organization’s name: “IBM” terms in q: “data integration” excluding results from organization: “-site:ibm.com” FRONTEND EXPERT SEARCH
  • 89. 88 Ticket Routing (shao kdd 08) new ticket: DB2 login failure transferred to group A transferred to group B transferred to group C resolved How to find the best group and reduce problem solving time? Markov chain model Using only previous routing history (not ticket content) FRONTEND EXPERT SEARCH
• 90. 89 Ticket Routing (shao kdd 08) Pr(g|S) probability of routing a ticket to group g given previous groups S Pr(g|S) includes the probability that: • g can solve the ticket • g can correctly re-route the ticket. Train the Markov chain model from ticket routing history. FRONTEND EXPERT SEARCH
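A first-order simplification of the model can be trained directly from routing sequences: estimate transition probabilities between groups and route each ticket to the most probable unvisited group. The paper conditions on the set of groups seen so far, not just the current one; this sketch keeps only the last group.

```python
from collections import Counter, defaultdict

def train_routing(histories):
    """Estimate Pr(next group | current group) from past ticket routing
    sequences (first-order Markov simplification)."""
    trans = defaultdict(Counter)
    for seq in histories:
        for a, b in zip(seq, seq[1:]):
            trans[a][b] += 1
    return {g: {h: c / sum(cnt.values()) for h, c in cnt.items()}
            for g, cnt in trans.items()}

def next_group(model, current, visited):
    """Route to the most probable group not yet tried."""
    options = {g: p for g, p in model.get(current, {}).items()
               if g not in visited}
    return max(options, key=options.get) if options else None
```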
  • 91. Serving User Queries at Front End (52) 1. Ambiguity (29) 2. Ranking (3) 3. Representation (6) 4. Expert Search (6) 5. Privacy (8)
• 92. 91 5. Privacy It is sometimes desirable that the search engine doesn’t know which documents a user wants to retrieve. • For users: privacy • For enterprises: avoiding liability user privacy While a search engine answers individual keyword searches, there are methods that perform multiple searches and, from the answers, piece together aggregate information about the underlying corpus. • Enterprises may not want to disclose such information to all users. data privacy
  • 93. 92 User Privacy Private Information Retrieval (PIR) • old topic, tons of theoretical papers Modifying search engine. e.g., • forcing it to forget user activities • embellishing queries with decoy terms (Pang PVLDB 10) Using ghost queries to obfuscate user intention (Pang ICDE 12) • no change to search engine • light-weight solutions It is sometimes desirable that the search engine doesn’t know which documents a user wants to retrieve. • For users: privacy • For enterprises: avoiding liability user privacy
• 94. 93 Private Information Retrieval (PIR) • Idea: retrieve more documents than needed. • Naïve: retrieve the entire corpus. • How to minimize the number of retrieved & unneeded documents? • Tons of theoretical papers on different variations of the problem, e.g., – different computation power of the search engine – different number of non-communicating corpus replicas. Gasarch EATCS Bulletin 2004 survey
  • 95. 94 Ghost Queries (pang icde 12) • Challenges – Generate ghost queries on topics different from user’s topics of interest, and make it difficult for the search engine to infer user’s topics. – Ghost queries need to be meaningful/realistic, so that they cannot be easily identified. generate ghost queries ghost queries discard ghost query results results submit to search engine user query
  • 96. 95 Ghost Queries (pang icde 12) • (e1, e2) privacy model – Given a user query, if the probability of a topic increases more than e1, it should be reduced to below e2 by the ghost queries. • Topics are predefined. • A ghost query must be coherent: all words in the ghost query should describe common or related topics. • Randomized algorithm based solution.
• 97. 96 Data Privacy While a search engine answers individual keyword searches, there are methods that perform multiple searches and, from the answers, piece together aggregate information about the underlying corpus. • Enterprises may not want to disclose such information to all users. data privacy inserting dummy tuples OR randomly generating attribute values • only applicable to structured data disallowing certain queries OR return snippets • search quality loss altering a small number of results: adding dummy results; modifying results, hiding some results (Zhang SIGMOD 12) solutions FRONTEND PRIVACY
• 98. 97 Aggregate Suppression (zhang sigmod 12) • Example: consider corpora A and B. – A: n documents – B: 2n documents – A ⊂ B • Goal: suppress COUNT(*), i.e., adversary cannot tell which corpus is larger. • Naïve approach 1: deterministically remove n documents from B. – achieves the goal, but with search utility loss: those n documents can never be retrieved. • Naïve approach 2: randomly drop half of the results at run time. – no search utility loss, but fails to achieve the goal: a clever adversary can still get the information. FRONTEND PRIVACY
  • 99. 98 Aggregate Suppression (zhang sigmod 12) • Algorithm ideas – carefully adjusting query degree (number of documents matched by a query) and document degree (number of queries matching a document) by document hiding at run-time. – decline a query if its result can be covered by a small number of previous queries. Return previous query results instead. FRONTEND PRIVACY
  • 100. 99 Backend Collect data Analyze data Store and index data Admin System performance Search quality control/improvement Admin System performance Search quality control/improvement Frontend Interpret user query Search index Present results Interact with user index Data source Tutorial Outline
• 101. 100 Enterprise Search Administrators • Main responsibilities – Care and feeding of an enterprise search solution • Monitor intranet help inboxes and respond to requests. • Assist in troubleshooting intranet issues for content contributors • Core skills required – Understand general corporate business processes – Experience in coordinating activities and managing relationships • with employees, content administrators, stakeholders, IT teams and external agencies Search Admin Search administrators ≠ IR experts Key Observation Admin Overview
• 102. 101 What Does a Search Administrator Need? Bad results for query … I’m missing the golden URL… Result 22 should be ranked much higher! Enterprise Users Query Logs Query “global campus” seems unsatisfying • Understand overall search quality • Overall trend • YOY change • By segmentation • Understand individual search results • Why a certain result is or isn’t brought back • Its ranking • Maintain search quality • Underlying data evolves • Terminology changes • Policy/Business Process changes • Organization changes • Hot topics Search Admin Admin Overview
• 103. 102 Understand Search Quality (Google Search analytics) Admin Examples
  • 104. 103 Understand Search Quality (Google Search analytics) Admin Examples
• 105. 104 What Does a Search Administrator Need? Bad results for query … I’m missing the golden URL… Result 22 should be ranked much higher! Enterprise Users Query Logs Query “global campus” seems unsatisfying • Understand overall search quality • Overall trend • YOY change • By segmentation • Understand individual search results • Why a certain result is or isn’t brought back • Its ranking • Maintain search quality • Underlying data evolves • Terminology changes • Policy/Business Process changes • Organization changes • Hot topics Search Admin Admin Examples
• 106. 105 Gumshoe Search Quality Toolkit (bao cikm 12) Admin Examples
• 107. 106 Gumshoe Search Quality Toolkit (bao cikm 12) Understand individual query Admin Examples
• 108. 107 Gumshoe Search Quality Toolkit (bao cikm 12) Examine search results Admin Examples
• 109. 108 Gumshoe Search Quality Toolkit (bao cikm 12) Understand why a result is returned Admin Examples
• 110. 109 Gumshoe Search Quality Toolkit (bao cikm 12) Understand the ranking of the result Admin Examples
• 111. 110 Gumshoe Search Quality Toolkit (bao cikm 12) Investigate a desired result Admin Examples
• 112. 111 Gumshoe Search Quality Toolkit (bao cikm 12) Suggest rewrite rules Admin Examples
• 113. 112 Gumshoe Search Quality Toolkit (bao cikm 12) Edit runtime rules Admin Examples
  • 114. Enterprise Search in the Big Data Era Case Study: IBM Intranet Search
• 115. 114 Experience at IBM Internal Search • IBM deployed a commercially available search engine – Implementing standard IR techniques • Search quality went down over time to the point that search results were unacceptable! Success (≥ 1 relevant result): 14% on top-1, 23% on top-5, 34% on top-50! [Zhu et al., WWW’07] So, they implemented various solutions… To the administrators managing the engine, the exposed control knobs were insufficient Case Study Background
• 116. 115 Attempts to Improve Search • Enhanced link analysis by incorporating links to/from the external WWW • Creative hacks: added fake terms to documents & queries – # terms per document determined by “popularity”: how much TF increase required for needed rank boost? • Hard-coded custom results for the top 1200+ queries Didn’t help… Quality went down! Maintenance nightmare: Heuristic needs to be updated upon each nontrivial change in term stats./ranking parameters Even bigger nightmare! How to deal with continuously changing terminology? Case Study Background
• 117. 116 Goals of Gumshoe Network Station Manager search Thin Client Manager Product names change: Continually changing terminology Domain-specific meaning Paula Summa search bring Paula Summa from employee directories per diem search Domain-specific repetitions popcorn search conference call! • Result 1: IBM Travel: Per Diem • Result 2: IBM Travel: Per Diem Rates • Result 3: IBM Travel: National perdiems • Result 25: IBM Travel: Per Diem Policy … Gumshoe: • Generic search solution, customizable & maintainable in many domains – Simple customization with reasonable effort – Ongoing search-quality management • Philosophy: programmable search Case Study Background
• 118. 117 Programmable Search: Main Idea • Goals: – Transparency • Know “precisely” why every result item is being brought back • Understand how changes in content/intents affect search – Maintainability and “Debugability” • Ranking logic is guided by explicit rules • Properly react to changes in content/intents • Building blocks: – Deep analytics on documents – Domain-specific analysis of queries – Transparent customizable rule-driven ranking runtime rules backend analytics interpretations Case Study Background
• 119. 118 Distributed Analytics Platform (IBM InfoSphere BigInsights) Crawling, information extraction, token generation (TG), indexing Search runtime Index Index and rule update services backend analytics runtime rules interpretations backend frontend Implementation Architecture Case Study Background
  • 120. 119 Backend Analytics: 3 Parts Local Analysis (per-page analysis) Global Analysis (cross-page analysis) Token Generation (TG) index Case Study Background
  • 121. 120 Local Analysis • Categorizing pages – Label pages by custom categories • IBM examples: HR, person, IT help, ISSI, sales information, marketing, corporate standards, legal & IP-law, … – Geo classification • Associate documents with the relevant countries & regions • Annotating pages – Identify HomePage annotation for people, projects, communities, … Simply knowing where a page is physically hosted is not enough (example: Czech Republic hosts all pages for IBM in Europe) Case Study Backend Local Analysis
• 122. 121 • Declarative approach – Define an operator for each basic operation • Input: tuple of annotations • Output: tuples of annotations – Compose operators to build complex extractors • Algebraic expression • One document at a time → trivial parallelism. • Benefits of declarative approach: – Expressivity: Richer, cleaner rule semantics – Performance: Better performance through optimization Declarative IE System Case Study Backend Local Analysis
• 123. 122 SystemT – Overview InfoSphere Streams Cost-based optimization ... InfoSphere BigInsights SystemT Runtime Input Documents Extracted Objects SystemT IBM Engines UIMA Highly embeddable runtime AQL Extractors Embedded machine learning model AQL Rules create view SentimentForCompany as select T.entity, T.polarity from classifyPolarity(SentimentFeatures) T; create view Company as select ... from ... where ...; create view SentimentFeatures as select ... from ...; Case Study Backend Local Analysis
• 124. 123 G J Chaitin Home Page Homepage Identification Title Extraction Matching title patterns Titles Dictionary Match Home Page for G J Chaitin • http://w3.ibm.com/hr/idp/ • http://w3-03.ibm.com/isc/index.html • http://chis.at.ibm.com/ URL Extraction URLs Matching URL patterns Homepage for: idp isc chis Employee directory … many more … Intranet page [Zhu et al., WWW’07] Case Study Backend Local Analysis
• 125. 124 IBM Confidential Among the 38 pages with the exact same title, which is the best for “Paula Summa”? Role of Global Analysis Case Study Backend Global Analysis
  • 126. 125 Person Title Token Generation (TG) Annotated values Index content Ching-Tien T. (Howard) Ho Ho Ching-Tien Tien Ho Ho, Tien Howard Ho Ching-Tien H. ... Global Technology Services TG Howard Ho Ching Tien ... gts Global Technology Services Global Technology Technology Services Global Technology ... GlobalTechnologyServices nGramTG spaceTG …… … … … Case Study Backend Token Generation
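The token generators named on this slide (spaceTG, nGramTG, acronymTG) can be sketched as small functions over an annotated value. The generator names come from the slides; the exact token sets they emit are an assumption, chosen to reproduce the slide's "Global Technology Services" example.

```python
def space_tg(value):
    """spaceTG: split the annotated value on whitespace."""
    return value.split()

def ngram_tg(value, n=2):
    """nGramTG: contiguous word n-grams, plus a concatenated form
    (so 'GlobalTechnologyServices' is also searchable)."""
    words = value.split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return grams + ["".join(words)]

def acronym_tg(value):
    """acronymTG: first letters of the words, lowercased."""
    return "".join(w[0] for w in value.split()).lower()
```

Applied to the Title annotation "Global Technology Services", these yield the index tokens shown on the slide, including the acronym "gts".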
  • 127. 126 3 Phases of Runtime Flow Search Query Phase 1: Query Semantics • Rewrite rules • Query interpretation Phase 2: Relevance Ranking By relevance buckets + conventional IR Phase 3: Result Construction • Grouping rules • Re-ranking rules Case Study Frontend
  • 128. 127 Phase 3: Result Construction Phase 2: Relevance Ranking Phase 1: Query Semantics query search rewrite rules queries interpretations partially ordered interpretations interpretations execution partially ordered results result aggregation ordered results grouping rules ordered & grouped results final results re-ranking rules Runtime Flow in More Details Case Study Frontend
• 129. 128 Runtime Rules: Pattern-Action Language (Fagin 2012) Query Pattern Queries Matching Possible Action EQUALS [r=ibm|information|info] [d=COUNTRY] • ibm germany • info india Rewrite into “[country] hr” (e.g., germany hr) ENDS_WITH installation • acrobat installation • db2 on aix installation Replace installation with ISSI (e.g., acrobat ISSI) CONTAINS directions to [d=SITE] • driving directions to almaden • directions to watson from jfk Pages of “siteserv” category should be ranked higher STARTS_WITH [d=PERSON] • john kelly biography • steve mills announcement Group together pages that represent blog entries Query pattern: pattern expression, matched against the keyword query. Action: performed when the pattern matches. • Similar to the query-template rules of Agarwal et al. [WWW 2010] Query Semantics Case Study Frontend
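A minimal matcher conveys how the pattern side of these rules works: literal alternations like `ibm|information|info` match a word directly, while `[d=TYPE]` slots match any word carrying that entity annotation. This is a hypothetical simplification of the Fagin 2012 language, not its actual grammar or engine.

```python
def match_rule(pattern_type, pattern_words, query, entities):
    """Match one pattern against a keyword query. `entities` maps a
    word to the set of entity types annotated on it (e.g.,
    {'germany': {'COUNTRY'}}). Pattern words are either alternations
    ('ibm|information|info') or typed slots ('[d=COUNTRY]')."""
    words = query.lower().split()
    n = len(pattern_words)

    def fits(qs):
        if len(qs) != n:
            return False
        for p, w in zip(pattern_words, qs):
            if p.startswith("[d="):
                if p[3:-1] not in entities.get(w, set()):
                    return False
            elif w not in p.split("|"):
                return False
        return True

    if pattern_type == "EQUALS":
        return fits(words)
    if pattern_type == "STARTS_WITH":
        return fits(words[:n])
    if pattern_type == "ENDS_WITH":
        return fits(words[-n:])
    if pattern_type == "CONTAINS":
        return any(fits(words[i:i + n]) for i in range(len(words) - n + 1))
    return False
```

With COUNTRY annotations on "germany" and "india", the EQUALS rule from the table fires on "ibm germany" but not on "ibm almaden", and the ENDS_WITH rule fires on "acrobat installation".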
• 130. 129 What’s Best for Benefits? Query Semantics Case Study Frontend
• 131. 130 The most important IBM page for benefits changes over time: currently it is netbenefits What’s Best for Benefits? Query Semantics Case Study Frontend
  • 132. 131 Rewrite Rules benefits netbenefits Query SemanticsCase Study Frontend
  • 133. 132 Rewrite Rules benefits netbenefits interpretations partially ordered interpretations interpretations execution partially ordered results result aggregation ordered results grouping rules ordered & grouped results final results re-ranking rules benefits, netbenefits benefits netbenefits rewrite rules queries benefits search Query SemanticsCase Study Frontend
• 134. 133 People with first name Jim How can we avoid pages from the people category? java jim Complex Rules Query Semantics Case Study Frontend
• 135. 134 Complex Rules java jim and not in person category Query Semantics Case Study Frontend
• 136. 135 Complex Rules java jim and not in person category interpretations execution partially ordered results result aggregation ordered results grouping rules ordered & grouped results final results re-ranking rules interpretations partially ordered interpretations rewrite rules queries java search Query Semantics Case Study Frontend
• 137. 136 Interpretations Scenario: An IBM employee wants to download Lotus Symphony 1.3 Runtime interpretation: download symphony 1.3 category=issi software=symphony 1.3 Query Semantics Case Study Frontend
• 138. 137 Complex Rules java jim and not in person category interpretations execution partially ordered results result aggregation ordered results grouping rules ordered & grouped results final results re-ranking rules interpretations partially ordered interpretations rewrite rules queries java search Query Semantics Case Study Frontend
  • 139. 138 3 Phases of Runtime Flow Search Query Phase 1: Query Semantics • Rewrite rules • Query interpretation Phase 2: Relevance Ranking By relevance buckets + conventional IR Phase 3: Result Construction • Grouping rules • Re-ranking rules Relevance RankingCase Study Frontend
  • 140. 139 Person Title Recall: Token Generation (TG) Annotated values Index content Ching-Tien T. (Howard) Ho Global Technology Services TG Howard Ho Ching Tien ... gts Global Technology Services Global Technology Technology Services Global Technology ... GlobalTechnologyServices nGramTG spaceTG …… … … … Ho Ching-Tien Tien Ho Ho, Tien Howard Ho Ching-Tien H. ...Person + personNameTG Person + nGramTG Title + acronymTG Title + spaceTG Title + nGramTG Relevance RankingCase Study Frontend
• 141. 140 Annotation + TG Relevance Bucket Howard Ho Ching Tien ... GlobalTechnologyServices …… Person + personNameTG Person + nGramTG Title + acronymTG Title + spaceTG Title + nGramTG query search Relevance buckets • Buckets are ranked – Based on annotation type – Based on TG quality • A page can belong to multiple buckets • Within each bucket, ranking is by conventional IR …… Relevance Ranking Case Study Frontend
  • 142. 141 Ranking by Relevance Buckets grouping rules ordered & grouped results final results re-ranking rules interpretations partially ordered interpretations rewrite rules queries interpretations execution partially ordered results result aggregation ordered results employment verification search Relevance RankingCase Study Frontend
  • 143. 142 3 Phases of Runtime Flow Search Query Phase 1: Query Semantics • Rewrite rules • Query interpretation Phase 2: Relevance Ranking By relevance buckets + conventional IR Phase 3: Result Construction • Grouping rules • Re-ranking rules Result ConstructionCase Study Frontend
  • 144. 143 Grouping Rules • Grouping rules define how search results should be grouped together • Search administrators can improve the diversity of search results on the first page, based on their familiarity with the data sources. Example rule: query pattern "per diem, travel, you-and-ibm" (match ANY) → group pages of the same category: ISSI, IT Help Central, Forum, Bluepedia, Media Library, … Result Construction / Case Study / Frontend
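A grouping rule of this kind can be sketched as collapsing same-category results when the query matches a configured pattern. The rule table and category names below are illustrative stand-ins for what an administrator would configure:

```python
# Sketch of a grouping rule: for matching queries, keep only the top
# page per groupable category so the first result page stays diverse.
# Rule contents are illustrative, not an actual Gumshoe configuration.
GROUPING_RULES = {
    "per diem": {"ISSI", "IT Help Central", "Forum"},  # groupable categories
}

def apply_grouping(query, results):
    # results: list of (doc_id, category), already relevance-ordered.
    groupable = GROUPING_RULES.get(query, set())
    grouped, seen = [], set()
    for doc_id, category in results:
        if category in groupable:
            if category in seen:
                continue  # collapse later duplicates from this category
            seen.add(category)
        grouped.append((doc_id, category))
    return grouped

results = [("a", "Forum"), ("b", "Forum"), ("c", "ISSI"), ("d", "Blog")]
print(apply_grouping("per diem", results))
# [('a', 'Forum'), ('c', 'ISSI'), ('d', 'Blog')]
```

Categories outside the rule (here "Blog") pass through untouched, so grouping only reshapes the portion of the result list the administrator targeted.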
  • 145. 144 Flooding with Similar Pages: need first-page diversity. Result Construction / Case Study / Frontend
  • 146. 145 Grouping Rule to the Rescue: per diem, travel, you-and-ibm. Result Construction / Case Study / Frontend
  • 147. 146 Grouping Rule to the Rescue: example query "per diem" (pattern: per diem, travel, you-and-ibm). Flow: queries → rewrite rules → interpretations → partially ordered interpretations → interpretations execution → partially ordered results → result aggregation → ordered results → grouping rules → ordered & grouped results → re-ranking rules → final results. Result Construction / Case Study / Frontend
  • 148. 147 Re-ranking Rules • Re-ranking rules adjust the ranking of search results based on categories • Example: the search administrator specifies the important sources of hot/current topics. Rule: hot topics (smarter planet, cloud computing, centennial, …) → rank these categories higher: Bluepedia, News, About-IBM. Result Construction / Case Study / Frontend
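A re-ranking rule like this can be sketched as a score boost applied to administrator-chosen categories when the query matches a hot topic. The topic set, categories, and boost value below are hypothetical; only the rule shape comes from the slide:

```python
# Sketch of a category re-ranking rule for hot-topic queries.
# Topic list, category list, and boost magnitude are illustrative.
HOT_TOPICS = {"smarter planet", "cloud computing", "centennial"}
BOOSTED_CATEGORIES = {"Bluepedia", "News", "About-IBM"}

def rerank(query, results, boost=1.0):
    # results: list of (doc_id, category, base_score).
    def adjusted(r):
        doc_id, category, score = r
        if query in HOT_TOPICS and category in BOOSTED_CATEGORIES:
            score += boost  # promote pages from the preferred sources
        return -score       # negate for descending sort
    return sorted(results, key=adjusted)

results = [("d1", "Forum", 0.9), ("d2", "News", 0.5), ("d3", "Bluepedia", 0.4)]
print([r[0] for r in rerank("cloud computing", results)])  # ['d2', 'd3', 'd1']
print([r[0] for r in rerank("java", results)])             # ['d1', 'd2', 'd3']
```

For non-hot-topic queries the rule is a no-op and the base relevance order survives, which keeps the adjustment narrowly scoped to the queries the administrator cares about.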
  • 149. 148 Re-ranking Rule for Hot Topics: hot topics (smarter planet, cloud computing, centennial, …) → rank these categories higher: Bluepedia, News, About-IBM. Example sources: Bluepedia, Technical News, homepages of "About IBM". Result Construction / Case Study / Frontend
  • 150. 149 Re-ranking Rules for Person Queries: [d=PERSON] → executive_corner, media_library, organization_chart, files. Result Construction / Case Study / Frontend
  • 152. 151 3 Phases of Runtime Flow Search Query Phase 1: Query Semantics • Rewrite rules • Query interpretation Phase 2: Relevance Ranking By relevance buckets + conventional IR Phase 3: Result Construction • Grouping rules • Re-ranking rules Case Study Frontend
  • 153. 152 What Administrators Need… • Search administrators have major problems with an opaque search engine • Programmable search provides: customization to the specific domain; ongoing search-quality management. This allows building a search-quality toolkit. Recap: Case Study / Admin
  • 154. 153 Gumshoe Search Quality Toolkit! Case Study Admin
  • 155. Demo
  • 157. 156 The Proof of the Pudding Is in the Eating • Immediate positive impact within the first 3 months – Improved the natural clickthrough rate by more than 100% – Top 5 results: selected about 90% of the time • Sustained search-quality improvements in the 4 years since going live • Stable natural search clickthrough rate. Chart: natural clickthrough rate, Gumshoe (Aug. 2011–Oct. 2011) vs. old intranet search (Aug. 2010–Aug. 2011). Case Study / Results
  • 158. 157 Summary Programmable search: simple & flexible customization; search-quality management. Backend analytics: local analysis (per-page analysis); global analysis (cross-page analysis); Token Generation (TG) [Fagin et al., PODS'10, PODS'11]. Tooling: search provenance; rule suggestion; utilization of relevance buckets [Li et al., SIGIR'06; Zhu et al., WWW'07]. Phase 1: Query Semantics • Rewrite rules • Query interpretation; Phase 2: Relevance Ranking by relevance buckets + conventional IR; Phase 3: Result Construction • Grouping rules • Re-ranking rules [Bao et al., ACL'10; SIGIR'12; CIKM'12]. Case Study / Summary
  • 159. Enterprise Search in the Big Data Era Future Directions
  • 160. 159 Search Engine Components Backend Collect data Analyze data Store and index data Admin System performance Search quality control/improvement Frontend Interpret user query Search index Present results Interact with user index Data source
  • 161. 160 Future Directions: Data Heterogeneity. Observations: a rich variety of data types needs to be searched in enterprises: docs, databases, images, videos, social graphs, etc. Questions: how to automatically identify relevant data types, and search and rank across different data types? E.g., for image search, should image-recognition techniques be incorporated into enterprise search engines? If so, how?
  • 162. 161 Future Directions: Data Freshness. Observations: new data is continuously collected and published in enterprises, sometimes at a very fast rate. Web search engines are not required to index new websites quickly, but in enterprises, new content may need to be searchable as soon as possible. Questions: how to build efficient real-time indexes to ensure data freshness in enterprise search?
  • 163. 162 Future Directions: Search Context. Observations: enterprise search users have richer profiles than web users: activities, bio, position, projects, experiences, etc. Questions: how to utilize users' contexts to provide customized results? Is it possible to predict the information a user may want, and push it to the user?
  • 164. 163 Future Directions: User Preference. Observations: different users in an enterprise have different expertise, and may prefer different ways of expressing queries; e.g., some users prefer pure keyword search, while others may want lightly structured queries. Questions: how to effectively satisfy different users' needs for expressing queries?
  • 165. 164 Future Directions: Question Answering. Observations: the purpose of many enterprise searches is to find answers to questions; e.g., what was the previous name of a product, and when did we change it to the current name? Questions: is it possible to effectively use natural language processing techniques and domain knowledge to automatically answer natural language questions?
  • 166. 165 Future Directions: Transactional Search. Observations: over 1/3 of enterprise search queries are transactional. It would be desirable for enterprise search engines to recommend business processes to accomplish a given task in response to a transactional search; e.g., given a customer's lengthy complaint letter, how to find the departments relevant to the complaints? Questions: how to better support transactional search? How to initiate a business process based on the results of a search?
  • 167. 166 Future Directions: Big Data Analytics. Observations: rich information and knowledge lie in big data, and many employees (not just data analysts) may benefit from the ability to perform analytics on the company's big data. Questions: how to build a low-cost, interactive platform that allows a large number of employees to issue analytical queries? How to give employees the capability to analyze big data if they have little knowledge of SQL or MapReduce programming?
  • 168. 167 Future Directions: Tooling for Search Quality Maintenance. Observations: most enterprise search engines have to be manually evaluated and tuned by a search administrator with domain knowledge, in an ad hoc fashion. Questions: can we automate this process, or at least minimize manual involvement? Can we fully utilize explicit user feedback? Explicit user feedback is easier to obtain in enterprise search, and there is less spam.
  • 169. Thanks. Acknowledgement: IBM Research: Sriram Raghavan, Fred Reiss, Shiv Vaithyanathan, Ron Fagin IBM CIO’s Office: Nicole Dri, Brian C. Meyer LogicBlox: Benny Kimelfeld* TripAdvisor: Adriano Crestani Campos* Facebook: Zhuowei Bao* NJIT: Yi Chen UNSW: Wei Wang * work done while at IBM