Ismail Sengor Altingovde, "Efficiency Issues for Search Engines"

Efficiency Issues for Web Search Engines:

How to Make a Search Engine Return Results in a Hundred Milliseconds or Less?

Ismail Sengor Altingovde
L3S Research Center

Research interests
• Efficiency and scalability issues for Web-IR
• Social Web: Sentiment analysis, Social ranking
• Domain-specific search engines & focused crawling
• Web information extraction
• XML querying & searching
• Recommendation systems
• OCR & IR
• Web databases

Research interests: Today
• Efficiency and scalability issues for Web-IR
– Caching
• Cost-aware result caching techniques [TWEB 2011, IPM 2012]
• Alternative result cache organizations [ECIR 2011]
• Cache freshness [SIGIR 2011, ECIR 2012]
• Regionalization & caching [SPIRE 2012]
– Static Index Pruning
• Query views in pruning algorithms [TOIS 2012]
• Correctness guarantees for pruning [in progress]

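Search: what really happens is...
[Figure: search architecture. Components shown: User, Broker, Indexer-1 … Indexer-N with an inverted index, Doc server-1, Doc server-2, …, and the documents. Numbered arrows (1 to 8) trace a query from the user through the broker, the indexers, and the document servers back to the user as a query result.]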

Data items related to search
• Data items
– Posting lists (fetched)
– Intersections of posting lists (computed)
– Query results as doc-ids (computed)
– Documents (fetched)
– Query results as pages (computed)
Cache 'em all!

Search: where caches come in?
[Figure: the same search architecture (User, Broker, Indexer-1, Indexer-2, inverted index, Doc server-1, Doc server-2, documents) with three caches added: a list cache, a result cache, and a document cache.]

Caching for Web search
• Cache content is selected according to
– Frequency
– Recency
• Cache types
– Static (for longer term access patterns)
– Dynamic (for shorter term access patterns)
– Hybrid
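
One possible reading of the hybrid cache type is sketched below: a read-only static part, assumed to be filled offline from past query logs, plus a dynamic part managed by LRU. The class name HybridCache, its methods, and the choice of LRU are illustrative assumptions, not taken from the slides.

```python
from collections import OrderedDict


class HybridCache:
    """Hypothetical hybrid result cache: fixed static part + LRU dynamic part."""

    def __init__(self, static_items, dynamic_capacity):
        self.static = dict(static_items)   # long-term popular queries, never evicted
        self.dynamic = OrderedDict()       # short-term queries, least recent first
        self.capacity = dynamic_capacity

    def get(self, query):
        if query in self.static:
            return self.static[query]
        if query in self.dynamic:
            self.dynamic.move_to_end(query)  # mark as most recently used
            return self.dynamic[query]
        return None                          # cache miss

    def put(self, query, result):
        if query in self.static:
            return                           # static entries are not overwritten
        self.dynamic[query] = result
        self.dynamic.move_to_end(query)
        if len(self.dynamic) > self.capacity:
            self.dynamic.popitem(last=False)  # evict the least recently used entry


cache = HybridCache({"hot": "r0"}, dynamic_capacity=2)
cache.put("a", "ra")
cache.put("b", "rb")
cache.get("a")         # "a" becomes the most recently used entry
cache.put("c", "rc")   # evicts "b"
```

The static part captures frequency over a long period, while the dynamic part follows recency.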

How about costs?
• Miss-costs are not “uniform”!
• Costs are inter-related!

• Both the caching strategies and the
evaluation should consider costs!

Search: caches and costs
[Figure: the search architecture with a result cache, annotated with per-step costs: C_list and C_rank at the indexers, C_doc and C_snip at the document servers.]
Cost(q) = C_list + C_rank + C_doc + C_snip

Our contribution
• Cost-aware caching for Web search
– Single-level (result) caches
– Multi-level caches

• Costs are computed on-the-fly or simulated
• Gain of caching an item depends on
– Time cost to produce or fetch (C_item)
– Storage space (S_item)

Motivating scenario: Single-level
• Result cache R
– Capacity(R) = 1 page

• Result pages for queries A and B
– Freq(A) = 10, Freq(B) = 20
– Cache result “B” for higher hit rate

• What if:
– Cost(A) = 100 ms, Cost(B) = 10 ms?
– Cache “A” for higher processing efficiency

Take costs into account while caching
(and evaluating)!
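
The arithmetic behind this scenario can be checked with a few lines. This is a sketch that assumes every occurrence of the cached query is served from the cache, so the time saved is Freq x Cost; the helper name saved_time_ms is made up for illustration.

```python
# (frequency, cost in ms) for the two queries of the scenario
queries = {"A": (10, 100), "B": (20, 10)}


def saved_time_ms(freq, cost_ms):
    """Processing time avoided if all occurrences are cache hits (assumption)."""
    return freq * cost_ms


# Choice that maximizes hit rate: the most frequent query.
best_for_hit_rate = max(queries, key=lambda q: queries[q][0])

# Choice that maximizes saved processing time.
best_for_saved_time = max(queries, key=lambda q: saved_time_ms(*queries[q]))

print(best_for_hit_rate, best_for_saved_time)
```

Caching A avoids 10 x 100 = 1000 ms of processing, while caching B avoids only 20 x 10 = 200 ms, although B yields twice as many hits.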

Motivating scenario: Multi-level
• Assume
– Freq(A) = 10, Freq(B) = 20,
– Cost(A) = 100ms, Cost(B) = 10 ms
– an additional list cache L
– all terms in query A are cached in L
– new Cost(A) = 10 ms (dropped from 100ms!)
– now it is better to cache result B

Take cost interdependencies into account
in multi-level caches.
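
A small sketch of the interdependency, under an assumed cost split: the slides give only the totals (100 ms and 10 ms), so the breakdown into c_list and c_other, the term names, and the function names are hypothetical.

```python
# Hypothetical split of each query's cost into list fetching and everything else.
queries = {
    "A": {"freq": 10, "terms": ["a1", "a2"], "c_list": 90, "c_other": 10},
    "B": {"freq": 20, "terms": ["b1"], "c_list": 5, "c_other": 5},
}


def miss_cost(name, list_cache):
    """Cost of a result-cache miss, given which posting lists are cached."""
    q = queries[name]
    lists_cached = all(term in list_cache for term in q["terms"])
    return q["c_other"] + (0 if lists_cached else q["c_list"])


def best_result_to_cache(list_cache):
    """Result whose caching saves the most time (Freq x miss cost)."""
    return max(queries, key=lambda n: queries[n]["freq"] * miss_cost(n, list_cache))


without_list_cache = best_result_to_cache(set())
with_list_cache = best_result_to_cache({"a1", "a2"})
```

Without the list cache, caching result A saves 1000 ms against 200 ms for B; once A's lists are cached, A's miss cost drops to 10 ms and B becomes the better choice (200 ms against 100 ms).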

Cost-aware result caching (RC)
• Key idea: Embed query processing cost into the result
caching strategies
• Static caching
– Knapsack problem
• Query results have values and sizes
• Greedy solution: order by Value/Size

– MostFreq Strategy (baseline)
• Value(q) = Freq(q)
• Unit space per result

Cost-aware static RC
Static cost-aware caching strategies:
• FreqThenCost
– Sort first by Freq(q) and then by Cost(q)
• StabilityThenCost
– Stability of the frequency in succeeding time intervals
• Freq&Cost
– Value(q) = Cost(q) x Freq(q)^K, K > 1
– Why Freq(q)^K?
• Queries with very low frequencies may disappear in the future
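
The Freq&Cost selection can be sketched as a greedy fill of a static cache with unit-size results. The function name, the example statistics, and the values of K are illustrative assumptions.

```python
def select_static_cache(stats, capacity, k):
    """Pick `capacity` queries with the highest Cost(q) x Freq(q)^K.

    stats maps each query to (frequency, cost in ms); every result is
    assumed to occupy one unit of space.
    """
    def value(query):
        freq, cost = stats[query]
        return cost * freq ** k

    return sorted(stats, key=value, reverse=True)[:capacity]


# Query C is very costly but was seen only once in the training log.
stats = {"A": (10, 100), "B": (20, 10), "C": (1, 5000)}

plain_product = select_static_cache(stats, capacity=1, k=1)
boosted_freq = select_static_cache(stats, capacity=2, k=3)
```

With K = 1 the singleton query C wins the only slot (value 5000 against 1000 and 200); with K = 3 the frequent queries A and B dominate, which matches the stated reason for raising the frequency to a power greater than one.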

Cost-aware dynamic RC
• Baselines: LRU, LFU
• LCU: the least costly cached item is evicted
• LFCU_K: Least Frequently and Costly Used
– Value(q) = Cost(q) x Freq(q)^K, K > 1
• GDS: Greedy Dual Size [Cao&Irani, 1997]
– H = Cost(q) + L, where L is the age factor, set to the H value of the last evicted item
• GDSF_K: Greedy Dual Size Frequency [Arlitt et al., 2000]
– H = Cost(q) x Freq(q)^K + L
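
A minimal sketch of GDSF_K eviction, assuming unit-size results and a capacity of at least one. The class and attribute names are made up for illustration; the H formula and the aging rule follow the bullets above.

```python
class GDSFCache:
    """Hypothetical GDSF_K result cache with unit-size entries."""

    def __init__(self, capacity, k=2.0):
        self.capacity = capacity
        self.k = k
        self.age = 0.0      # the aging term L
        self.entries = {}   # query -> [result, freq, cost, h_value]

    def _h(self, freq, cost):
        return cost * freq ** self.k + self.age

    def get(self, query):
        entry = self.entries.get(query)
        if entry is None:
            return None
        entry[1] += 1                           # one more hit
        entry[3] = self._h(entry[1], entry[2])  # refresh H with the current L
        return entry[0]

    def put(self, query, result, cost):
        if query not in self.entries and len(self.entries) >= self.capacity:
            victim = min(self.entries, key=lambda q: self.entries[q][3])
            self.age = self.entries[victim][3]  # L := H of the evicted item
            del self.entries[victim]
        self.entries[query] = [result, 1, cost, self._h(1, cost)]


cache = GDSFCache(capacity=2, k=2.0)
cache.put("A", "result-A", cost=100)   # H = 100
cache.put("B", "result-B", cost=10)    # H = 10
cache.get("B")                         # Freq(B) = 2, H = 10 x 2^2 = 40
cache.put("C", "result-C", cost=50)    # evicts B (lowest H), L becomes 40
```

Setting K = 0 and dropping the frequency update would reduce this to plain GDS with H = Cost(q) + L.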

Performance: Static RC

Gains up to 3%!

Performance: Dynamic RC

Gains up to 6%!

Today’s talk

• Cost-aware caching strategies

• Cache invalidation

Motivation
• Higher cache capacity improves hit rate
• But results become stale [Cambazoglu et al., WWW 2010]

Solutions from the literature

• Decoupled: Time-to-live (TTL)
– refresh stale results when backend is idle
[Cambazoglu et al., WWW’10]

[Figure: result cache with entries <q1, R1, TTL(q1)>, <q2, R2, TTL(q2)>, …, <qk, Rk, TTL(qk)>; an incoming query qi is looked up in the cache.]
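
The decoupled TTL approach can be sketched as follows. This is a simplification with a single fixed TTL; the class name and methods are assumptions, and refreshing expired entries while the backend is idle is not modeled.

```python
class TTLCache:
    """Hypothetical result cache in which every entry expires after `ttl`."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.entries = {}   # query -> (result, expiry time)

    def put(self, query, result, now):
        self.entries[query] = (result, now + self.ttl)

    def get(self, query, now):
        entry = self.entries.get(query)
        if entry is None:
            return None                 # miss
        result, expires_at = entry
        if now >= expires_at:
            del self.entries[query]     # expired: treat as a miss
            return None
        return result


cache = TTLCache(ttl=10)
cache.put("q1", "R1", now=0)
fresh = cache.get("q1", now=5)      # still within the TTL
expired = cache.get("q1", now=10)   # TTL elapsed
```

The cache needs no information from the backend index, which is what makes the approach decoupled; the price is that an entry may be served stale until its TTL runs out, or discarded although it is still valid.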

Solutions from the literature

• Coupled: Cache invalidation policy (CIP)
[Blanco et al., SIGIR’10]
– Incremental index update
– Content changes sent to CIP module to invalidate
queries (offline)

[Figure: the CIP module receives all queries in the cache(s) and all changes in the backend index.]

Our contribution

• Devise a new invalidation mechanism
– better than TTL and close to CIP in detecting
stale results
– better than CIP and close to TTL in efficiency
and practicality

Timestamp-based Invalidation

• The value of the timestamp (TS) on an item shows the last time the item was updated
• The timestamp-based invalidation framework (TIF) has two components:
– Offline (indexing time): Decide on term and document timestamps
– Online (query time): Decide on the staleness of the query result

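TIF Architecture
[Figure: a query qi is looked up in the result cache, whose entries are <q1, R1, TS(q1)>, …, <qk, Rk, TS(qk)>. On a hit, <qi, Ri, TS(qi)> is sent to the search node, where the invalidation logic compares it with the document timestamps TS(d1) … TS(dD) and the term timestamps TS(t1) … TS(tT) and answers 0/1. On a miss or a stale result, the query is processed on the index and the results are returned. A document parser handles the documents assigned to the node and issues document TS updates and term TS updates.]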

TS Update Policies: Documents

• For a newly added document d
– TS(d) = now()
• For a deleted document d
– TS(d) = infinity
• For an updated document d
– if diff(d_new, d_old) > L, then TS(d) = now(), where L is a threshold
– diff(d_i, d_j) = |length(d_i) – length(d_j)|
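
The three document rules can be written down directly. In this sketch the timestamp table is a plain dict and the function names are hypothetical; the rules themselves follow the bullets above.

```python
import math


def on_add(doc_ts, doc_id, now):
    doc_ts[doc_id] = now


def on_delete(doc_ts, doc_id):
    # Infinity is newer than any query timestamp, so every cached result
    # that contains the deleted document will look stale.
    doc_ts[doc_id] = math.inf


def on_update(doc_ts, doc_id, old_length, new_length, threshold, now):
    # diff(d_new, d_old) is the absolute difference in document length.
    if abs(new_length - old_length) > threshold:
        doc_ts[doc_id] = now


doc_ts = {}
on_add(doc_ts, "d1", now=1.0)
on_update(doc_ts, "d1", old_length=100, new_length=105, threshold=10, now=2.0)
after_small_edit = doc_ts["d1"]   # unchanged: the length differs by only 5
on_update(doc_ts, "d1", old_length=105, new_length=150, threshold=10, now=3.0)
after_large_edit = doc_ts["d1"]
on_delete(doc_ts, "d1")
```

A larger threshold ignores more edits, which reduces needless invalidations at the risk of missing real changes.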

TS Update Policies: Terms

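• Frequency based update
– Initially: TS(t) = T0, PLL_TS = 5 (posting list length of t when TS(t) was last set)
– If number of added postings > F x PLL_TS: TS(t) = now(), PLL_TS = 6 (the new posting list length)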
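
A sketch of the frequency-based rule, assuming postings are added one at a time and that the counter of added postings restarts after each timestamp update. The class name, field names, and the value F = 0.1 are illustrative.

```python
class TermState:
    """Hypothetical per-term bookkeeping for the frequency-based policy."""

    def __init__(self, ts, length):
        self.ts = ts            # TS(t)
        self.pll_ts = length    # posting list length when TS(t) was last set
        self.added = 0          # postings added since then


def add_posting(state, f, now):
    state.added += 1
    if state.added > f * state.pll_ts:
        state.ts = now
        state.pll_ts += state.added   # new reference length
        state.added = 0


term = TermState(ts=0, length=5)
add_posting(term, f=0.1, now=7)   # 1 > 0.1 x 5, so TS(t) is updated
```

With the example values from the slide, one added posting moves PLL_TS from 5 to 6, which holds for any F below 0.2.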

TS Update Policies: Terms
• Score based update
– Posting list of term t: p1 p2 p3 p4 p5
– Sorted w.r.t. the scoring function: p4 p3 p2 p5 p1
– Initially: TS(t) = T0, STS = Score(p3)
– A new posting p6 is added: p1 p2 p3 p4 p5 p6
– If score of added posting > STS: TS(t) = now(), STS = re-sort & compute
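
A sketch of the score-based rule. The slides do not say which posting defines the threshold STS; here it is assumed to be the posting at a fixed rank of the sorted list (rank 2 reproduces STS = Score(p3) in the example). The class name, the rank parameter, and the scores are hypothetical.

```python
def score_threshold(scores, rank):
    """Score of the posting at position `rank` (1-based) in descending order."""
    ordered = sorted(scores, reverse=True)
    return ordered[min(rank, len(ordered)) - 1]


class ScoredTerm:
    def __init__(self, scores, ts, rank=2):
        self.scores = list(scores)
        self.ts = ts
        self.rank = rank
        self.sts = score_threshold(self.scores, rank)

    def add_posting(self, score, now):
        self.scores.append(score)
        if score > self.sts:
            self.ts = now
            self.sts = score_threshold(self.scores, self.rank)  # re-sort & compute


# Scores of p1..p5, chosen so that the sorted order is p4 p3 p2 p5 p1.
term = ScoredTerm([0.1, 0.5, 0.7, 0.9, 0.3], ts=0)
initial_sts = term.sts
term.add_posting(0.2, now=4)   # low-scoring posting: TS(t) unchanged
ts_after_low = term.ts
term.add_posting(0.8, now=5)   # beats STS: TS(t) updated, STS recomputed
```

The intent is that only postings likely to enter the top results of some query trigger an invalidation.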

Result Invalidation Policy

• A search node declares a result stale if:
– C1: ∃d ∈ R s.t. TS(d) > TS(q)
(d is deleted or revised after the generation of the query result)
or,
– C2: ∀t ∈ q: TS(t) > TS(q)
(all query terms appeared in new documents after the generation of the query result)
• Also apply TTL to avoid accumulation of stale results
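
Conditions C1 and C2, together with the TTL safeguard, can be sketched as a single predicate. The function name and argument layout are assumptions; timestamps are plain numbers.

```python
def is_stale(result_docs, query_terms, ts_q, doc_ts, term_ts, now=None, ttl=None):
    """Return True if the cached result generated at time ts_q looks stale."""
    # TTL safeguard: expire old entries regardless of timestamps.
    if ttl is not None and now is not None and now - ts_q > ttl:
        return True
    # C1: some document in the result was deleted or revised after ts_q.
    c1 = any(doc_ts[d] > ts_q for d in result_docs)
    # C2: every query term received an update after ts_q.
    c2 = bool(query_terms) and all(term_ts[t] > ts_q for t in query_terms)
    return c1 or c2


doc_ts = {"d1": 3, "d2": 8}
term_ts = {"web": 9, "search": 2, "cache": 7}

unchanged = is_stale(["d1"], ["web", "search"], 5, doc_ts, term_ts)
doc_revised = is_stale(["d1", "d2"], ["web", "search"], 5, doc_ts, term_ts)
all_terms_new = is_stale(["d1"], ["web", "cache"], 5, doc_ts, term_ts)
ttl_expired = is_stale(["d1"], ["web", "search"], 5, doc_ts, term_ts, now=50, ttl=30)
```

C2 requires all terms to be updated because a new document can enter the result of a conjunctive query only if it contains every query term.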

Simulation setup
• Data: English Wikipedia dump
– snapshot at Jan 1, 2006 ≈ 1 million pages
– All additions, deletions, and updates for the following 30 days
• Queries: 10,000 from the AOL log

Simulation setup

• Evaluation metrics [Blanco et al., SIGIR 2010]
– A query result is considered updated if the two top-10 lists are not exactly the same

False Positive Ratio = Redundant query executions / Number of unique queries

Stale Traffic Ratio = Stale results returned / Number of query occurrences
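
The two ratios and the top-10 comparison translate directly into code. The function names and the example counts are made up for illustration and are not results from the talk.

```python
def result_changed(cached_top10, fresh_top10):
    """A result counts as updated unless both top-10 lists are identical."""
    return list(cached_top10) != list(fresh_top10)


def false_positive_ratio(redundant_executions, unique_queries):
    return redundant_executions / unique_queries


def stale_traffic_ratio(stale_results_returned, query_occurrences):
    return stale_results_returned / query_occurrences


fp = false_positive_ratio(redundant_executions=25, unique_queries=100)
st = stale_traffic_ratio(stale_results_returned=10, query_occurrences=200)
same_docs_new_order = result_changed(["d1", "d2"], ["d2", "d1"])
```

Because the lists are compared in order, a mere swap of two documents already counts as an update.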

Performance: all queries

[Figure: two panels, "Frequency-based term TS update" and "Score-based term TS update".]

Discussion

Aspect | TIF | CIP
Data transfer | Send <q, R, TS(q)> to the search nodes | Send all <q, R> to CIP; send all docs to CIP
Invalidation operations | Compare TS values | Traverse the query index for every document

Conclusion & Future work

• Data on the Web is growing continuously
– Search efficiency is crucial!
– We present strategies for improving the performance of Web search engines
• Cost-aware strategies improve efficiency
• Practical invalidation methods with good accuracy
• Upcoming work on efficiency:
– New strategies for cache freshness & index pruning!

References: Our work
• [SIGIR 2011] Sadiye Alici, Ismail Sengör Altingövde, Rifat Ozcan, Berkant Barla Cambazoglu, Özgür Ulusoy: Timestamp-based result cache invalidation for web search engines. SIGIR 2011: 973-982
• [ECIR 2012] Sadiye Alici, Ismail Sengör Altingövde, Rifat Ozcan, Berkant Barla Cambazoglu, Özgür Ulusoy: Adaptive Time-to-Live Strategies for Query Result Caching in Web Search Engines. ECIR 2012: 401-412
• [TOIS 2012] Ismail Sengör Altingövde, Rifat Ozcan, Özgür Ulusoy: Static index pruning in web search engines: Combining term and document popularities with query views. ACM Trans. Inf. Syst. 30(1): 2 (2012)
• [TWEB 2011] Rifat Ozcan, Ismail Sengör Altingövde, Özgür Ulusoy: Cost-Aware Strategies for Query Result Caching in Web Search Engines. TWEB 5(2): 9 (2011)
• [SPIRE 2012] B. Barla Cambazoglu, Ismail Sengör Altingövde: Impact of Regionalization on Performance of Web Search Engine Result Caches. (to appear)
• [IPM 2012] R. Ozcan, I. S. Altingovde, B. B. Cambazoglu, F. P. Junqueira, Ö. Ulusoy: Five-level Static Cache Architecture for Web Search Engines. IPM, to appear.
• [ECIR 2011] Ismail Sengör Altingövde, Rifat Ozcan, Berkant Barla Cambazoglu, Özgür Ulusoy: Second Chance: A Hybrid Approach for Dynamic Result Caching in Search Engines. ECIR 2011: 510-516

Other References
• [Blanco et al., SIGIR 2010] Roi Blanco, Edward Bortnikov, Flavio
Junqueira, Ronny Lempel, Luca Telloli, Hugo Zaragoza: Caching
search engine results over incremental indices. SIGIR 2010: 82-89
• [Cambazoglu et al., WWW 2010] Berkant Barla Cambazoglu,
Flavio Paiva Junqueira, Vassilis Plachouras, Scott A. Banachowski,
Baoqiu Cui, Swee Lim, Bill Bridge: A refreshing perspective of
search engine caching. WWW 2010: 181-190
