Scientific and technical seminar by RuSSIR 2012 speakers Chirag Shah and Ismail Sengor Altingovde at the Yandex office in Moscow, August 3, 2012.
Ismail Sengor Altingovde is a senior researcher at the L3S Research Center in Hannover, Germany.
1. Efficiency Issues for Web Search Engines:
How to Make a Search Engine Return Results in a Hundred Milliseconds or Less?
Ismail Sengor Altingovde
L3S Research Center
2. Research interests
• Efficiency and scalability issues for Web-IR
• Social Web: Sentiment analysis, Social ranking
• Domain-specific search engines & focused crawling
• Web information extraction
• XML querying & searching
• Recommendation systems
• OCR & IR
• Web databases
3. Research interests: Today
• Efficiency and scalability issues for WebIR
– Caching
• Cost-aware result caching techniques [TWEB 2011, IPM 2012]
• Alternative result cache organizations [ECIR 2011]
• Cache freshness [SIGIR 2011, ECIR 2012]
• Regionalization & caching [SPIRE 2012]
– Static Index Pruning
• Query views in pruning algorithms [TOIS 2012]
• Correctness guarantees for pruning [in progress]
4. Search: what really happens is...
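[Figure: search architecture. The user's query reaches a broker, which contacts index servers (Indexer-1 … Indexer-N, each holding an inverted index) and document servers (Doc server-1, Doc server-2, …, holding the documents) before the query result is returned to the user. The steps are numbered 1 to 8.]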
5. Data items related to search
• Data items
– Posting lists (fetched)
– Intersections of posting lists (computed)
– Query results as doc-ids (computed)
– Documents (fetched)
– Query results as pages (computed)!
Cache 'em all!
6. Search: where caches come in?
[Figure: the same search architecture (user, broker, index servers, document servers) with three caches added: a result cache, a list cache, and a document cache.]
7. Caching for Web search
• Cache content is selected according to
– Frequency
– Recency
• Cache types
– Static (for longer term access patterns)
– Dynamic (for shorter term access patterns)
– Hybrid
8. How about costs?
• Miss-costs are not “uniform”!
• Costs are inter-related!
• Both the caching strategies and the
evaluation should consider costs!
9. Search: caches and costs
[Figure: the search architecture with a result cache, annotated with cost components: C_list and C_rank at the index servers, C_doc and C_snip at the document servers.]
Cost(q) = C_list + C_rank + C_doc + C_snip
10. Our contribution
• Cost-aware caching for Web search
– Single-level (result) caches
– Multi-level caches
• Costs are computed on-the-fly or simulated
• Gain of caching an item
– Time cost to produce or fetch (C_item)
– Storage space (S_item)
11. Motivating scenario: Single-level
• Result cache R
– Capacity(R) = 1 page
• Result pages for queries A and B
– Freq(A) = 10, Freq(B) = 20
– Cache result “B” for higher hit rate
• What if:
– Cost(A) = 100 ms, Cost(B) = 10 ms?
– Cache “A” for higher processing efficiency
Take costs into account while caching
(and evaluating)!
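The arithmetic behind this scenario can be checked with a small sketch. It assumes that a cached result saves its full processing cost on every request for that query; the function name `saved_time_ms` and the dictionary layout are illustrative, not from the slides.

```python
def saved_time_ms(freq: int, cost_ms: float) -> float:
    """Backend time saved if every request for the query is a cache hit (assumption)."""
    return freq * cost_ms

# Frequencies and costs taken from the slide.
queries = {
    "A": {"freq": 10, "cost_ms": 100},
    "B": {"freq": 20, "cost_ms": 10},
}

# Hit-rate view: cache the most frequent query.
best_by_freq = max(queries, key=lambda q: queries[q]["freq"])
# Cost-aware view: cache the query whose result saves the most processing time.
best_by_cost = max(queries, key=lambda q: saved_time_ms(**queries[q]))

print(best_by_freq, best_by_cost)
```

Under these assumptions caching A saves 10 x 100 = 1000 ms while caching B saves 20 x 10 = 200 ms, so the two criteria disagree.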
12. Motivating scenario: Multi-level
• Assume
– Freq(A) = 10, Freq(B) = 20,
– Cost(A) = 100ms, Cost(B) = 10 ms
– an additional list cache L
– all terms in query A are cached in L
– new Cost(A) = 10 ms (dropped from 100ms!)
– now it is better to cache result B
Take cost interdependencies into account
in multi-level caches.
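A minimal sketch of this interdependency, assuming the gain of caching a result is its frequency times its current miss cost. The helper names are hypothetical; the numbers are the ones on the slide.

```python
freq = {"A": 10, "B": 20}

# Miss costs in ms before and after the list cache L holds all terms of query A.
cost_without_list_cache = {"A": 100, "B": 10}
cost_with_list_cache = {"A": 10, "B": 10}

def result_cache_gain(query: str, costs: dict) -> float:
    """Time saved by keeping the query's result in the result cache (assumption)."""
    return freq[query] * costs[query]

def best_result_to_cache(costs: dict) -> str:
    """Pick the single result page with the highest gain."""
    return max(freq, key=lambda q: result_cache_gain(q, costs))

print(best_result_to_cache(cost_without_list_cache))
print(best_result_to_cache(cost_with_list_cache))
```

Once the list cache lowers Cost(A) to 10 ms, the gain of caching A falls from 1000 ms to 100 ms, below the 200 ms gained by caching B.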
14. Cost-aware result caching (RC)
• Key idea: Embed query processing cost into the result
caching strategies
• Static caching
– Knapsack problem
• Query results have values and sizes
• Greedy solution: order in Value/Size
– MostFreq Strategy (baseline)
• Value(q) = Freq(q)
• Unit space per result
15. Cost-aware static RC
Static cost-aware caching strategies:
• FreqThenCost
– Sort first by Freq(q) and then Cost(q)
• StabilityThenCost
– Stability of the frequency in succeeding time intervals
• Freq&Cost
– Value(q) = Cost(q) x Freq(q)^K, K > 1
– Why Freq(q)^K?
• Queries with very low frequencies may disappear in the future
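The Freq&Cost strategy can be sketched as a greedy selection, assuming unit space per result as in the MostFreq baseline. The statistics below are made-up illustration values, and `select_static_cache` is a hypothetical helper name.

```python
def select_static_cache(stats: dict, capacity: int, k: float = 2.0) -> list:
    """Greedy static cache fill: keep the `capacity` queries with the highest value.

    stats maps query -> (freq, cost_ms); every result occupies one unit of space,
    so ordering by Value/Size reduces to ordering by Value(q) = Cost(q) * Freq(q)**k.
    """
    def value(query):
        freq, cost = stats[query]
        return cost * freq ** k

    ranked = sorted(stats, key=value, reverse=True)
    return ranked[:capacity]

# Hypothetical training-log statistics: query -> (frequency, cost in ms).
stats = {"a": (50, 5.0), "b": (2, 400.0), "c": (20, 60.0), "d": (1, 900.0)}

print(select_static_cache(stats, capacity=2, k=1.0))
print(select_static_cache(stats, capacity=2, k=2.0))
```

With k = 1 the expensive singleton query "d" makes it into the cache; with k = 2 its low frequency is penalized and the frequent query "a" takes its place, which is the effect the slide attributes to K > 1.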
16. Cost-aware dynamic RC
• Baselines: LRU, LFU
• LCU: Least costly cached item is evicted
• LFCU_K: Least Frequently and Costly Used
– Cost(q) x Freq(q)^K, K > 1
• GDS: Greedy Dual Size [Cao&Irani, 1997]
– H = Cost(q) + L, where L is an aging factor set to the H value of the most recently evicted item
• GDSF_K: Greedy Dual Size Frequency
– H = Cost(q) x Freq(q)^K + L [Arlitt et al., 2000]
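A compact sketch of a GDSF_K-style result cache follows. It assumes unit-size entries, counts frequency only while an item is cached, and finds the victim with a linear scan for readability; the class and method names are illustrative and a production cache would use a priority queue.

```python
class GDSFKCache:
    """Result cache that evicts the entry with the smallest H = cost * freq**k + L."""

    def __init__(self, capacity: int, k: float = 2.0):
        assert capacity >= 1
        self.capacity = capacity
        self.k = k
        self.L = 0.0        # aging factor: H value of the last evicted entry
        self.entries = {}   # query -> [result, cost, freq, H]

    def _h(self, cost: float, freq: int) -> float:
        return cost * freq ** self.k + self.L

    def get(self, query):
        entry = self.entries.get(query)
        if entry is None:
            return None                       # miss: caller computes and calls put()
        entry[2] += 1                         # one more hit
        entry[3] = self._h(entry[1], entry[2])
        return entry[0]

    def put(self, query, result, cost: float):
        if query not in self.entries and len(self.entries) >= self.capacity:
            victim = min(self.entries, key=lambda q: self.entries[q][3])
            self.L = self.entries[victim][3]  # age the cache
            del self.entries[victim]
        self.entries[query] = [result, cost, 1, self._h(cost, 1)]


cache = GDSFKCache(capacity=2)
cache.put("cheap", "r1", cost=10)
cache.put("costly", "r2", cost=100)
cache.put("new", "r3", cost=50)   # cache is full: "cheap" has the lowest H and is evicted
```

Setting k = 0 in `_h` makes the frequency factor 1, so the same class behaves like the GDS policy with unit sizes.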
20. Motivation
• Higher cache capacity improves hit rate
• But results become stale [Cambazoglu et al. WWW2010]
21. Solutions from the literature
• Decoupled: Time-to-live (TTL)
– refresh stale results when backend is idle
[Cambazoglu et al., WWW’10]
[Figure: a result cache in which each cached query q_i (i = 1 … k) is stored with its result R_i and a time-to-live TTL(q_i).]
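A minimal sketch of the TTL idea, assuming each entry carries its own expiry time and the caller supplies the clock value. The class name is hypothetical, and refreshing stale results while the backend is idle is not modeled here.

```python
class TTLResultCache:
    """Result cache whose entries expire a fixed time after they were stored."""

    def __init__(self):
        self.entries = {}  # query -> (result, expiry time)

    def put(self, query, result, now: float, ttl: float):
        self.entries[query] = (result, now + ttl)

    def get(self, query, now: float):
        entry = self.entries.get(query)
        if entry is None:
            return None
        result, expires_at = entry
        if now >= expires_at:
            # Expired: treat as a miss so the backend recomputes the result.
            del self.entries[query]
            return None
        return result


ttl_cache = TTLResultCache()
ttl_cache.put("q1", ["d1", "d2"], now=0.0, ttl=60.0)
fresh = ttl_cache.get("q1", now=30.0)    # still within the TTL
expired = ttl_cache.get("q1", now=60.0)  # TTL elapsed
```

The policy is decoupled in the sense that it never looks at the index: an entry expires on schedule whether or not the underlying documents changed.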
22. Solutions from the literature
• Coupled: Cache invalidation policy (CIP)
[Blanco et al., SIGIR’10]
– Incremental index update
– Content changes sent to CIP module to invalidate
queries (offline)
[Figure: the CIP module receives all queries in the cache(s) and all changes in the backend index.]
23. Our contribution
• Devise a new invalidation mechanism
– better than TTL and close to CIP in detecting
stale results
– better than CIP and close to TTL in efficiency
and practicality
24. Timestamp-based Invalidation
• The value of the timestamp (TS) on an item shows the
last time the item was updated
• The timestamp-based invalidation framework (TIF) has two components:
– Offline (indexing time) : Decide on term and
document timestamps
– Online (query time): Decide on the staleness of the
query result
26. TS Update Policies: Documents
• For a newly added document d
– TS(d) = now()
• For a deleted document d
– TS(d) = infinity
• For an updated document d
– if diff(d_new, d_old) > L, then TS(d) = now()
– diff(d_i, d_j) = |length(d_i) – length(d_j)|
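The three document rules can be written as one function. This is a sketch: the event names, the parameter names, and the explicit `now` argument are assumptions made for illustration, and `threshold` plays the role of L on the slide.

```python
import math

def doc_timestamp(event: str, current_ts: float, now: float,
                  old_len: int = 0, new_len: int = 0, threshold: int = 0) -> float:
    """Return TS(d) after an index event on document d."""
    if event == "add":
        return now
    if event == "delete":
        # An infinite timestamp is newer than any query timestamp.
        return math.inf
    if event == "update":
        # diff(d_new, d_old) is the absolute difference in document length.
        if abs(new_len - old_len) > threshold:
            return now
        return current_ts  # small change: keep the old timestamp
    raise ValueError(f"unknown event: {event}")
```

For example, with `threshold=50` an update that changes the length from 1000 to 1010 keeps the old timestamp, while a change from 1000 to 1200 sets it to `now`.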
27. TS Update Policies: Terms
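• Frequency-based update
– Each term t keeps TS(t) and PLL_TS, the posting list length recorded with the timestamp
– Example: TS(t) = T0, PLL_TS = 5
– If the number of added postings > F x PLL_TS, then TS(t) = now() (in the example, PLL_TS becomes 6)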
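A sketch of the frequency-based term update, assuming that PLL_TS is reset to the current posting list length whenever the timestamp is refreshed and that only additions are counted. The `TermState` class and its field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TermState:
    ts: float        # TS(t)
    pll_ts: int      # posting list length when TS(t) was last set
    added: int = 0   # postings added since then

def add_posting(state: TermState, now: float, f: float) -> None:
    """Register one new posting for the term and refresh TS(t) if the threshold is passed."""
    state.added += 1
    if state.added > f * state.pll_ts:
        state.ts = now
        state.pll_ts += state.added  # assumption: record the new list length
        state.added = 0


term = TermState(ts=0.0, pll_ts=5)
add_posting(term, now=10.0, f=0.1)  # 1 > 0.1 * 5, so the timestamp is refreshed
```

A larger F makes term timestamps change less often, which trades fewer invalidations for a higher risk of serving stale results.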
29. Result Invalidation Policy
• A search node deems a result stale if:
– C1: ∃d ∈ R s.t. TS(d) > TS(q)
(d is deleted or revised after the generation of the query
result)
or,
– C2: ∀t ∈ q, TS(t) > TS(q)
(all query terms appeared in new documents
after the generation of the query result)
• Also apply TTL to avoid stale accumulation
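The two conditions translate directly into a predicate. This sketch assumes timestamps are kept in plain dictionaries and that the query has at least one term; the function and parameter names are illustrative, and the additional TTL check is left out.

```python
import math

def is_stale(query_terms, result_docs, ts_q, doc_ts, term_ts) -> bool:
    """Apply conditions C1 and C2 to a cached result generated at time ts_q."""
    # C1: some document in the result was deleted or revised after ts_q.
    c1 = any(doc_ts[d] > ts_q for d in result_docs)
    # C2: every query term received a newer timestamp than the result.
    c2 = bool(query_terms) and all(term_ts[t] > ts_q for t in query_terms)
    return c1 or c2


doc_ts = {"d1": 5.0, "d2": math.inf, "d3": 2.0}   # d2 has been deleted
term_ts = {"web": 9.0, "search": 1.0, "cache": 8.0}

stale_by_c1 = is_stale(["search"], ["d1", "d2"], 6.0, doc_ts, term_ts)
stale_by_c2 = is_stale(["web", "cache"], ["d3"], 6.0, doc_ts, term_ts)
still_fresh = is_stale(["web", "search"], ["d1", "d3"], 6.0, doc_ts, term_ts)
```

The check only needs the cached result, its timestamp, and the timestamps held at the search node.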
30. Simulation setup
• Data: English Wikipedia dump
– snapshot at Jan 1, 2006 ≈ 1 million pages
– All additions, deletions, and updates for the following 30 days
• Queries: 10,000 from AOL log
31. Simulation setup
• Evaluation metrics [Blanco 2010]
– A query result counts as updated if its old and new top-10 lists
are not exactly the same
– False Positive Ratio = (Redundant query executions) / (Number of unique queries)
– Stale Traffic Ratio = (Stale results returned) / (Number of query occurrences)
33. Discussion
                          TIF                         CIP
Data transfer             Send <q, R, TS(q)> to       Send all <q, R> to CIP;
                          the search nodes            send all docs to CIP
Invalidation operations   Compare TS values           Traverse the query index
                                                      for every document
34. Conclusion & Future work
• Data on the Web is growing continuously
– Search efficiency is crucial!
– We present strategies for improving the performance
of Web search engines
• Cost aware strategies improve efficiency
• Practical invalidation methods with good accuracy
• Upcoming work on efficiency:
– New strategies for cache freshness & index pruning!