SlideShare uma empresa Scribd logo
1 de 40
Baixar para ler offline
Low-cost Management of Inverted
Files for Online Full-Text Search
Giorgos Margaritis , Stergios V. Anastasiadis
Systems Research Group, University of Ioannina, Greece
CIKM ‘09, Hong Kong, November 2 - 6, 2009
Introduction
• Prices of hard disks drop
 Amount of stored data increases
 Need for automated search
• Full-text search
 User: defines terms and operands (e.g. logical)
 System: finds documents that contain terms and satisfy operands
 usually sorts returned documents by "relevance"
• Online full-text search
 Index updates concurrent with document searches
 Need to constantly maintain index up-to-date
Giorgos Margaritis - University of Ioannina, GreeceNovember 2009
Inverted Index
• For each term store an inverted (or posting) list
 list of pointers to documents containing term
 pointers specify exact position in document (e.g. <doc, offset>)
 each pointer is called posting
• Lexicon
 associates terms to their inverted lists
November 2009 Giorgos Margaritis - University of Ioannina, Greece
Document 2Document 1
and
dark
did
had
……..
where
Lexicon Inverted List
. . .
Build time vs. Search time Trade-off
• Contiguous disk storage of inverted lists:
 improves access efficiency
 needs complex dynamic storage management
 requires frequent or bulky relocations
• List fragmentation
 improves build time
 degrades search time
Giorgos Margaritis - University of Ioannina, GreeceNovember 2009
Inverted List Contiguity
• Recent methods
 Maintain a (relatively small) number of partial indexes
 Retrieve an inverted list: retrieve its fragments from all partial indexes
• Majority of inverted lists
 Disk size of hundreds of kilobytes
• Fragmentation can really harm retrieval times of small lists
 More than 99% of terms have list sizes < 10KB (426GB text collection)
 Typical disk: 12ms positional time + 0.1ms to sequentially read 10KBb
 N fragments N positional times N x (contiguous list retrieval time)
Giorgos Margaritis - University of Ioannina, GreeceNovember 2009
Inverted List Contiguity
• Hybrid Logarithmic Merge [Büttcher et al., 2006]
 Create inverted index for 426GB
November 2009 Giorgos Margaritis - University of Ioannina, Greece
0
1
2
3
4
5
6
7
0
19
39
58
77
97
116
136
155
174
194
213
232
252
271
290
310
329
349
368
387
407
426
Gigabytes parsed
#Partialindexes
Design Objectives
• 1. Keep small lists contiguous
• 2. Keep large lists contiguous, or in few large fragments
• 3. Keep build time low
• 4. Retrieve lists fast
November 2009 Giorgos Margaritis - University of Ioannina, Greece
Outline
• Motivation
• Background
• Proposed Algorithm
• Experiments
• Conclusions
Giorgos Margaritis - University of Ioannina, GreeceNovember 2009
<it, 10, 1>
<was, 10, 2>
<a, 10, 3>
<dragon, 10, 4>
Creating an Inverted Index
November 2009 Giorgos Margaritis - University of Ioannina, Greece
<9:2,7> <10:2>Parser
doc: 10
It was a
dragon.
For each document:
1. Parse document into postings
2. Accumulate postings into memory
3. If memory full, flush postings
to disk, updating inverted lists
was: <3:8> <5:2> <9:2,7> <10:2>
was
memory
full
Postings Flushing: Re-Merge
• Re-Merge [Lester et al., 2004]
 Inverted lists stored in lexicographic order
 For each list: 1. Read in memory, 2. Add new postings, 3. Write to disk
• Cost
 Efficient for lists with few postings (sequential read)
 Retrieve ALL lists from disk and write them ALL back to disk
November 2009 Giorgos Margaritis - University of Ioannina, Greece
Memory
Disk
Postings Flushing: In-Place Update
• In-Place Update [Tomasic et al., 1994]
 Create list: pre-allocate some empty space after each it
 For each list: append new postings on disk.
If not enough space, relocate list in position with sufficient space
• Cost
 Efficient for lists with lots of postings (append instead of read + write)
 Lists can't be kept in lexicographical order random seeks for each list
November 2009 Giorgos Margaritis - University of Ioannina, Greece
Memory
Disk
Postings Flushing: Hybrid Merge
• Hybrid Merge [Büttcher et al., 2008]
 Categorizes terms into short (rare) and long (frequent), based on their
inverted lists length
 Use Re-Merge method for short terms and In-Place Update for long
• Combines the benefits of the two methods
 Re-Merge: proper for updating a big number of short lists
 In-Place Update: proper for updating a small number of long lists
• Two problems
 Cost of reading and writing back all short lists for Re-Merge
 Cost of relocations for In-Place Update
November 2009 Giorgos Margaritis - University of Ioannina, Greece
Outline
• Motivation
• Background
• Proposed Algorithm
• Experiments
• Conclusions
Giorgos Margaritis - University of Ioannina, GreeceNovember 2009
System Architecture
• Organize disk index in fixed-size blocks
• Categorize terms in short or long, based on their list size
• Group short terms into disjoint lexicographical ranges
 Lists of a range must fit into one block
• A block may contain:
 a group of short lists, updated using "read lists, add postings, write lists"
 postings of a long list, updated using appends
• When memory is full:
 selectively flush some postings from memory to disk
 update only those blocks that can be efficiently updated
November 2009 Giorgos Margaritis - University of Ioannina, Greece
Proteus
November 2009 Giorgos Margaritis - University of Ioannina, Greece
on-disk inverted lists
MEMORY
DISK
new postings
Proteus
November 2009 Giorgos Margaritis - University of Ioannina, Greece
on-disk inverted lists
MEMORY
DISK
"a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable
new postings
Proteus
November 2009 Giorgos Margaritis - University of Ioannina, Greece
on-disk inverted lists
MEMORY
DISK
"a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable
new postings
Parameters
• Preference Factor (F)
 Prefer to flush long term T instead of range R:
 flush long term: cheap – single disk append
 flush range: costly – read all lists, add postings, write all lists
 Caveat: if T has too few memory postings does not worth flushing
 if T has F times less postings than R, flush R
• Flush Memory (M)
 Flush only a small portion of memory (e.g. M = 20MB)
 Update only blocks that can be efficiently updated
 Free some memory space
November 2009 Giorgos Margaritis - University of Ioannina, Greece
"Selective Range Flush" Algorithm
November 2009 Giorgos Margaritis - University of Ioannina, Greece
1. Parse document into postings
2. When memory is full:
While ( size of flushed postings < M )
R := range with max postings in memory ;
T := long term with max postings in memory ;
If ( R.mem_postings < F  T.mem_postings ) then
flush T ; /* append new postings on disk */
Else
flush R ; /* read lists from block, add new postings, write lists back to block */
3. Return to parsing phase
Proteus
November 2009 Giorgos Margaritis - University of Ioannina, Greece
on-disk inverted lists
MEMORY
DISK
"a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable
new postings
F = 2
M = 15
"a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable
Proteus
November 2009 Giorgos Margaritis - University of Ioannina, Greece
on-disk inverted lists
R T
MEMORY
DISK
new postings
F = 2
M = 15
"a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable
Proteus
November 2009 Giorgos Margaritis - University of Ioannina, Greece
on-disk inverted lists
R T
MEMORY
DISK
new postings
F = 2
M = 15
postings of R < F  postings of T
"a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable
Proteus
November 2009 Giorgos Margaritis - University of Ioannina, Greece
new postings
on-disk inverted lists
R
MEMORY
DISK
T
F = 2
M = 15
"a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable
Proteus
November 2009 Giorgos Margaritis - University of Ioannina, Greece
new postings
on-disk inverted lists
TR
MEMORY
DISK
F = 2
M = 15
"a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable
Proteus
November 2009 Giorgos Margaritis - University of Ioannina, Greece
new postings
on-disk inverted lists
TR
MEMORY
DISK
F = 2
M = 15
postings of R > F  postings of T
"a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable
Proteus
November 2009 Giorgos Margaritis - University of Ioannina, Greece
new postings
on-disk inverted lists
TR
MEMORY
DISK
F = 2
M = 15
"a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable
Proteus
November 2009 Giorgos Margaritis - University of Ioannina, Greece
new postings
on-disk inverted lists
TR
MEMORY
DISK
F = 2
M = 15
we have flushed M postings
Block Overflow
Block of range R overflows:
 allocate block
 move half of postings to new block
 split R according to the lists each block contains
• Last block of list of long term T overflows:
 allocate block
 move overflown postings to new block
November 2009 Giorgos Margaritis - University of Ioannina, Greece
"cat" – "dog"
"cat" – "day" "daze" - dog"
Innovative Work
• Simplify disk management by splitting postings into disk blocks
 Relax requirement of full contiguity for long terms
 Maintain requirement of full contiguity for each short term
• Treat short terms in ranges for efficiency
• Partially flush both short and long terms when memory full
• Dynamically decide whether to flush long term or range
November 2009 Giorgos Margaritis - University of Ioannina, Greece
Outline
• Motivation
• Background
• Proposed Algorithm
• Experiments
• Conclusions
Giorgos Margaritis - University of Ioannina, GreeceNovember 2009
Experiments
• Experimental environment
 Nodes: quad-core at 2.33Ghz, 3GB memory, 2 SATA disks (500GB each)
 Text collection: 426GB GOV2 TREC Terabyte track
 Search queries: Efficiency Topics query set of TREC 2005 Terabyte Track
• Proteus: Selective Range Flush prototype implementation
 Parser from Zettair search engine (RMIT University, Australia)
• Compare with Hybrid Merge
 Wumpus search engine (U. Waterloo, Canada)
November 2009 Giorgos Margaritis - University of Ioannina, Greece
Parameters
Parameter Default value
Block size 8MB
Postings memory 1GB
Flush memory 20MB
Preference factor 3
Long list threshold 1MB
November 2009 Giorgos Margaritis - University of Ioannina, Greece
List Retrieval
• Proteus short list retrieval times similar to Hybrid Merge
 both systems keep short lists contiguous
• Proteus long list retrieval times 2-3 times lower than Hybrid Merge
 due to Wumpus implementation, not H.M. algorithm
• Proteus "CNT" variant:
 keeps long lists contiguous using relocations
November 2009 Giorgos Margaritis - University of Ioannina, Greece
22,2 22,9
0
10
20
30
40
50
H.M. Wumpus Proteus
milliseconds
Short Terms: Average Retrieval Time
416
145 126
0
100
200
300
400
500
H.M.
Wumpus
Proteus Proteus CNT
milliseconds
Long Terms: Average Retrieval Time
Inverted Index Construction
• Proteus:
 30% reduction in build time comparing to H.M.
• Proteus CNT:
 3% more time (14 min) to keep all lists contiguous
November 2009 Giorgos Margaritis - University of Ioannina, Greece
730
514 528
0
100
200
300
400
500
600
700
800
H.M. Wumpus Proteus Proteus CNT
minutes
Index build time
• Increase block size
 increase range flushing time:
 more postings per range block
 more bytes transferred between memory and disk per flush
 decrease long term flushing time:
 same amount of bytes flushed to disk, fewer appends
Sensitivity Analysis: Block size
November 2009 Giorgos Margaritis - University of Ioannina, Greece
Block size
0
200
400
600
800
2MB 4MB 8MB 16MB 32MB
minutes
Index Build
Flush ranges
Flush long terms
Document parsing
• Small values:
 don’t create sufficient space to accumulate new postings
• Large values:
 flush ranges and long terms with very few postings 
 disk head movement overhead
Flush memory
November 2009 Giorgos Margaritis - University of Ioannina, Greece
Flush Memory
0
100
200
300
400
500
600
2MB 5MB 10MB 20MB 40MB 80MB
minutes Index Build
Flush ranges
Flush long terms
Document parsing
• Small values:
 don’t create sufficient space to accumulate new postings
• Large values:
 flush ranges and long terms with very few postings 
 disk head movement overhead
Flush memory
November 2009 Giorgos Margaritis - University of Ioannina, Greece
Flush Memory
0
100
200
300
400
500
600
2MB 5MB 10MB 20MB 40MB 80MB
minutes Index Build
Flush ranges
Flush long terms
Document parsing
"Flush only a small percentage (2% - 3%)
of total memory ..."
• Small values:
 underestimate cost of flushing ranges
• Large values:
 underestimate cost of flushing long terms
Preference Factor
November 2009 Giorgos Margaritis - University of Ioannina, Greece
0
100
200
300
400
500
600
1 2 3 4 8 16
minutes Index Build
Flush ranges
Flush long terms
Document parsing
Preference Factor
• Small values:
 underestimate cost of flushing ranges
• Large values:
 underestimate cost of flushing long terms
Preference Factor
November 2009 Giorgos Margaritis - University of Ioannina, Greece
0
100
200
300
400
500
600
1 2 3 4 8 16
minutes Index Build
Flush ranges
Flush long terms
Document parsing
Preference Factor
"Postpone flushing of ranges,
but not too much …"
Conclusions & Future work
• Fragmentation can harm retrieval times for short lists
• Proteus:
 manage inverted index in blocks
 keep short lists contiguous, long lists in large fragments
 memory full: selectively flush some postings
 update only blocks that can be efficiently updated
 30% reduction in build time
 constant 15% increase in retrieval times of lists > 8MB
 3% increase in build time to keep all lists contiguous
 relatively insensitive to parameter values
• Future work:
 different block sizes for short and long terms, auto adjust parameter
values based on dataset / hardware, different cost models for flushing
Giorgos Margaritis - University of Ioannina, GreeceNovember 2009

Mais conteúdo relacionado

Último

How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 

Último (20)

How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 

Destaque

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by HubspotMarius Sescu
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTExpeed Software
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 

Destaque (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Low-cost Management of Inverted Files for Online Full-Text Search

  • 1. Low-cost Management of Inverted Files for Online Full-Text Search Giorgos Margaritis , Stergios V. Anastasiadis Systems Research Group, University of Ioannina, Greece CIKM ‘09, Hong Kong, November 2 - 6, 2009
  • 2. Introduction • Prices of hard disks drop  Amount of stored data increases  Need for automated search • Full-text search  User: defines terms and operands (e.g. logical)  System: finds documents that contain terms and satisfy operands  usually sorts returned documents by "relevance" • Online full-text search  Index updates concurrent with document searches  Need to constantly maintain index up-to-date Giorgos Margaritis - University of Ioannina, GreeceNovember 2009
  • 3. Inverted Index • For each term store an inverted (or posting) list  list of pointers to documents containing term  pointers specify exact position in document (e.g. <doc, offset>)  each pointer is called posting • Lexicon  associates terms to their inverted lists November 2009 Giorgos Margaritis - University of Ioannina, Greece Document 2Document 1 and dark did had …….. where Lexicon Inverted List . . .
  • 4. Build time vs. Search time Trade-off • Contiguous disk storage of inverted lists:  improves access efficiency  needs complex dynamic storage management  requires frequent or bulky relocations • List fragmentation  improves build time  degrades search time Giorgos Margaritis - University of Ioannina, GreeceNovember 2009
  • 5. Inverted List Contiguity • Recent methods  Maintain a (relatively small) number of partial indexes  Retrieve an inverted list: retrieve its fragments from all partial indexes • Majority of inverted lists  Disk size of hundreds of kilobytes • Fragmentation can really harm retrieval times of small lists  More than 99% of terms have list sizes < 10KB (426GB text collection)  Typical disk: 12ms positional time + 0.1ms to sequentially read 10KBb  N fragments N positional times N x (contiguous list retrieval time) Giorgos Margaritis - University of Ioannina, GreeceNovember 2009
  • 6. Inverted List Contiguity • Hybrid Logarithmic Merge [Büttcher et al., 2006]  Create inverted index for 426GB November 2009 Giorgos Margaritis - University of Ioannina, Greece 0 1 2 3 4 5 6 7 0 19 39 58 77 97 116 136 155 174 194 213 232 252 271 290 310 329 349 368 387 407 426 Gigabytes parsed #Partialindexes
  • 7. Design Objectives • 1. Keep small lists contiguous • 2. Keep large lists contiguous, or in few large fragments • 3. Keep build time low • 4. Retrieve lists fast November 2009 Giorgos Margaritis - University of Ioannina, Greece
  • 8. Outline • Motivation • Background • Proposed Algorithm • Experiments • Conclusions Giorgos Margaritis - University of Ioannina, GreeceNovember 2009
  • 9. <it, 10, 1> <was, 10, 2> <a, 10, 3> <dragon, 10, 4> Creating an Inverted Index November 2009 Giorgos Margaritis - University of Ioannina, Greece <9:2,7> <10:2>Parser doc: 10 It was a dragon. For each document: 1. Parse document into postings 2. Accumulate postings into memory 3. If memory full, flush postings to disk, updating inverted lists was: <3:8> <5:2> <9:2,7> <10:2> was memory full
  • 10. Postings Flushing: Re-Merge • Re-Merge [Lester et al., 2004]  Inverted lists stored in lexicographic order  For each list: 1. Read in memory, 2. Add new postings, 3. Write to disk • Cost  Efficient for lists with few postings (sequential read)  Retrieve ALL lists from disk and write them ALL back to disk November 2009 Giorgos Margaritis - University of Ioannina, Greece Memory Disk
  • 11. Postings Flushing: In-Place Update • In-Place Update [Tomasic et al., 1994]  Create list: pre-allocate some empty space after each it  For each list: append new postings on disk. If not enough space, relocate list in position with sufficient space • Cost  Efficient for lists with lots of postings (append instead of read + write)  Lists can't be kept in lexicographical order random seeks for each list November 2009 Giorgos Margaritis - University of Ioannina, Greece Memory Disk
  • 12. Postings Flushing: Hybrid Merge • Hybrid Merge [Büttcher et al., 2008]  Categorizes terms into short (rare) and long (frequent), based on their inverted lists length  Use Re-Merge method for short terms and In-Place Update for long • Combines the benefits of the two methods  Re-Merge: proper for updating a big number of short lists  In-Place Update: proper for updating a small number of long lists • Two problems  Cost of reading and writing back all short lists for Re-Merge  Cost of relocations for In-Place Update November 2009 Giorgos Margaritis - University of Ioannina, Greece
  • 13. Outline • Motivation • Background • Proposed Algorithm • Experiments • Conclusions Giorgos Margaritis - University of Ioannina, GreeceNovember 2009
  • 14. System Architecture • Organize disk index in fixed-size blocks • Categorize terms in short or long, based on their list size • Group short terms into disjoint lexicographical ranges  Lists of a range must fit into one block • A block may contain:  a group of short lists, updated using "read lists, add postings, write lists"  postings of a long list, updated using appends • When memory is full:  selectively flush some postings from memory to disk  update only those blocks that can be efficiently updated November 2009 Giorgos Margaritis - University of Ioannina, Greece
  • 15. Proteus November 2009 Giorgos Margaritis - University of Ioannina, Greece on-disk inverted lists MEMORY DISK new postings
  • 16. Proteus November 2009 Giorgos Margaritis - University of Ioannina, Greece on-disk inverted lists MEMORY DISK "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable new postings
  • 17. Proteus November 2009 Giorgos Margaritis - University of Ioannina, Greece on-disk inverted lists MEMORY DISK "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable new postings
  • 18. Parameters • Preference Factor (F)  Prefer to flush long term T instead of range R:  flush long term: cheap – single disk append  flush range: costly – read all lists, add postings, write all lists  Caveat: if T has too few memory postings does not worth flushing  if T has F times less postings than R, flush R • Flush Memory (M)  Flush only a small portion of memory (e.g. M = 20MB)  Update only blocks that can be efficiently updated  Free some memory space November 2009 Giorgos Margaritis - University of Ioannina, Greece
  • 19. "Selective Range Flush" Algorithm November 2009 Giorgos Margaritis - University of Ioannina, Greece 1. Parse document into postings 2. When memory is full: While ( size of flushed postings < M ) R := range with max postings in memory ; T := long term with max postings in memory ; If ( R.mem_postings < F  T.mem_postings ) then flush T ; /* append new postings on disk */ Else flush R ; /* read lists from block, add new postings, write lists back to block */ 3. Return to parsing phase
  • 20. Proteus November 2009 Giorgos Margaritis - University of Ioannina, Greece on-disk inverted lists MEMORY DISK "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable new postings F = 2 M = 15
  • 21. "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable Proteus November 2009 Giorgos Margaritis - University of Ioannina, Greece on-disk inverted lists R T MEMORY DISK new postings F = 2 M = 15
  • 22. "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable Proteus November 2009 Giorgos Margaritis - University of Ioannina, Greece on-disk inverted lists R T MEMORY DISK new postings F = 2 M = 15 postings of R < F  postings of T
  • 23. "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable Proteus November 2009 Giorgos Margaritis - University of Ioannina, Greece new postings on-disk inverted lists R MEMORY DISK T F = 2 M = 15
  • 24. "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable Proteus November 2009 Giorgos Margaritis - University of Ioannina, Greece new postings on-disk inverted lists TR MEMORY DISK F = 2 M = 15
  • 25. "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable Proteus November 2009 Giorgos Margaritis - University of Ioannina, Greece new postings on-disk inverted lists TR MEMORY DISK F = 2 M = 15 postings of R > F  postings of T
  • 26. "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable Proteus November 2009 Giorgos Margaritis - University of Ioannina, Greece new postings on-disk inverted lists TR MEMORY DISK F = 2 M = 15
  • 27. "a" – "lost" "lot " – "pass" "past" – "zoo" "an" "the" termtablerangetable Proteus November 2009 Giorgos Margaritis - University of Ioannina, Greece new postings on-disk inverted lists TR MEMORY DISK F = 2 M = 15 we have flushed M postings
  • 28. Block Overflow Block of range R overflows:  allocate block  move half of postings to new block  split R according to the lists each block contains • Last block of list of long term T overflows:  allocate block  move overflown postings to new block November 2009 Giorgos Margaritis - University of Ioannina, Greece "cat" – "dog" "cat" – "day" "daze" - dog"
  • 29. Innovative Work • Simplify disk management by splitting postings into disk blocks  Relax requirement of full contiguity for long terms  Maintain requirement of full contiguity for each short term • Treat short terms in ranges for efficiency • Partially flush both short and long terms when memory full • Dynamically decide whether to flush long term or range November 2009 Giorgos Margaritis - University of Ioannina, Greece
  • 30. Outline • Motivation • Background • Proposed Algorithm • Experiments • Conclusions Giorgos Margaritis - University of Ioannina, GreeceNovember 2009
  • 31. Experiments • Experimental environment  Nodes: quad-core at 2.33Ghz, 3GB memory, 2 SATA disks (500GB each)  Text collection: 426GB GOV2 TREC Terabyte track  Search queries: Efficiency Topics query set of TREC 2005 Terabyte Track • Proteus: Selective Range Flush prototype implementation  Parser from Zettair search engine (RMIT University, Australia) • Compare with Hybrid Merge  Wumpus search engine (U. Waterloo, Canada) November 2009 Giorgos Margaritis - University of Ioannina, Greece
  • 32. Parameters Parameter Default value Block size 8MB Postings memory 1GB Flush memory 20MB Preference factor 3 Long list threshold 1MB November 2009 Giorgos Margaritis - University of Ioannina, Greece
  • 33. List Retrieval • Proteus short list retrieval times similar to Hybrid Merge  both systems keep short lists contiguous • Proteus long list retrieval times 2-3 times lower than Hybrid Merge  due to Wumpus implementation, not H.M. algorithm • Proteus "CNT" variant:  keeps long lists contiguous using relocations November 2009 Giorgos Margaritis - University of Ioannina, Greece 22,2 22,9 0 10 20 30 40 50 H.M. Wumpus Proteus milliseconds Short Terms: Average Retrieval Time 416 145 126 0 100 200 300 400 500 H.M. Wumpus Proteus Proteus CNT milliseconds Long Terms: Average Retrieval Time
  • 34. Inverted Index Construction • Proteus:  30% reduction in build time comparing to H.M. • Proteus CNT:  3% more time (14 min) to keep all lists contiguous November 2009 Giorgos Margaritis - University of Ioannina, Greece 730 514 528 0 100 200 300 400 500 600 700 800 H.M. Wumpus Proteus Proteus CNT minutes Index build time
  • 35. • Increase block size  increase range flushing time:  more postings per range block  more bytes transferred between memory and disk per flush  decrease long term flushing time:  same amount of bytes flushed to disk, fewer appends Sensitivity Analysis: Block size November 2009 Giorgos Margaritis - University of Ioannina, Greece Block size 0 200 400 600 800 2MB 4MB 8MB 16MB 32MB minutes Index Build Flush ranges Flush long terms Document parsing
  • 36. • Small values:  don’t create sufficient space to accumulate new postings • Large values:  flush ranges and long terms with very few postings   disk head movement overhead Flush memory November 2009 Giorgos Margaritis - University of Ioannina, Greece Flush Memory 0 100 200 300 400 500 600 2MB 5MB 10MB 20MB 40MB 80MB minutes Index Build Flush ranges Flush long terms Document parsing
  • 37. • Small values:  don’t create sufficient space to accumulate new postings • Large values:  flush ranges and long terms with very few postings   disk head movement overhead Flush memory November 2009 Giorgos Margaritis - University of Ioannina, Greece Flush Memory 0 100 200 300 400 500 600 2MB 5MB 10MB 20MB 40MB 80MB minutes Index Build Flush ranges Flush long terms Document parsing "Flush only a small percentage (2% - 3%) of total memory ..."
  • 38. • Small values:  underestimate cost of flushing ranges • Large values:  underestimate cost of flushing long terms Preference Factor November 2009 Giorgos Margaritis - University of Ioannina, Greece 0 100 200 300 400 500 600 1 2 3 4 8 16 minutes Index Build Flush ranges Flush long terms Document parsing Preference Factor
  • 39. • Small values:  underestimate cost of flushing ranges • Large values:  underestimate cost of flushing long terms Preference Factor November 2009 Giorgos Margaritis - University of Ioannina, Greece 0 100 200 300 400 500 600 1 2 3 4 8 16 minutes Index Build Flush ranges Flush long terms Document parsing Preference Factor "Postpone flushing of ranges, but not too much …"
  • 40. Conclusions & Future work • Fragmentation can harm retrieval times for short lists • Proteus:  manage inverted index in blocks  keep short lists contiguous, long lists in large fragments  memory full: selectively flush some postings  update only blocks that can be efficiently updated  30% reduction in build time  constant 15% increase in retrieval times of lists > 8MB  3% increase in build time to keep all lists contiguous  relatively insensitive to parameter values • Future work:  different block sizes for short and long terms, auto adjust parameter values based on dataset / hardware, different cost models for flushing Giorgos Margaritis - University of Ioannina, GreeceNovember 2009