Fun with Flexible Indexing
Mike McCandless, IBM
10/8/2010
1
Agenda
• Who am I?
• Motivation
• New flex APIs
• Codecs
• Wrap up
2
Your ideas will go further if you don’t insist on going with them.
Who am I?
• Committer, PMC member Lucene/Solr
• Co-author of Lucene in Action, 2nd edition
– LUCENEREV40 promo code!
• Blog: http://chbits.blogspot.com
• Emacs, Python lover
• Sponsored by IBM
3
Better to ask forgiveness than permission.
Motivation
• Lucene is showing its age
– vInt is costly
• Lucene is hard to change at low levels
– Index format is too rigid
• Yet, lots of innovation in the IR world...
– New compression formats, data structures,
scoring models, etc.
• IR researchers use other search engines
– Terrier, Lemur/Indri, MG4J, etc.
4
Actions speak louder than words.
An example: omitTFAP
• Added in version 2.4
• Turns off positions, termFreq
• 50 KB patch, 25 core source files!
• Follow-on (LUCENE-2048) still open...
• This was a simple change!
– What about harder changes, eg better encoding?
• Yes, devs can make these changes... but
that’s not good enough
5
If you’re not making mistakes, you’re not trying hard enough.
Motivation
• Goal 1: make innovation easy(ier)
– You shouldn’t have to be a rocket scientist to try
out new ideas
– But: can’t lose performance
• Goal 2: innovate
– Catch up to state-of-the-art in IR world
6
Agenda
• Who am I?
• Motivation
• New flex APIs
• Codecs
• Wrap up
7
Inverted Index 101
8
[Figure: inverted index. Fields (body, title) map to terms (open, pod, door, bay, hal, sweet); each term maps to doc IDs (3, 7, 14, 19, ...); each doc ID maps to positions (5, 11, 22, ...), each with a payload.]
SortedMap<Field,
  SortedMap<Term,
    List<Doc ID,
      List<Pos, Payload>>>>
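The nested-map view of the inverted index can be sketched in plain Java. This is a toy in-memory model only, not how Lucene stores postings: the class and method names (`ToyInvertedIndex`, `add`, `positions`, `docs`) are hypothetical, tokenization is a simple whitespace split, and payloads are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

/** Toy model (hypothetical names): field -> term -> docID -> positions. */
final class ToyInvertedIndex {
    private final SortedMap<String, SortedMap<String, SortedMap<Integer, List<Integer>>>> fields =
        new TreeMap<>();

    /** Splits text on whitespace and records each token's position in the field. */
    void add(int docID, String field, String text) {
        String[] tokens = text.trim().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            fields.computeIfAbsent(field, f -> new TreeMap<>())
                  .computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                  .computeIfAbsent(docID, d -> new ArrayList<>())
                  .add(pos);
        }
    }

    /** Returns the sorted doc IDs that contain term in field. */
    List<Integer> docs(String field, String term) {
        SortedMap<String, SortedMap<Integer, List<Integer>>> terms = fields.get(field);
        if (terms == null || !terms.containsKey(term)) {
            return Collections.emptyList();
        }
        return new ArrayList<>(terms.get(term).keySet());
    }

    /** Returns the positions of term within one document's field, or an empty list. */
    List<Integer> positions(String field, String term, int docID) {
        SortedMap<String, SortedMap<Integer, List<Integer>>> terms = fields.get(field);
        if (terms == null || !terms.containsKey(term)) {
            return Collections.emptyList();
        }
        return terms.get(term).getOrDefault(docID, Collections.emptyList());
    }
}
```

The four nesting levels line up with the four enum levels of the flex API: fields, terms, docs, positions.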
Don’t trade your passion for glory.
Flex overview
• 4.0 (trunk) only!
• New low-level postings enum API
• Pluggable, per-segment codec has full
control over reading/writing postings
– Building blocks make it easy to create your own
– Some neat codecs!
• Performance gains
– Much less RAM used
– Faster queries, filters
9
Flex is very low level
10
[Figure: layer stack, top to bottom: Content / Users; Indexing / Searching; Flex APIs; Codec; Disk.]
If two people always agree, one is not necessary.
4D enum API
• Fields, FieldsEnum
– field
• Terms, TermsEnum
– term, docFreq, ord
• DocsEnum
– docID, freq
• DocsAndPositionsEnum
– docID, freq, position, payload
• All enums allow custom attrs
11
Absolute power corrupts absolutely.
API: TermsEnum
• Iterates through all unique terms
– Separates terms from field
• Each term is opaque, fully binary
– BytesRef (slices a byte[])
– New analysis attr provides BytesRef per token
– Collation, numeric fields can use full term space
• Char terms can use any encoding
– Default is UTF8 (some queries rely on this)
– Others are possible (eg BOCU1, LUCENE-1799)
12
Life is about the journey, not the destination.
API: TermsEnum
• You can now re-seek an existing TermsEnum
• Seek gives explicit return result
– FOUND, NOT_FOUND, END
• Ord, seek-by-ord (optional, only for segment)
• Enables seek-intensive queries
– Eg AutomatonQuery
– FuzzyQuery is much faster for N=1,2!
– New automaton spell-checker also uses
FuzzyTermsEnum (LUCENE-2507)
13
There is no security on this earth; only opportunity.
API: TermsEnum
• Term sort order is determined by codec
– Comparator<BytesRef> getComparator()
• Core codecs use unsigned byte[] order
– Unicode code point if byte[] is UTF8
• If you change this, some queries won’t work!
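To see why unsigned byte[] order over UTF8 is not the same as Java's native String order, a small standalone sketch may help. The class `TermOrderSketch` and its methods are hypothetical helpers, not Lucene code. It compares terms byte by byte as unsigned values; `String.compareTo` instead compares UTF-16 code units, which sorts supplementary characters (encoded as surrogate pairs) before some BMP characters.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: unsigned byte[] comparison of UTF-8 encoded terms. */
final class TermOrderSketch {
    /** Compares as unsigned bytes; when one term is a prefix of the other, the shorter sorts first. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Encodes a String term as UTF-8 bytes. */
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

For example, U+FFFD sorts before U+1F600 in UTF8 byte order (and in code point order), but after it under `String.compareTo`, because the surrogate pair for U+1F600 starts with the code unit 0xD83D.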
14
Happiness = expectations minus reality.
FieldCache improvements
• FieldCache consumes the flex APIs
• Terms / terms index field cache is more RAM
efficient, with low GC load
– Used with SortField.STRING
• Shared byte[] blocks instead of separate
String instances
– Terms remain as byte[]
• Packed ints for ords, addresses
• RAM reduction ~40-60%
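The shared byte[] idea can be illustrated with a minimal sketch: every term is appended to one array, and a second array records where each term starts, so there is one large object instead of one String per term. The class `SharedBytesTermsSketch` is hypothetical, and it uses a plain int[] for the offsets where the slide describes packed ints.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: all terms live in one byte[]; an int[] marks where each one starts. */
final class SharedBytesTermsSketch {
    private final byte[] bytes;   // concatenated UTF-8 bytes of every term
    private final int[] offsets;  // term i occupies bytes[offsets[i] .. offsets[i + 1])

    SharedBytesTermsSketch(String[] terms) {
        byte[][] encoded = new byte[terms.length][];
        int total = 0;
        for (int i = 0; i < terms.length; i++) {
            encoded[i] = terms[i].getBytes(StandardCharsets.UTF_8);
            total += encoded[i].length;
        }
        bytes = new byte[total];
        offsets = new int[terms.length + 1];
        int upto = 0;
        for (int i = 0; i < terms.length; i++) {
            offsets[i] = upto;
            System.arraycopy(encoded[i], 0, bytes, upto, encoded[i].length);
            upto += encoded[i].length;
        }
        offsets[terms.length] = upto;
    }

    /** Decodes term number ord to a String (for display only; lookups could stay on bytes). */
    String term(int ord) {
        int start = offsets[ord];
        return new String(bytes, start, offsets[ord + 1] - start, StandardCharsets.UTF_8);
    }

    /** Total bytes held in the shared block. */
    int totalBytes() {
        return bytes.length;
    }
}
```

The garbage collector then tracks two arrays rather than one object per term, which is the kind of GC load reduction the slide refers to.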
15
The best way to learn is to do.
API: Docs/AndPositionsEnum
• API very similar to 3.x
– Still extends DISI
• TermsEnum provides Docs/AndPositionsEnum
• Bulk read API exists but still in flux
(LUCENE-1410)
• You provide the skip docs
– Deleted docs are no longer silently skipped
16
Fish for someone, they eat for a day. Teach them to
fish, they eat for a lifetime.
Custom skip docs
• IndexReader provides .getDeletedDocs
– Replaces .isDeleted
• Queries pass the deleted docs
– But you can customize!
• Example: FilterIndexReader subclass
– Apply random-access filter “down low”
– ~40-130% gain for many queries, 50% filter
– LUCENE-1536 is the real fix
– http://s.apache.org/PNA
17
Agenda
• Who am I?
• Motivation
• New flex APIs
• Codecs
• Wrap up
18
Sweet are the uses of adversity.
What’s really in a codec?
• Codec provides read/write for one segment
– Unique name (String)
– FieldsConsumer (for writing)
– FieldsProducer is 4D enum API + close
• CodecProvider creates Codec instance
– Passed to IndexWriter/Reader
• You can override merging
• Reusable building blocks
– Terms dict + index, Postings
19
Always under-promise and over-deliver.
Testing Codecs
• All unit tests now randomly swap codecs
• If you hit a random test failure, please post to
dev, including random seed
• Easily test your own codec!
20
Don’t attribute to malice that which can be otherwise explained.
Standard codec
• Default codec
– On upgrade, newly written segments use this
• Terms dict: PrefixCodedTerms
• Terms index: FixedGapTermsIndex
• Postings: StandardPostingsWriter/Reader
– Same vInt encoding as 3.x
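For readers unfamiliar with vInt: it stores an int in 1 to 5 bytes, 7 data bits per byte, with the high bit set on every byte except the last. The standalone sketch below follows that scheme; the class `VIntSketch` is hypothetical and works on byte arrays rather than Lucene's index streams.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical standalone vInt encoder/decoder (7 data bits per byte, high bit = "more follows"). */
final class VIntSketch {
    /** Encodes value into 1 to 5 bytes, least significant 7 bits first. */
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // low 7 bits plus continuation flag
            value >>>= 7;
        }
        out.write(value); // final byte, continuation flag clear
        return out.toByteArray();
    }

    /** Decodes one vInt from the start of bytes. */
    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // last byte of this vInt
            }
            shift += 7;
        }
        return value;
    }
}
```

The decoder has to test a flag on every byte before it knows whether to continue, which is likely part of what the Motivation slide means by "vInt is costly".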
21
Imagination is more important than knowledge.
PrefixCodedTerms
• Terms dict
• Responsible for Fields/Enum, Terms/Enum
– Maps term to byte[], docFreq, file offsets
• Shared prefix of adjacent terms is trimmed
• Pluggable terms index, postings impl
• Format
– Separate sections per-field
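Trimming the shared prefix of adjacent terms can be sketched as follows. This is an illustration of the general technique, not the PrefixCodedTerms file format: the class `PrefixCodingSketch` and its `Entry` type are hypothetical, and it works on Java Strings for readability where the real terms dict works on bytes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of prefix coding over a sorted term list. */
final class PrefixCodingSketch {
    /** One coded term: how many leading chars it shares with the previous term, plus the rest. */
    static final class Entry {
        final int prefixLen;
        final String suffix;

        Entry(int prefixLen, String suffix) {
            this.prefixLen = prefixLen;
            this.suffix = suffix;
        }
    }

    /** Encodes each term relative to the term before it. */
    static List<Entry> encode(List<String> sortedTerms) {
        List<Entry> out = new ArrayList<>();
        String prev = "";
        for (String term : sortedTerms) {
            int max = Math.min(prev.length(), term.length());
            int shared = 0;
            while (shared < max && prev.charAt(shared) == term.charAt(shared)) {
                shared++;
            }
            out.add(new Entry(shared, term.substring(shared)));
            prev = term;
        }
        return out;
    }

    /** Rebuilds the full terms by replaying the entries in order. */
    static List<String> decode(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (Entry e : entries) {
            String term = prev.substring(0, e.prefixLen) + e.suffix;
            out.add(term);
            prev = term;
        }
        return out;
    }
}
```

Because terms are sorted, neighbors tend to share long prefixes, so "doors" after "door" costs one suffix character plus a small length.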
22
The reasonable person adapts himself to the world...
FixedGapTermsIndex
• Every Nth term is indexed
– Loaded fully into RAM
• RAM image is written at indexing time
– Very fast reader init, low GC load
– Parallel arrays instead of instance per term
• Index term points to edge between terms
– Vs 3.x where index term was a full entry
• Useless suffix removal
– a, abracadabra
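The "every Nth term is indexed" lookup can be sketched like this: binary-search the small in-RAM index for the last indexed term at or before the target, then scan at most N terms of the full dictionary from there. The class `FixedGapIndexSketch` is hypothetical; it keeps the whole dictionary in an array as a stand-in for the on-disk terms dict, and it does not model useless suffix removal.

```java
import java.util.Arrays;

/** Hypothetical sketch of a fixed-gap terms index over a sorted term array. */
final class FixedGapIndexSketch {
    private final String[] terms;    // full sorted terms dict (stand-in for the on-disk dict)
    private final String[] indexed;  // every gap-th term, the part held in RAM
    private final int gap;

    FixedGapIndexSketch(String[] sortedTerms, int gap) {
        this.terms = sortedTerms;
        this.gap = gap;
        this.indexed = new String[(sortedTerms.length + gap - 1) / gap];
        for (int i = 0; i < indexed.length; i++) {
            indexed[i] = sortedTerms[i * gap];
        }
    }

    /** Returns the ordinal of term in the dict, or -1 if it is absent. */
    int lookup(String term) {
        int slot = Arrays.binarySearch(indexed, term);
        if (slot < 0) {
            slot = -slot - 2; // insertion point minus one: last indexed term <= target
        }
        if (slot < 0) {
            return -1; // target sorts before the first term
        }
        int start = slot * gap;
        int end = Math.min(start + gap, terms.length);
        for (int ord = start; ord < end; ord++) {
            int cmp = terms[ord].compareTo(term);
            if (cmp == 0) {
                return ord;
            }
            if (cmp > 0) {
                break; // passed the spot where the term would be
            }
        }
        return -1;
    }
}
```

A smaller gap means more indexed terms in RAM but a shorter scan per lookup, which is the trade-off behind choosing the index gap.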
23
...the unreasonable one persists in trying to adapt the
world to himself...
FixedGapTermsIndex
• Much better RAM/GC efficiency
• HathiTrust terms index
– 22.2 M indexed terms
– 3.x: 3974 MB RAM, 72.8 sec to load
– 4.0: 401 MB RAM, 2.2 sec to load
– 9.9X less RAM, 33X faster
• Wikipedia 3.8X less RAM
– http://s.apache.org/OWK
• Default terms index gap changed 128 -> 32
24
...therefore all progress depends on the unreasonable person.
PreFlex codec
• Reads 3.x index format
• Read-only!
– Except: tests swap in a read/write version
• Surrogates dance dynamically reorders
UTF16 sort order to Unicode
– Sophisticated backwards compatibility layer!
25
Progress not perfection.
Pulsing codec
• Inlines low doc-freq terms into terms dict
• Saves extra seek to get the postings
• Excellent match for primary key fields, but
also “normal” field (Zipf’s law)
• Wraps any other codec
• Likely default codec will use Pulsing
• http://s.apache.org/JX3
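The inlining idea can be shown with a toy model: a term whose docFreq is at or below a cutoff keeps its doc IDs in the terms dict entry itself, while any other term stores a pointer into the wrapped codec's postings. Everything here (`PulsingSketch`, `TermEntry`, the `seeks` counter, the cutoff parameter) is a hypothetical illustration, not the Pulsing codec's actual classes or file layout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical toy model of inlining low doc-freq postings into the terms dict. */
final class PulsingSketch {
    /** Terms-dict entry: postings are either inlined or referenced by pointer. */
    static final class TermEntry {
        final int[] inlinedDocs;    // non-null when docFreq <= cutoff
        final int postingsPointer;  // index into wrappedPostings, or -1 when inlined

        TermEntry(int[] inlinedDocs, int postingsPointer) {
            this.inlinedDocs = inlinedDocs;
            this.postingsPointer = postingsPointer;
        }
    }

    private final SortedMap<String, TermEntry> termsDict = new TreeMap<>();
    private final List<int[]> wrappedPostings = new ArrayList<>(); // stand-in for the wrapped codec
    private final int cutoff;
    int seeks = 0; // counts simulated trips to the wrapped postings

    PulsingSketch(int cutoff) {
        this.cutoff = cutoff;
    }

    /** Adds a term, inlining its postings when docFreq is at or below the cutoff. */
    void addTerm(String term, int[] docs) {
        if (docs.length <= cutoff) {
            termsDict.put(term, new TermEntry(docs, -1));
        } else {
            wrappedPostings.add(docs);
            termsDict.put(term, new TermEntry(null, wrappedPostings.size() - 1));
        }
    }

    /** Returns the doc IDs for term; only non-inlined terms cost an extra seek. */
    int[] docs(String term) {
        TermEntry e = termsDict.get(term);
        if (e == null) {
            return new int[0];
        }
        if (e.inlinedDocs != null) {
            return e.inlinedDocs;
        }
        seeks++;
        return wrappedPostings.get(e.postingsPointer);
    }
}
```

A primary key term has docFreq 1, so with any cutoff of 1 or more its lookup never leaves the terms dict.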
26
Pulsing codec speedup
27
Holding a grudge is like swallowing poison and waiting for
the other person to die.
SimpleText codec
• All postings stored in _X.pst text file
• Read / write
• Not performant
– Do not use in production!
• Fully functional
– Passes all Lucene/Solr unit tests (slowly...)
• Useful/fun for debugging
• http://s.apache.org/eh
28
SimpleText codec
29
field body
  term bay
    doc 0
      pos 3
  term doors
    doc 0
      pos 4
  term hal
    doc 0
      pos 5
  term open
    doc 0
      pos 0
  term pod
    doc 0
      pos 2
  term the
    doc 0
      pos 1
END
Fool me once, shame on you...
Int block codec
• Abstract codec
– Tests define Mock variable & fixed, with random
block sizes
• Encodes doc, frq, pos using block codecs
– Encoding/decoding block of ints at once
• Fixed & variable blocks
• Easy to use: define flushBlock, readBlock
• Seek point requires pointer and block offset
30
Fool me twice, shame on me.
FOR/PFOR codec
• Subclasses FixedIntBlock codec
• FOR (frame of reference) = packed ints
– eg: 1, 7, 3, 5, 2, 2, 5 needs only 3 bits per value
• PFOR adds exception handling
– eg: 1, 7, 3, 5, 293, 2, 2, 5 encodes 293 as vInt
• Not committed yet (LUCENE-1410)
• Initial results: ~20-40% speedup for many
queries
• http://s.apache.org/lw
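The frame-of-reference arithmetic on this slide can be checked with a small sketch: find the bit width of the largest value in a block, then pack every value at that width. The class `ForSketch` is hypothetical and is not the LUCENE-1410 patch; it assumes non-negative values and does not implement PFOR's exceptions, though the second example shows why they help.

```java
/** Hypothetical sketch of frame-of-reference packing for a block of non-negative ints. */
final class ForSketch {
    /** Bits needed for the largest value; OR-ing all values keeps the highest set bit. */
    static int bitsRequired(int[] values) {
        int or = 0;
        for (int v : values) {
            or |= v;
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(or));
    }

    /** Packs each value into bitsPerValue bits (1 to 32), low bits first, across a long[]. */
    static long[] pack(int[] values, int bitsPerValue) {
        long[] blocks = new long[(values.length * bitsPerValue + 63) / 64];
        long mask = (1L << bitsPerValue) - 1;
        for (int i = 0; i < values.length; i++) {
            long bitPos = (long) i * bitsPerValue;
            int block = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = values[i] & mask;
            blocks[block] |= v << shift;
            if (shift + bitsPerValue > 64) {
                blocks[block + 1] |= v >>> (64 - shift); // value straddles two longs
            }
        }
        return blocks;
    }

    /** Reverses pack for count values of bitsPerValue bits each. */
    static int[] unpack(long[] blocks, int count, int bitsPerValue) {
        int[] values = new int[count];
        long mask = (1L << bitsPerValue) - 1;
        for (int i = 0; i < count; i++) {
            long bitPos = (long) i * bitsPerValue;
            int block = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = blocks[block] >>> shift;
            if (shift + bitsPerValue > 64) {
                v |= blocks[block + 1] << (64 - shift);
            }
            values[i] = (int) (v & mask);
        }
        return values;
    }
}
```

With the slide's first block (1, 7, 3, 5, 2, 2, 5), `bitsRequired` gives 3. Adding 293 pushes it to 9 bits for every value, which is the cost PFOR avoids by storing the outlier as an exception.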
31
Life is a series of one-way doors; pick yours carefully.
Other Codecs
• PerFieldCodecWrapper
• AppendingCodec
– Never rewinds a file pointer during write
• TeeSinkCodec
– Write postings to multiple destinations
• FilteringCodec
– Filter postings as they are written
• YourCodecGoesHereSoon
32
Agenda
• Who am I?
• Motivation
• New flex APIs
• Codecs
• Wrap up
33
The first investment is yourself.
Some ideas to try
• In-memory postings
– Maybe only terms dict, select postings, etc.
• Variable-gap terms index
– Add indexed term if docFreq > N
– Good for noisy terms (eg, OCR)
• DFA/trie/FST as terms dict/index
• Finer omitTFAP (OmitTF, OmitP, per-term)
• Block-encoding for terms dict sections
34
Only the paranoid survive.
Still to do
• Performance bottleneck of int block codecs
• Codec should include norms, stored fields,
term vectors (LUCENE-2621)
• Enable serialization of attrs
• Switch to default hybrid (Pulsing, Standard,
PForDelta) codec
• Expose codec configuration in Solr
35
Summary
• New 4D postings enum APIs
• Pluggable codec lets you customize index
format
– Many codecs already available
• Goal 1 is realized: innovation is easy(ier)!
– Exciting time for Lucene...
• Goal 2 is in progress...
• Sizable performance gains, RAM/GC
reduction coming in 4.0
36
Questions?
37
Backup
38
Composite vs atomic readers
• Lucene has aggressively moved to “per
segment” search, starting at 2.9
• Flex furthers this!
• Best to work directly with sub-readers
– Use direct flex APIs, eg reader.fields(), for this
• If you must operate on composite reader...
– Use MultiFields.getFields(reader), or
– SlowMultiReaderWrapper.wrap
– Beware performance hit!
39
Code: visit docs containing a term
40
Fields fields = reader.fields();
Terms terms = fields.terms("body");
TermsEnum iter = terms.iterator();
// Seek to the exact term; FOUND means it exists in this field
if (iter.seek(new BytesRef("pod")) == SeekStatus.FOUND) {
  // Passing null supplies no skip docs, so deleted docs are visited too
  DocsEnum docs = iter.docs(null);
  int docID;
  while ((docID = docs.nextDoc()) != DocsEnum.NO_MORE_DOCS) {
    ...
  }
}
41
Explore more about Flexible Indexing at
www.lucidimagination.com

Mais conteúdo relacionado

Mais procurados

Program Structure in GNU/Linux (ELF Format)
Program Structure in GNU/Linux (ELF Format)Program Structure in GNU/Linux (ELF Format)
Program Structure in GNU/Linux (ELF Format)Varun Mahajan
 
Early Experiences with the OpenMP Accelerator Model
Early Experiences with the OpenMP Accelerator ModelEarly Experiences with the OpenMP Accelerator Model
Early Experiences with the OpenMP Accelerator ModelChunhua Liao
 
Cvpr2010 open source vision software, intro and training part vi robot operat...
Cvpr2010 open source vision software, intro and training part vi robot operat...Cvpr2010 open source vision software, intro and training part vi robot operat...
Cvpr2010 open source vision software, intro and training part vi robot operat...zukun
 
Programming with \'C\'
Programming with \'C\'Programming with \'C\'
Programming with \'C\'bdmsts
 
Practical Malware Analysis: Ch 7: Analyzing Malicious Windows Programs
Practical Malware Analysis: Ch 7: Analyzing Malicious Windows Programs Practical Malware Analysis: Ch 7: Analyzing Malicious Windows Programs
Practical Malware Analysis: Ch 7: Analyzing Malicious Windows Programs Sam Bowne
 
Pré Descobrimento Do Brasil
Pré Descobrimento Do BrasilPré Descobrimento Do Brasil
Pré Descobrimento Do Brasilecsette
 
Page List & Sample Material (Repaired)
Page List & Sample Material (Repaired)Page List & Sample Material (Repaired)
Page List & Sample Material (Repaired)Muhammad Haseeb Shahid
 
P4 P Update January 2009
P4 P Update January 2009P4 P Update January 2009
P4 P Update January 2009vsainteluce
 

Mais procurados (20)

Program Structure in GNU/Linux (ELF Format)
Program Structure in GNU/Linux (ELF Format)Program Structure in GNU/Linux (ELF Format)
Program Structure in GNU/Linux (ELF Format)
 
The Internals of "Hello World" Program
The Internals of "Hello World" ProgramThe Internals of "Hello World" Program
The Internals of "Hello World" Program
 
Early Experiences with the OpenMP Accelerator Model
Early Experiences with the OpenMP Accelerator ModelEarly Experiences with the OpenMP Accelerator Model
Early Experiences with the OpenMP Accelerator Model
 
Eusecwest
EusecwestEusecwest
Eusecwest
 
Simulating TUM Drone 2.0 by ROS
Simulating TUM Drone 2.0  by ROSSimulating TUM Drone 2.0  by ROS
Simulating TUM Drone 2.0 by ROS
 
Cvpr2010 open source vision software, intro and training part vi robot operat...
Cvpr2010 open source vision software, intro and training part vi robot operat...Cvpr2010 open source vision software, intro and training part vi robot operat...
Cvpr2010 open source vision software, intro and training part vi robot operat...
 
Programming with \'C\'
Programming with \'C\'Programming with \'C\'
Programming with \'C\'
 
Crosslingual search-engine
Crosslingual search-engineCrosslingual search-engine
Crosslingual search-engine
 
Linux Programming
Linux ProgrammingLinux Programming
Linux Programming
 
Practical Malware Analysis: Ch 7: Analyzing Malicious Windows Programs
Practical Malware Analysis: Ch 7: Analyzing Malicious Windows Programs Practical Malware Analysis: Ch 7: Analyzing Malicious Windows Programs
Practical Malware Analysis: Ch 7: Analyzing Malicious Windows Programs
 
Linux-Internals-and-Networking
Linux-Internals-and-NetworkingLinux-Internals-and-Networking
Linux-Internals-and-Networking
 
Pré Descobrimento Do Brasil
Pré Descobrimento Do BrasilPré Descobrimento Do Brasil
Pré Descobrimento Do Brasil
 
Linux device drivers
Linux device drivers Linux device drivers
Linux device drivers
 
Lzw algorithm
Lzw algorithmLzw algorithm
Lzw algorithm
 
Page List & Sample Material (Repaired)
Page List & Sample Material (Repaired)Page List & Sample Material (Repaired)
Page List & Sample Material (Repaired)
 
Embedded linux network device driver development
Embedded linux network device driver developmentEmbedded linux network device driver development
Embedded linux network device driver development
 
Ebay News 2001 4 19 Earnings
Ebay News 2001 4 19 EarningsEbay News 2001 4 19 Earnings
Ebay News 2001 4 19 Earnings
 
Ebay News 2000 10 19 Earnings
Ebay News 2000 10 19 EarningsEbay News 2000 10 19 Earnings
Ebay News 2000 10 19 Earnings
 
P4 P Update January 2009
P4 P Update January 2009P4 P Update January 2009
P4 P Update January 2009
 
Basic of java
Basic of javaBasic of java
Basic of java
 

Destaque

How The Guardian Embraced the Internet using Content, Search, and Open Source
How The Guardian Embraced the Internet using Content, Search, and Open SourceHow The Guardian Embraced the Internet using Content, Search, and Open Source
How The Guardian Embraced the Internet using Content, Search, and Open SourceLucidworks (Archived)
 
Lucene rev preso bialecki solr crawlers-lr
Lucene rev preso bialecki solr crawlers-lrLucene rev preso bialecki solr crawlers-lr
Lucene rev preso bialecki solr crawlers-lrLucidworks (Archived)
 

Destaque (6)

What’s New in Apache Lucene 2.9
What’s New in Apache Lucene 2.9What’s New in Apache Lucene 2.9
What’s New in Apache Lucene 2.9
 
How The Guardian Embraced the Internet using Content, Search, and Open Source
How The Guardian Embraced the Internet using Content, Search, and Open SourceHow The Guardian Embraced the Internet using Content, Search, and Open Source
How The Guardian Embraced the Internet using Content, Search, and Open Source
 
What’s New in Solr 1.4
What’s New in Solr 1.4What’s New in Solr 1.4
What’s New in Solr 1.4
 
Starting a search application
Starting a search applicationStarting a search application
Starting a search application
 
What’s New in Apache Lucene 2.9
What’s New in Apache Lucene 2.9What’s New in Apache Lucene 2.9
What’s New in Apache Lucene 2.9
 
Lucene rev preso bialecki solr crawlers-lr
Lucene rev preso bialecki solr crawlers-lrLucene rev preso bialecki solr crawlers-lr
Lucene rev preso bialecki solr crawlers-lr
 

Semelhante a Fun with flexible indexing

Pune-Cocoa: Blocks and GCD
Pune-Cocoa: Blocks and GCDPune-Cocoa: Blocks and GCD
Pune-Cocoa: Blocks and GCDPrashant Rane
 
Portable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialeckilucenerevolution
 
How to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldHow to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldMilo Yip
 
Parallel Computing - Lec 3
Parallel Computing - Lec 3Parallel Computing - Lec 3
Parallel Computing - Lec 3Shah Zaib
 
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetuprcmuir
 
Finite State Queries In Lucene
Finite State Queries In LuceneFinite State Queries In Lucene
Finite State Queries In Luceneotisg
 
Dictionary Based Annotation at Scale with Spark by Sujit Pal
Dictionary Based Annotation at Scale with Spark by Sujit PalDictionary Based Annotation at Scale with Spark by Sujit Pal
Dictionary Based Annotation at Scale with Spark by Sujit PalSpark Summit
 
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLPDictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLPSujit Pal
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
 
2.4 Optimizing your Visual COBOL Applications
2.4   Optimizing your Visual COBOL Applications2.4   Optimizing your Visual COBOL Applications
2.4 Optimizing your Visual COBOL ApplicationsMicro Focus
 
Get More Out of MongoDB with TokuMX
Get More Out of MongoDB with TokuMXGet More Out of MongoDB with TokuMX
Get More Out of MongoDB with TokuMXTim Callaghan
 
Linux operating system by Quontra Solutions
Linux operating system by Quontra SolutionsLinux operating system by Quontra Solutions
Linux operating system by Quontra SolutionsQUONTRASOLUTIONS
 
Top schools in noida
Top schools in noidaTop schools in noida
Top schools in noidaEdhole.com
 
6-9-2017-slides-vFinal.pptx
6-9-2017-slides-vFinal.pptx6-9-2017-slides-vFinal.pptx
6-9-2017-slides-vFinal.pptxSimRelokasi2
 
Lucene BootCamp
Lucene BootCampLucene BootCamp
Lucene BootCampGokulD
 
2015 bioinformatics python_introduction_wim_vancriekinge_vfinal
2015 bioinformatics python_introduction_wim_vancriekinge_vfinal2015 bioinformatics python_introduction_wim_vancriekinge_vfinal
2015 bioinformatics python_introduction_wim_vancriekinge_vfinalProf. Wim Van Criekinge
 

Semelhante a Fun with flexible indexing (20)

Flexible Indexing in Lucene 4.0
Flexible Indexing in Lucene 4.0Flexible Indexing in Lucene 4.0
Flexible Indexing in Lucene 4.0
 
Kafka overview v0.1
Kafka overview v0.1Kafka overview v0.1
Kafka overview v0.1
 
Pune-Cocoa: Blocks and GCD
Pune-Cocoa: Blocks and GCDPune-Cocoa: Blocks and GCD
Pune-Cocoa: Blocks and GCD
 
Portable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej BialeckiPortable Lucene Index Format & Applications - Andrzej Bialecki
Portable Lucene Index Format & Applications - Andrzej Bialecki
 
How to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldHow to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the World
 
Parallel Computing - Lec 3
Parallel Computing - Lec 3Parallel Computing - Lec 3
Parallel Computing - Lec 3
 
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
 
Finite State Queries In Lucene
Finite State Queries In LuceneFinite State Queries In Lucene
Finite State Queries In Lucene
 
Dictionary Based Annotation at Scale with Spark by Sujit Pal
Dictionary Based Annotation at Scale with Spark by Sujit PalDictionary Based Annotation at Scale with Spark by Sujit Pal
Dictionary Based Annotation at Scale with Spark by Sujit Pal
 
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLPDictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP
Dictionary based Annotation at Scale with Spark, SolrTextTagger and OpenNLP
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
2.4 Optimizing your Visual COBOL Applications
2.4   Optimizing your Visual COBOL Applications2.4   Optimizing your Visual COBOL Applications
2.4 Optimizing your Visual COBOL Applications
 
Solr 4
Solr 4Solr 4
Solr 4
 
Get More Out of MongoDB with TokuMX
Get More Out of MongoDB with TokuMXGet More Out of MongoDB with TokuMX
Get More Out of MongoDB with TokuMX
 
Linux operating system by Quontra Solutions
Linux operating system by Quontra SolutionsLinux operating system by Quontra Solutions
Linux operating system by Quontra Solutions
 
Top schools in noida
Top schools in noidaTop schools in noida
Top schools in noida
 
Rustbridge
RustbridgeRustbridge
Rustbridge
 
6-9-2017-slides-vFinal.pptx
6-9-2017-slides-vFinal.pptx6-9-2017-slides-vFinal.pptx
6-9-2017-slides-vFinal.pptx
 
Lucene BootCamp
Lucene BootCampLucene BootCamp
Lucene BootCamp
 
2015 bioinformatics python_introduction_wim_vancriekinge_vfinal
2015 bioinformatics python_introduction_wim_vancriekinge_vfinal2015 bioinformatics python_introduction_wim_vancriekinge_vfinal
2015 bioinformatics python_introduction_wim_vancriekinge_vfinal
 

Mais de Lucidworks (Archived)

Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...Lucidworks (Archived)
 
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and SolrLucidworks (Archived)
 
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for BusinessSFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for BusinessLucidworks (Archived)
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceLucidworks (Archived)
 
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineChicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineLucidworks (Archived)
 
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchChicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchLucidworks (Archived)
 
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache SolrMinneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache SolrLucidworks (Archived)
 
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchMinneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchLucidworks (Archived)
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Lucidworks (Archived)
 
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...Lucidworks (Archived)
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Lucidworks (Archived)
 
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCBig Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCLucidworks (Archived)
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCLucidworks (Archived)
 
Solr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DCSolr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DCLucidworks (Archived)
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCLucidworks (Archived)
 
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCTest Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCLucidworks (Archived)
 
Building a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKBuilding a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKLucidworks (Archived)
 

Mais de Lucidworks (Archived) (20)

Integrating Hadoop & Solr
Integrating Hadoop & SolrIntegrating Hadoop & Solr
Integrating Hadoop & Solr
 
The Data-Driven Paradigm
The Data-Driven ParadigmThe Data-Driven Paradigm
The Data-Driven Paradigm
 
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
 
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for BusinessSFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
 
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineChicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
 
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchChicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
 
What's new in solr june 2014
What's new in solr june 2014What's new in solr june 2014
What's new in solr june 2014
 
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache SolrMinneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
 
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchMinneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...



Fun with flexible indexing

  • 1. Fun with Flexible Indexing. Mike McCandless, IBM, 10/8/2010
  • 2. Agenda • Who am I? • Motivation • New flex APIs • Codecs • Wrap up
  • 3. Your ideas will go further if you don’t insist on going with them. Who am I? • Committer, PMC member Lucene/Solr • Co-author of Lucene in Action, 2nd edition – LUCENEREV40 promo code! • Blog: http://chbits.blogspot.com • Emacs, Python lover • Sponsored by IBM
  • 4. Better to ask forgiveness than permission. Motivation • Lucene is showing its age – vInt is costly • Lucene is hard to change at low levels – Index format is too rigid • Yet, lots of innovation in the IR world... – New compression formats, data structures, scoring models, etc. • IR researchers use other search engines – Terrier, Lemur/Indri, MG4J, etc.
  • 5. Actions speak louder than words. An example: omitTFAP • Added in version 2.4 • Turns off positions, termFreq • 50 KB patch, 25 core source files! • Follow-on (LUCENE-2048) still open... • This was a simple change! – What about harder changes, eg better encoding? • Yes, devs can make these changes... but that’s not good enough
  • 6. If you’re not making mistakes, you’re not trying hard enough. Motivation • Goal 1: make innovation easy(ier) – You shouldn’t have to be a rocket scientist to try out new ideas – But: can’t lose performance • Goal 2: innovate – Catch up to state-of-the-art in IR world
  • 7. Agenda • Who am I? • Motivation • New flex APIs • Codecs • Wrap up
  • 8. Inverted Index 101 [Diagram: nested levels labeled Field, Term, Doc ID, Positions (with payloads)] SortedMap<Field, SortedMap<Term, List<Doc ID, List<Pos, Payload>>>>
  • 9. Don’t trade your passion for glory. Flex overview • 4.0 (trunk) only! • New low-level postings enum API • Pluggable, per-segment codec has full control over reading/writing postings – Building blocks make it easy to create your own – Some neat codecs! • Performance gains – Much less RAM used – Faster queries, filters
  • 10. Flex is very low level [Diagram with boxes labeled Codec, Indexing, Searching, Disk, Flex APIs, Content, Users]
  • 11. If two people always agree, one is not necessary. 4D enum API • Fields, FieldsEnum – field • Terms, TermsEnum – term, docFreq, ord • DocsEnum – docID, freq • DocsAndPositionsEnum – docID, freq, position, payload • All enums allow custom attrs
  • 12. Absolute power corrupts absolutely. API: TermsEnum • Iterates through all unique terms – Separates terms from field • Each term is opaque, fully binary – BytesRef (slices a byte[]) – New analysis attr provides BytesRef per token – Collation, numeric fields can use full term space • Char terms can use any encoding – Default is UTF8 (some queries rely on this) – Others are possible (eg BOCU1, LUCENE-1799)
  • 13. Life is about the journey, not the destination. API: TermsEnum • You can now re-seek an existing TermsEnum • Seek gives explicit return result – FOUND, NOT_FOUND, END • Ord, seek-by-ord (optional, only for segment) • Enables seek-intensive queries – Eg AutomatonQuery – FuzzyQuery is much faster for N=1,2! – New automaton spell-checker also uses FuzzyTermsEnum (LUCENE-2507)
  • 14. There is no security on this earth; only opportunity. API: TermsEnum • Term sort order is determined by codec – Comparator<BytesRef> getComparator() • Core codecs use unsigned byte[] order – Unicode code point if byte[] is UTF8 • If you change this, some queries won’t work!
  • 15. Happiness = expectations minus reality. FieldCache improvements • FieldCache consumes the flex APIs • Terms / terms index field cache more RAM efficient, low GC load – Used with SortField.STRING • Shared byte[] blocks instead of separate String instances – Terms remain as byte[] • Packed ints for ords, addresses • RAM reduction ~40-60%
  • 16. The best way to learn is to do. API: Docs/AndPositionsEnum • API very similar to 3.x – Still extends DISI • TermsEnum provides Docs/AndPositionsEnum • Bulk read API exists but still in flux (LUCENE-1410) • You provide the skip docs – Deleted docs are no longer silently skipped
  • 17. Fish for someone, they eat for a day. Teach them to fish, they eat for a lifetime. Custom skip docs • IndexReader provides .getDeletedDocs – Replaces .isDeleted • Queries pass the deleted docs – But you can customize! • Example: FilterIndexReader subclass – Apply random-access filter “down low” – ~40-130% gain for many queries, 50% filter – LUCENE-1536 is the real fix – http://s.apache.org/PNA
  • 18. Agenda • Who am I? • Motivation • New flex APIs • Codecs • Wrap up
  • 19. Sweet are the uses of adversity. What’s really in a codec? • Codec provides read/write for one segment – Unique name (String) – FieldsConsumer (for writing) – FieldsProducer is 4D enum API + close • CodecProvider creates Codec instance – Passed to IndexWriter/Reader • You can override merging • Reusable building blocks – Terms dict + index, Postings
  • 20. Always under-promise and over-deliver. Testing Codecs • All unit tests now randomly swap codecs • If you hit a random test failure, please post to dev, including random seed • Easily test your own codec!
  • 21. Don’t attribute to malice that which can be otherwise explained. Standard codec • Default codec – On upgrade, newly written segments use this • Terms dict: PrefixCodedTerms • Terms index: FixedGapTermsIndex • Postings: StandardPostingsWriter/Reader – Same vInt encoding as 3.x
  • 22. Imagination is more important than knowledge. PrefixCodedTerms • Terms dict • Responsible for Fields/Enum, Terms/Enum – Maps term to byte[], docFreq, file offsets • Shared prefix of adjacent terms is trimmed • Pluggable terms index, postings impl • Format – Separate sections per-field
  • 23. The reasonable person adapts himself to the world... FixedGapTermsIndex • Every Nth term is indexed – Loaded fully into RAM • RAM image is written at indexing time – Very fast reader init, low GC load – Parallel arrays instead of instance per term • Index term points to edge between terms – Vs 3.x where index term was a full entry • Useless suffix removal – a, abracadabra
  • 24. ...the unreasonable one persists in trying to adapt the world to himself... FixedGapTermsIndex • Much better RAM/GC efficiency • HathiTrust terms index – 22.2 M indexed terms – 3.x: 3974 MB RAM, 72.8 sec to load – 4.0: 401 MB RAM, 2.2 sec to load – 9.9X less RAM, 33X faster • Wikipedia 3.8X less RAM – http://s.apache.org/OWK • Default terms index gap changed 128 -> 32
  • 25. ...therefore all progress depends on the unreasonable person. PreFlex codec • Reads 3.x index format • Read-only! – Except: tests swap in a read/write version • “Surrogates dance” dynamically reorders UTF16 sort order to Unicode (code point) order – Sophisticated backwards compatibility layer!
  • 26. Progress not perfection. Pulsing codec • Inlines low doc-freq terms into terms dict • Saves extra seek to get the postings • Excellent match for primary key fields, but also “normal” field (Zipf’s law) • Wraps any other codec • Likely default codec will use Pulsing • http://s.apache.org/JX3
  • 28. Holding a grudge is like swallowing poison and waiting for the other person to die. SimpleText codec • All postings stored in _X.pst text file • Read / write • Not performant – Do not use in production! • Fully functional – Passes all Lucene/Solr unit tests (slowly...) • Useful/fun for debugging • http://s.apache.org/eh
  • 29. SimpleText codec (sample file contents)
      field body
        term bay
          doc 0
            pos 3
        term doors
          doc 0
            pos 4
        term hal
          doc 0
            pos 5
        term open
          doc 0
            pos 0
        term pod
          doc 0
            pos 2
        term the
          doc 0
            pos 1
      END
  • 30. Fool me once, shame on you... Int block codec • Abstract codec – Tests define Mock variable & fixed, with random block sizes • Encodes doc, frq, pos using block codecs – Encoding/decoding block of ints at once • Fixed & variable blocks • Easy to use: define flushBlock, readBlock • Seek point requires pointer and block offset
  • 31. Fool me twice, shame on me. FOR/PFOR codec • Subclasses FixedIntBlock codec • FOR (frame of reference) = packed ints – eg: 1, 7, 3, 5, 2, 2, 5 needs only 3 bits per value • PFOR adds exception handling – eg: 1, 7, 3, 5, 293, 2, 2, 5 encodes 293 as vInt • Not committed yet (LUCENE-1410) • Initial results: ~20-40% speedup for many queries • http://s.apache.org/lw
  • 32. Life is a series of one-way doors; pick yours carefully. Other Codecs • PerFieldCodecWrapper • AppendingCodec – Never rewinds a file pointer during write • TeeSinkCodec – Write postings to multiple destinations • FilteringCodec – Filter postings as they are written • YourCodecGoesHereSoon
  • 33. Agenda • Who am I? • Motivation • New flex APIs • Codecs • Wrap up
  • 34. The first investment is yourself. Some ideas to try • In-memory postings – Maybe only terms dict, select postings, etc. • Variable-gap terms index – Add indexed term if docFreq > N – Good for noisy terms (eg, OCR) • DFA/trie/FST as terms dict/index • Finer omitTFAP (OmitTF, OmitP, per-term) • Block-encoding for terms dict sections
  • 35. Only the paranoid survive. Still to do • Performance bottleneck of int block codecs • Codec should include norms, stored fields, term vectors (LUCENE-2621) • Enable serialization of attrs • Switch to default hybrid (Pulsing, Standard, PForDelta) codec • Expose codec configuration in Solr
  • 36. Summary • New 4D postings enum APIs • Pluggable codec lets you customize index format – Many codecs already available • Goal 1 is realized: innovation is easy(ier)! – Exciting time for Lucene... • Goal 2 is in progress... • Sizable performance gains, RAM/GC reduction coming in 4.0
  • 39. Composite vs atomic readers • Lucene has aggressively moved to “per segment” search, starting at 2.9 • Flex furthers this! • Best to work directly with sub-readers – Use direct flex APIs, eg reader.fields(), for this • If you must operate on composite reader... – Use MultiFields.getFields(reader), or – SlowMultiReaderWrapper.wrap – Beware performance hit!
  • 40. Code: visit docs containing a term
      // Position a TermsEnum on the term "pod" in field "body"
      Fields fields = reader.fields();
      Terms terms = fields.terms("body");
      TermsEnum iter = terms.iterator();
      if (iter.seek(new BytesRef("pod")) == SeekStatus.FOUND) {
        // Walk every doc ID in that term's postings
        DocsEnum docs = iter.docs(null);
        int docID;
        while ((docID = docs.nextDoc()) != DocsEnum.NO_MORE_DOCS) {
          ...
        }
      }
  • 41. Explore more about Flexible Indexing at www.lucidimagination.com
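
To make the data model behind the slides concrete, here is a minimal, JDK-only sketch of the nested map from the "Inverted Index 101" slide (Field, then Term, then Doc ID, then Positions). It is a toy, not Lucene code: the class `ToyInvertedIndex` and all of its methods are hypothetical names invented for illustration. Payloads are omitted, and `String` ordering stands in for the unsigned byte[] order that the slides say core codecs use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model only: this is NOT Lucene code, and every name below is hypothetical.
class ToyInvertedIndex {
    // field -> term -> docID -> positions. TreeMap keeps each level sorted,
    // loosely mirroring SortedMap<Field, SortedMap<Term, List<Doc ID, List<Pos>>>>.
    private final SortedMap<String, SortedMap<String, SortedMap<Integer, List<Integer>>>> fields =
            new TreeMap<>();

    // Tokenize on whitespace and record the position of every token.
    void addDocument(int docID, String field, String text) {
        if (text.isBlank()) {
            return;
        }
        String[] tokens = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            fields.computeIfAbsent(field, f -> new TreeMap<>())
                  .computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                  .computeIfAbsent(docID, d -> new ArrayList<>())
                  .add(pos);
        }
    }

    // Terms of one field in sorted order (loosely analogous to walking a TermsEnum).
    List<String> terms(String field) {
        SortedMap<String, SortedMap<Integer, List<Integer>>> terms = fields.get(field);
        if (terms == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(terms.keySet());
    }

    // Doc IDs containing the term, ascending; empty when the field or term is absent.
    List<Integer> docs(String field, String term) {
        return new ArrayList<>(postings(field, term).keySet());
    }

    // Positions of the term inside one document; empty when there is no match.
    List<Integer> positions(String field, String term, int docID) {
        List<Integer> positions = postings(field, term).get(docID);
        if (positions == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(positions);
    }

    // Look up the postings (docID -> positions) for one field/term pair.
    private SortedMap<Integer, List<Integer>> postings(String field, String term) {
        SortedMap<String, SortedMap<Integer, List<Integer>>> terms = fields.get(field);
        if (terms == null) {
            return new TreeMap<>();
        }
        SortedMap<Integer, List<Integer>> postings = terms.get(term);
        if (postings == null) {
            return new TreeMap<>();
        }
        return postings;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument(0, "body", "open the pod bay doors hal");
        // Dump the field as term / doc / pos, similar in spirit to the SimpleText slide.
        for (String term : index.terms("body")) {
            for (int docID : index.docs("body", term)) {
                System.out.println("term " + term + " doc " + docID
                        + " pos " + index.positions("body", term, docID));
            }
        }
    }
}
```

Indexing the single document "open the pod bay doors hal" yields the terms in sorted order (bay, doors, hal, open, pod, the) at positions 3, 4, 5, 0, 2 and 1, which matches the SimpleText sample on slide 29. Calling `docs("body", "pod")` plays the role of the seek-then-iterate loop on slide 40, though a real codec streams postings from disk rather than holding maps in memory.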