code4lib 2011 preconference, presented by Erik Hatcher of Lucid Imagination.
Abstract: The library world is fired up about Solr. Practically every next-gen catalog is using it (via Blacklight, VuFind, or other technologies). Solr has continued improving in some dramatic ways, including geospatial support, field collapsing/grouping, extended dismax query parsing, pivot/grid/matrix/tree faceting, autosuggest, and more. This session will cover all of these new features, showcasing live examples of them all, including anything new that is implemented prior to the conference.
code4lib 2011 preconference: What's New in Solr (since 1.4.1)
1. What's New in Solr?
code4lib 2011 preconference
Bloomington, IN
presented by Erik Hatcher of Lucid Imagination
2. about me
spoke at several code4lib conferences:
keynoted Athens '07, along with the pioneering Solr preconference
Providence '09, "Rising Sun"
preconference at Asheville '10, "Solr Black Belt"
co-authored "Lucene in Action", first edition; ghost/toast on second edition
Lucene and Solr committer
library-world claims to fame: founding and naming Blacklight; original developer on
Collex and the Rossetti Archive search
now at Lucid Imagination, dedicated to Lucene/Solr support, services, training, etc.
4. LIA2 - Lucene in Action
Published: July 2010 - http://www.manning.com/lucene/
New in this second edition:
Performing hot backups
Using numeric fields
Tuning for indexing or searching speed
Boosting matches with payloads
Creating reusable analyzers
Adding concurrency with threads
Four new case studies, and more
5. Version Number
Which one ya talking 'bout, Willis?
3.1? 4.0?? TRUNK??
playing with fire
index format changes to be expected
reindexing recommended/required
Solr/Lucene merged development codebases
releases should occur lock-step moving forward
6. dependencies
November 2009: Solr 1.4 (Lucene 2.9.1)
June 2010: Solr 1.4.1 (Lucene 2.9.3)
Spring 2011(?): Solr 3.1 (Lucene 3.1)
TRUNK: Solr 4.x (Lucene TRUNK)
7. lucene
per-segment field cache, etc
Unicode and analysis improvements throughout
Analysis "attributes"
AutomatonQuery: RegexpQuery, WildcardQuery
flexible indexing
and so much more!
10. Standard tokenization
ClassicTokenizer: old StandardTokenizer
StandardTokenizer: now uses Unicode text
segmentation specified by UAX#29
UAX29URLEmailTokenizer
maxTokenLength: default=255
12. CollationKeyFilter
A filter that lets one specify:
A system collator associated with a locale, or
A collator based on custom rules
This can be used for changing sort order for non-English languages as well as
to modify the collation sequence for certain languages. You must use the same
CollationKeyFilter at both index-time and query-time for correct results. Also,
the JVM vendor and version (including patch version) of the slave should be exactly
the same as the master (or indexer) for consistent results.
http://wiki.apache.org/solr/UnicodeCollation
see also: ICUCollationKeyFilter
13. ICU
International Components for Unicode
ICUFoldingFilter
ICUNormalizer2Filter
name=nfc|nfkc|nfkc_cf
mode=compose|decompose
filter
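To give a feel for what the normalization forms named above do, here is a minimal sketch using Python's standard unicodedata module rather than ICU itself. The nfkc_cf branch is only an approximation and an assumption of this sketch: ICU's NFKC_Casefold also removes default ignorable code points, which this code does not do.

```python
import unicodedata

def normalize(text, name="nfc"):
    """Approximate the 'name' options of ICUNormalizer2Filter with the stdlib."""
    if name == "nfc":
        return unicodedata.normalize("NFC", text)
    if name == "nfkc":
        return unicodedata.normalize("NFKC", text)
    if name == "nfkc_cf":
        # Approximation (assumption): NFKC, case-fold, then NFKC again.
        folded = unicodedata.normalize("NFKC", text).casefold()
        return unicodedata.normalize("NFKC", folded)
    raise ValueError("unknown normalization form: " + name)

decomposed = "e\u0301"  # 'e' followed by a combining acute accent
ligature = "\ufb01"     # the single-character 'fi' ligature

print(normalize(decomposed, "nfc") == "\u00e9")  # composed into one code point
print(normalize(ligature, "nfkc"))               # compatibility mapping to 'fi'
print(normalize("Stra\u00dfe", "nfkc_cf"))       # case folding expands the sharp s
```

The point is that two strings that look identical can differ at the code point level, so the same normalization has to be applied at index time and query time.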
14. ICUFoldingFilter
Accent removal, case folding, canonical duplicates folding, dashes
folding, diacritic removal (including stroke, hook, descender), Greek letterforms
folding, Han Radical folding, Hebrew Alternates folding, Jamo folding,
Letterforms folding, Math symbol folding, Multigraph Expansions: All, Native
digit folding, No-break folding, Overline folding, Positional forms folding, Small
forms folding, Space folding, Spacing Accents folding, Subscript folding,
Superscript folding, Suzhou Numeral folding, Symbol folding, Underline folding,
Vertical forms folding, Width folding
Additionally, Default Ignorables are removed, and text is normalized to NFKC.
All foldings, case folding, and normalization mappings are applied recursively
to ensure a fully folded and normalized result.
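A few of the foldings listed above (accent removal, case folding, width folding) can be imitated with the Python standard library. This is a rough stand-in for illustration only, not ICUFoldingFilter: letters with strokes or hooks (for example ø) have no decomposition and pass through unchanged here, whereas the ICU filter folds them.

```python
import unicodedata

def fold(text):
    """Rough imitation of a few ICUFoldingFilter behaviours (not ICU itself)."""
    # Compatibility-decompose: accents become separate combining marks and
    # width or ligature variants collapse to their plain forms.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (accent and diacritic removal).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold, then leave the result in NFKC.
    return unicodedata.normalize("NFKC", stripped.casefold())

print(fold("R\u00e9sum\u00e9"))          # accented Latin letters
print(fold("\uff26\uff35\uff2c\uff2c"))  # fullwidth "FULL"
```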
15. ICUTransformFilter
id: specific transliterator identifier from ICU's
Transliterator#getAvailableIDs() (required)
direction=forward|reverse
Examples:
Traditional-Simplified: =>
Cyrillic-Latin: Российская Федерация =>
Rossijskaâ Federaciâ
19. spatial
JTeam's plugin: packaged for easy deployment
Solr trunk capabilities
many distance functions
What's missing?
geo faceting? scoring by distance? distance
pseudo-field?
All units in kilometers, unless otherwise specified
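As an illustration of the kind of distance function involved, here is a minimal haversine (great-circle) sketch that returns kilometers. The Earth radius constant and the function name are assumptions of this sketch, not values taken from Solr.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# One degree of longitude along the equator, roughly 111 km.
print(round(haversine_km(0.0, 0.0, 0.0, 1.0), 2))
```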
20. Spatial field types
Point: n-dimensional, must specify dimension
(default=2), represented by N subfields internally
LatLon: latitude,longitude, represented by two
subfields internally, single valued only
GeoHash: single string representation of lat/lon
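To show how a single string can stand in for a lat/lon pair, here is a sketch of the standard public geohash encoding: bits alternate between longitude and latitude, each bit halving the remaining range, and every five bits become one base-32 character. It is written from the general algorithm and is not Solr's own code.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=12):
    """Encode a lat/lon pair (degrees) as a geohash string of the given length."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

print(geohash_encode(42.6, -5.6, precision=5))
```

A useful property is that nearby points tend to share a prefix, so truncating the string gives a coarser cell that contains the original point.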
22. field collapsing/grouping
http://wiki.apache.org/solr/FieldCollapsing
group=true
group.field / group.func / group.query
rows / start: for groups, not documents
group.limit: number of results per group
group.offset: offset into doclist of each group
sort: how to sort groups, by top document in each group
group.sort: how to sort docs within each group
group.format: grouped | simple
group.main=true|false: backwards compatibility mode?
faceting works as normal
not distributed savvy yet
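A grouped request built from the parameters above might look like the following sketch. The base URL, the query, and the field name manu_exact are placeholders chosen for illustration; only the parameter names come from the slide.

```python
from urllib.parse import urlencode

def grouped_query_url(base_url, query, group_field, groups=10, docs_per_group=3):
    """Build a Solr select URL that groups results on one field."""
    params = [
        ("q", query),
        ("group", "true"),
        ("group.field", group_field),
        ("rows", groups),                 # rows counts groups, not documents
        ("group.limit", docs_per_group),  # documents returned per group
        ("group.sort", "score desc"),     # ordering inside each group
        ("wt", "json"),
    ]
    return base_url + "?" + urlencode(params)

# Hypothetical host and field name.
print(grouped_query_url("http://localhost:8983/solr/select", "ipod", "manu_exact"))
```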
24. {!raw|term|field f=$f}...
Recall why we needed {!raw} from last year
<fieldType = .../> - use one string, one numeric, (and one text?)
<field name="..."/>
table for numeric and for string (and text?):
{!raw f=$f} | TermQuery(...)
{!term f=$f} | ...
{!field f=$f} | ...
Which to use when? {!raw} works for strings just fine, but best to migrate to the generally
safer/wiser {!term} for future-proofing.
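A filter query using {!term} can be assembled as in this sketch. The field name and value are hypothetical. Because the whole value after the closing brace is handed to the parser verbatim, characters such as ':' or spaces need no query-syntax escaping, only normal URL encoding.

```python
from urllib.parse import urlencode

def term_filter(field, value):
    """Build an fq string that uses the {!term} query parser on one field."""
    return "{!term f=%s}%s" % (field, value)

# Hypothetical facet value containing characters that would otherwise need escaping.
fq = term_filter("category", "Books: Rare & Old")
print(fq)
print(urlencode({"q": "*:*", "fq": fq}))
```

This is the usual shape for a "click on a facet value" filter, where the value comes straight from the index and should match one exact term.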
26. dismax
q.op or schema.xml's <solrQueryParser defaultOperator="[AND|OR]"/> defaults mm to 0% (OR) or 100% (AND)
#code4lib: issues with non-analyzed fields in qf
27. edismax
Supports full lucene query syntax in the absence of syntax errors
supports "and"/"or" to mean "AND"/"OR" in lucene syntax mode
When there are syntax errors, improved smart partial escaping of special characters is done to prevent
them... in this mode, fielded queries, +/-, and phrase queries are still supported.
Improved proximity boosting via word bigrams... this prevents the problem of needing 100% of the words in
the document to get any boost, as well as having all of the words in a single field.
Advanced stopword handling... stopwords are not required in the mandatory part of the query but are still
used (if indexed) in the proximity boosting part. If a query consists of all stopwords (e.g. to be or not to be)
then all will be required.
Supports the "boost" parameter... like the dismax bf param, but multiplies the function query instead of
adding it in
Supports pure negative nested queries... so a query like +foo (-foo) will match all documents
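Putting a few of these features together, an edismax request could be parameterized as in the sketch below. The field names (title, text, popularity) and the boost function are assumptions for illustration.

```python
from urllib.parse import urlencode

def edismax_params(user_query, qf, boost_function=None):
    """Assemble request parameters for the extended dismax parser."""
    params = {
        "defType": "edismax",
        "q": user_query,      # may use full Lucene syntax, including fielded clauses
        "qf": " ".join(qf),   # fields to search, with optional ^boosts
    }
    if boost_function:
        # Multiplied into the score, unlike the additive dismax bf parameter.
        params["boost"] = boost_function
    return params

# Hypothetical fields and boost function.
params = edismax_params("title:solr and lucene", ["title^2", "text"],
                        boost_function="log(popularity)")
print(urlencode(params))
```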
28. function queries
termfreq, tf, docfreq, idf, norm, maxdoc, numdocs
{!func}termfreq(text,ipod)
standard java.lang.Math functions
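For a sense of what the term-statistics functions return: termfreq and docfreq are raw counts, while tf and idf go through the field's Similarity. The sketch below assumes Lucene's default similarity of that era (DefaultSimilarity); a custom Similarity would give different numbers.

```python
import math

def tf(freq):
    """tf factor under the assumed default similarity: square root of the raw count."""
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    """idf factor under the assumed default similarity: ln(numDocs / (docFreq + 1)) + 1."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0

# A term occurring 4 times in a document, and in 9 of 1000 documents overall.
print(tf(4))
print(round(idf(9, 1000), 3))
```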
29. faceting
per-segment, single-valued fields:
facet.method=fcs (field cache per segment)
facet.field={!threads=-1}field_name
threads=0: direct execution
threads=-1: thread per segment
speeds up single and multivalued method=fc, especially for deep paging with
facet.offset
date faceting improvements, generalized for numeric ranges too
can now exclude main query: q={!tag=main}the+query&facet.field={!ex=main}category
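The tag/exclude pairing can be assembled as in this sketch: the main query is tagged, and the facet field excludes that tag so its counts are computed as if the query were not applied. The tag name and field name follow the slide; the helper function itself is hypothetical.

```python
from urllib.parse import urlencode

def facet_params(user_query, facet_field, tag="main"):
    """Tag the main query and exclude that tag when counting one facet field."""
    return [
        ("q", "{!tag=%s}%s" % (tag, user_query)),
        ("facet", "true"),
        ("facet.field", "{!ex=%s}%s" % (tag, facet_field)),
    ]

params = facet_params("the query", "category")
print(urlencode(params))
```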
40. sort
by function
&q=*:*&sfield=store&pt=39.194564,-86.432947&sort=geodist() asc
but still can't get value of function back
unless you force it to be the score somehow
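Both variants can be sketched as request parameters. The first sorts by geodist() as on the slide. The second is one possible way to "force it to be the score": make the distance function the main query and move the user's query into a filter. Treat that second form as an assumption to verify against your Solr version.

```python
from urllib.parse import urlencode

def distance_sorted_params(sfield, lat, lon, query="*:*", distance_as_score=False):
    """Build parameters that order results by distance from a point."""
    params = [("sfield", sfield), ("pt", "%s,%s" % (lat, lon))]
    if distance_as_score:
        # Assumed workaround: the function becomes the query, so score is the distance.
        params += [("q", "{!func}geodist()"), ("fq", query),
                   ("fl", "*,score"), ("sort", "score asc")]
    else:
        params += [("q", query), ("sort", "geodist() asc")]
    return params

plain = distance_sorted_params("store", 39.194564, -86.432947)
scored = distance_sorted_params("store", 39.194564, -86.432947, distance_as_score=True)
print(urlencode(plain))
print(urlencode(scored))
```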
56. Q&A: faceting
why is paging through facets the way it is?
short-circuits on enum
57. Community:
- The state of Extended DisMax, and what Lucene features
remain incompatible with it.
- Any developments on faceting (I've implemented the
standard workaround to the "unknown facet list size"
problem... but I'd still love to be able to know exactly how
long the lists are)
- Hierarchical documents in Solr -- I haven't followed the
conversations closely, but I gather that this topic is gaining
some momentum in the Solr community.
58. contact info
erik.hatcher @ lucidimagination . com
http://www.lucidimagination.com
webinars, documentation
LucidFind: search.lucidimagination.com
search mailing list posts, wiki pages, web
sites, our blog, etc for latest Lucene/Solr
assistance