SlideShare uma empresa Scribd logo
1 de 34
Presented by Erik Hatcher
11 June 2013
São Paulo Lucene Meetup
Query Parsing
Interpreting what the user
meant and what they ideally
would like to find is tricky
business. This talk will cover useful
tips and tricks to better leverage and
extend Solr's analysis and query
parsing capabilities to more richly
parse and interpret user queries.
http://wiki.apache.org/solr/QueryParser
• Surround query parser
• _query_:”{!...}” ugliness now unnecessary
• just use nested {!...} expressions
• Version 4.2: "switch" query parser
What’s new in 4.x?
• FieldType awareness
• range queries, numerics
• allows date math
• reverses wildcard terms, if indexing used ReverseWildcardFilter
• Magic fields
• _val_: function query injection
• _query_: nested query, to use a different query parser
• Multi-term analysis (type="multiterm")
• Analyzes prefix, wildcard, regex expressions to normalize diacritics,
lowercase, etc
• If not explicitly defined, all MultiTermAwareComponent's from query
analyzer are used, or KeywordTokenizer for effectively no analysis
"lucene"-syntax query parser
• Simple constrained syntax
• "supports phrases" +requiredTerms -prohibitedTerms loose terms
• Spreads terms across specified query fields (qf) and entire query string
across phrase fields (pf)
• with field-specific boosting
• and explicit and implicit phrase slop
• scores each document with the maximum score for that document as produced
by any subquery; primary score associated with the highest boost, not the sum
of the field scores (as BooleanQuery would give)
• Minimum match (mm) allows query fields gradient between AND and OR
• some number of terms must match, but not all necessarily, and can vary
depending on number of actual query terms
• Additive boost queries (bq) and boost functions (bf)
• Debug output includes parsed boost and function queries
dismax
• defType=parser_name
• defines main query parser
• {!parser_name local=param...}expression
• Can specify parser per query expression
• These are equivalent:
• q=ApacheCon NA
2013&defType=dismax&mm=2&qf=name
• q={!dismax qf=name mm=2}ApacheCon NA 2013
• q={!dismax qf=name mm=2 v='ApacheCon NA 2013'}
Specifying the query parser
Local Parameter Substitution
• Leverages the "lucene" query parser's _query_/{!...} trick
• Example:
• q={!dismax qf='title^2 body' v=$user_query} AND
{!dismax qf='keywords^5 description^2' v=$topic}
• &user_query=ApacheCon NA 2013
• &topic=events
• Setting the complex nested q parameter in a request handler can make the client request
lean and clean
• And even qf and other parameters can be substituted:
• {!dismax qf=$title_qf pf=$title_pf v=$title_query}
• &title_qf=title^5 subtitle^2...
• Real world example, Stanford University Libraries:
• http://searchworks.stanford.edu/advanced
• Insanely complex sets of nested dismax's and qf/pf settings
Nested Query Parsing
• "An advanced multi-field query parser based on the dismax parser"
• Handles "lucene" syntax as well as dismax features
• Fields available to user may be limited (uf)
• including negations and dynamic fields, e.g. uf=* -cost -timestamp
• Shingles query into 2 and 3 term phrases
• Improves quality of results when query contains terms across multiple fields
• pf2/pf3 and ps2/ps3
• removes stop words from shingled phrase queries
• multiplicative "boost" functions
• Additional features
• Query comprised entirely of "stopwords" optionally allowed
• if indexed, but query analyzer is set to remove them
• Allow "lowercaseOperators" by default; or/OR, and/AND
edismax: extended dismax
• FieldType aware, no analysis
• converts to internal representation automatically
• "raw" query parser is similar
• though raw parser is not field type aware; no internal
representation conversion
• Best practice for filtering on single facet value
• fq={!term f=facet_field}crazy:value :)
• no query string escaping needed; but of course still
need URL encoding when appropriate
term query parser
• No field type awareness
• {!prefix f=field_name}prefixValue
• Similar to Lucene query parser
field_name:prefixValue*
• Solr's "lucene" query parser has multiterm
analysis capability, but the prefix query
parser does not analyze
prefix query parser
• Multiplicative to wrapped query score
• Internally used by edismax "boost"
• {!boost b=recip(ms(NOW,mydatefield),
3.16e-11,1,1)}foo
boost query parser
• Same as handling of field:"Some Text" clause by Solr's "lucene" query
parser
• FieldType aware
• TermQuery generated, unless field type has special handling
• TextField
• PhraseQuery: if multiple tokens in different positions
• MultiPhraseQuery: if multiple tokens share some positions
• BooleanQuery: if multiple terms all in same position
• TermQuery: if only a single token
• Other types that handle field queries specially:
• currency, spatial types (point, latlon, etc)
• {!field f=location}49.25,8.883333
field query parser
• Creates Lucene SpanQuery's for fine-grained proximity matching, including use of
wildcards
• Uses infix and prefix notation
• infix:AND/OR/NOT/nW/nN/()
• prefix:AND/OR/nW/nN
• Supports Lucene query parser basics
• field:value, boost^5, wild?c*rd, prefix*
• Proximity operators:
• N: ordered
• W: unordered
• No analysis of clauses
• requires user or search client to lowercase, normalize, etc
• Example:
• q={!surround}Apache* 4w Portland
surround query parser
• Pseudo-join
• Field values from inner result set used to map to another field to select final result
set
• No information from inner result set carries to final result set, such as scores or
field values (it's not SQL!)
• Can join from another local Solr core
• Allows for different types of entities to be indexed in separate indexes altogether,
modeled into clean schemas
• Separate cores can scale independently, especially with commit and warming issues
• Syntax:
• {!join from=... to=... [fromIndex=core_name]}query
• For more information:
• Yonik's Lucene Revolution 2011 presentation: http://vimeo.com/25015101
• http://wiki.apache.org/solr/Join
join query parser
• Operates on geohash, latlon, and point types
• geofilt
• Exact distance filtering
• fq={!geofilt sfield=location pt=10.312,-20.556 d=3.5}
• bbox
• Alternatively use a range query:
• fq=location:[45,-94 TO 46,-93]
• Can use in conjunction with geodist() function
• Sorting:
• sort=geodist() asc
• Returning distance:
• fl=_dist_:geodist()
spatial query parsers
• Match a field term range, textual or numeric
• Example:
• fq={!frange l=0 u=2.2}
sum(user_ranking,editor_ranking)
frange: function range
• acts like a "switch/case" statement
• Example:
• fq={!switch
case.all='*:*'
case.yes='inStock:true'
case.no='inStock:false'
v=$in_stock}
• &in_stock=yes
• Solr 4.2+
switch query parser
• Query's implementing PostFilter interface consulted after query and
all other filters have narrowed documents for consideration
• Queries supporting PostFilter
• frange, geofilt, bbox
• Enabled by setting cache=false and cost >= 100
• Example:
• fq={!frange l=5 cache=false cost=200}div(log(popularity),sqrt(geodist()))
• More info:
• Advanced filter caching
• http://searchhub.org/2012/02/10/advanced-filter-caching-in-solr/
• Custom security filtering
• http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/
PostFilter
• Users tend to expect loose matching
• but with "more exact" matches ranked higher
• Various mechanisms for loosening matching:
• Phonetic sounds-like: cat/kat, similar/similer
• Stemming: search/searches/searched/searching
• Synonyms: cat/feline, dog/canine
• Distinguish ranking between exact and looser matching:
• copyField original to a new (unstored, yet indexed) field with desired
looser matching analysis
• query across original field and looser field, with higher boosting for
original field
• /select?q=ApatchyCon&defType=dismax&qf=name^5 name_phonetic
Phonetic, Stem, Synonym
• Model It AsYou Need It
• Leverage Lucene's Document/Field/Query/score & sort & highlight
• Example 1: Selling automobile parts
• Exact year/make/model is needed to pick the right parts
• Suggest a vehicle as user types
• from the main parts index: tricky, requires lots of special fields and analysis
tricks and even then you're suggesting fields from "parts"
• Another (better?) approach: model vehicles as a separate core, "search"
when suggesting, return documents, not field terms
• Example 2:Technical Conferences
• /select?q=Con&wt=csv&fl=name
• Lucene EuroCon
• ApacheCon
Suggest things, not strings
• The query is the formula that determines each
document's score
• Tuning is about what your application needs
• Build tests using your corpus and real-world queries
and ranking expectations
• Re-run tests frequently/continuously as query
parameters are tweaked
• Tooling, currently, is mostly in-house custom
• but that's changing, stay tuned!
Query parsing and relevancy
• Analysis
• /analysis/field
• ?analysis.fieldname=name
• &analysis.fieldvalue=NA ApacheCon 2013
• &q=apachecon
• &analysis.showmatch=true
• Also /analysis/document
• admin UI analysis tool
• Query Parsing
• &debug=query
• Relevancy
• &debug=results
• shows scoring explanations
Development/
troubleshooting
• JSON query parser
• XML query parser
• PayloadTermQuery parser
• Bitwise qparser
Future of Solr query
parsing
• https://issues.apache.org/jira/browse/SOLR-4351
• Current patch enables these:
• {'term':{'id':'13'}}
• {'field':{'text':'ApacheCon'}}
• {'frange':{'v':'mul(rating,2)', 'l':20,'u':24}}}
• {'join':{'from':'book_id', 'to':'id', 'v':{'term':
{'text':'search'}}}}
JSON query parser
• Will allow a rich query "tree"
• Parameters will fill in variables in a server-
side XSLT query tree definition, or can
provide full query tree
• Useful for "advanced" query, multi-valued,
input
• https://issues.apache.org/jira/browse/
SOLR-839
XML query parser
• Solr supports indexing payload data on terms
using DelimitedPayloadTokenFilter, but currently
no support for querying with payloads
• Requires custom Similarity implementation to
provide score factor for payload data
• Allows index-time weighting of terms
• e.g. <b>bold words</b> weighted higher
• https://issues.apache.org/jira/browse/SOLR-1485
Payload term query parser
• https://issues.apache.org/jira/browse/SOLR-3076
• Lucene provides a way to index a hierarchical
"block" of documents and query it using
ToParentBlockJoinQuery and
ToChildBlockJoinQuery
• Indexing a block is not yet supported by Solr
• Example use case:What books greater than 100
pages have paragraphs containing "information
retrieval"?
BlockJoinQuery
Bitwise query parser
• https://issues.apache.org/jira/browse/
SOLR-1913
• {!bitwise field=user_permissions op=AND
source=3 negate=true}state:FL
• op one of AND, OR, or XOR
• http://www.lucidworks.com
• Slides: http://www.slideshare.net/erikhatcher
• Twitter: @erikhatcher
• erik . hatcher @ lucidworks . com

Mais conteúdo relacionado

Mais de Erik Hatcher

Solr Query Parsing
Solr Query ParsingSolr Query Parsing
Solr Query ParsingErik Hatcher
 
"Solr Update" at code4lib '13 - Chicago
"Solr Update" at code4lib '13 - Chicago"Solr Update" at code4lib '13 - Chicago
"Solr Update" at code4lib '13 - ChicagoErik Hatcher
 
Query Parsing - Tips and Tricks
Query Parsing - Tips and TricksQuery Parsing - Tips and Tricks
Query Parsing - Tips and TricksErik Hatcher
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
What's New in Solr 3.x / 4.0
What's New in Solr 3.x / 4.0What's New in Solr 3.x / 4.0
What's New in Solr 3.x / 4.0Erik Hatcher
 
Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development TutorialErik Hatcher
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes WorkshopErik Hatcher
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
code4lib 2011 preconference: What's New in Solr (since 1.4.1)
code4lib 2011 preconference: What's New in Solr (since 1.4.1)code4lib 2011 preconference: What's New in Solr (since 1.4.1)
code4lib 2011 preconference: What's New in Solr (since 1.4.1)Erik Hatcher
 

Mais de Erik Hatcher (20)

Solr Payloads
Solr PayloadsSolr Payloads
Solr Payloads
 
Solr Query Parsing
Solr Query ParsingSolr Query Parsing
Solr Query Parsing
 
"Solr Update" at code4lib '13 - Chicago
"Solr Update" at code4lib '13 - Chicago"Solr Update" at code4lib '13 - Chicago
"Solr Update" at code4lib '13 - Chicago
 
Query Parsing - Tips and Tricks
Query Parsing - Tips and TricksQuery Parsing - Tips and Tricks
Query Parsing - Tips and Tricks
 
Solr 4
Solr 4Solr 4
Solr 4
 
Solr Recipes
Solr RecipesSolr Recipes
Solr Recipes
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Solr Flair
Solr FlairSolr Flair
Solr Flair
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
What's New in Solr 3.x / 4.0
What's New in Solr 3.x / 4.0What's New in Solr 3.x / 4.0
What's New in Solr 3.x / 4.0
 
Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development Tutorial
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes Workshop
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
code4lib 2011 preconference: What's New in Solr (since 1.4.1)
code4lib 2011 preconference: What's New in Solr (since 1.4.1)code4lib 2011 preconference: What's New in Solr (since 1.4.1)
code4lib 2011 preconference: What's New in Solr (since 1.4.1)
 

Último

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 

Último (20)

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

São Paulo Lucene Meetup: Solr Query Parsing

  • 1. Presented by Erik Hatcher 11 June 2013 São Paulo Lucene Meetup Query Parsing
  • 2.
  • 3. Interpreting what the user meant and what they ideally would like to find is tricky business. This talk will cover useful tips and tricks to better leverage and extend Solr's analysis and query parsing capabilities to more richly parse and interpret user queries.
  • 5.
  • 6.
  • 7. • Surround query parser • _query_:”{!...}” ugliness now unnecessary • just use nested {!...} expressions • Version 4.2: "switch" query parser What’s new in 4.x?
  • 8. • FieldType awareness • range queries, numerics • allows date math • reverses wildcard terms, if indexing used ReverseWildcardFilter • Magic fields • _val_: function query injection • _query_: nested query, to use a different query parser • Multi-term analysis (type="multiterm") • Analyzes prefix, wildcard, regex expressions to normalize diacritics, lowercase, etc • If not explicitly defined, all MultiTermAwareComponent's from query analyzer are used, or KeywordTokenizer for effectively no analysis "lucene"-syntax query parser
  • 9. • Simple constrained syntax • "supports phrases" +requiredTerms -prohibitedTerms loose terms • Spreads terms across specified query fields (qf) and entire query string across phrase fields (pf) • with field-specific boosting • and explicit and implicit phrase slop • scores each document with the maximum score for that document as produced by any subquery; primary score associated with the highest boost, not the sum of the field scores (as BooleanQuery would give) • Minimum match (mm) allows query fields gradient between AND and OR • some number of terms must match, but not all necessarily, and can vary depending on number of actual query terms • Additive boost queries (bq) and boost functions (bf) • Debug output includes parsed boost and function queries dismax
  • 10. • defType=parser_name • defines main query parser • {!parser_name local=param...}expression • Can specify parser per query expression • These are equivalent: • q=ApacheCon NA 2013&defType=dismax&mm=2&qf=name • q={!dismax qf=name mm=2}ApacheCon NA 2013 • q={!dismax qf=name mm=2 v='ApacheCon NA 2013'} Specifying the query parser
  • 12. • Leverages the "lucene" query parser's _query_/{!...} trick • Example: • q={!dismax qf='title^2 body' v=$user_query} AND {!dismax qf='keywords^5 description^2' v=$topic} • &user_query=ApacheCon NA 2013 • &topic=events • Setting the complex nested q parameter in a request handler can make the client request lean and clean • And even qf and other parameters can be substituted: • {!dismax qf=$title_qf pf=$title_pf v=$title_query} • &title_qf=title^5 subtitle^2... • Real world example, Stanford University Libraries: • http://searchworks.stanford.edu/advanced • Insanely complex sets of nested dismax's and qf/pf settings Nested Query Parsing
  • 13. • "An advanced multi-field query parser based on the dismax parser" • Handles "lucene" syntax as well as dismax features • Fields available to user may be limited (uf) • including negations and dynamic fields, e.g. uf=* -cost -timestamp • Shingles query into 2 and 3 term phrases • Improves quality of results when query contains terms across multiple fields • pf2/pf3 and ps2/ps3 • removes stop words from shingled phrase queries • multiplicative "boost" functions • Additional features • Query comprised entirely of "stopwords" optionally allowed • if indexed, but query analyzer is set to remove them • Allow "lowercaseOperators" by default; or/OR, and/AND edismax: extended dismax
  • 14. • FieldType aware, no analysis • converts to internal representation automatically • "raw" query parser is similar • though raw parser is not field type aware; no internal representation conversion • Best practice for filtering on single facet value • fq={!term f=facet_field}crazy:value :) • no query string escaping needed; but of course still need URL encoding when appropriate term query parser
  • 15. • No field type awareness • {!prefix f=field_name}prefixValue • Similar to Lucene query parser field_name:prefixValue* • Solr's "lucene" query parser has multiterm analysis capability, but the prefix query parser does not analyze prefix query parser
  • 16. • Multiplicative to wrapped query score • Internally used by edismax "boost" • {!boost b=recip(ms(NOW,mydatefield), 3.16e-11,1,1)}foo boost query parser
  • 17. • Same as handling of field:"Some Text" clause by Solr's "lucene" query parser • FieldType aware • TermQuery generated, unless field type has special handling • TextField • PhraseQuery: if multiple tokens in different positions • MultiPhraseQuery: if multiple tokens share some positions • BooleanQuery: if multiple terms all in same position • TermQuery: if only a single token • Other types that handle field queries specially: • currency, spatial types (point, latlon, etc) • {!field f=location}49.25,8.883333 field query parser
  • 18. • Creates Lucene SpanQuery's for fine-grained proximity matching, including use of wildcards • Uses infix and prefix notation • infix:AND/OR/NOT/nW/nN/() • prefix:AND/OR/nW/nN • Supports Lucene query parser basics • field:value, boost^5, wild?c*rd, prefix* • Proximity operators: • N: ordered • W: unordered • No analysis of clauses • requires user or search client to lowercase, normalize, etc • Example: • q={!surround}Apache* 4w Portland surround query parser
  • 19. • Pseudo-join • Field values from inner result set used to map to another field to select final result set • No information from inner result set carries to final result set, such as scores or field values (it's not SQL!) • Can join from another local Solr core • Allows for different types of entities to be indexed in separate indexes altogether, modeled into clean schemas • Separate cores can scale independently, especially with commit and warming issues • Syntax: • {!join from=... to=... [fromIndex=core_name]}query • For more information: • Yonik's Lucene Revolution 2011 presentation: http://vimeo.com/25015101 • http://wiki.apache.org/solr/Join join query parser
  • 20. • Operates on geohash, latlon, and point types • geofilt • Exact distance filtering • fq={!geofilt sfield=location pt=10.312,-20.556 d=3.5} • bbox • Alternatively use a range query: • fq=location:[45,-94 TO 46,-93] • Can use in conjunction with geodist() function • Sorting: • sort=geodist() asc • Returning distance: • fl=_dist_:geodist() spatial query parsers
  • 21. • Match a field term range, textual or numeric • Example: • fq={!frange l=0 u=2.2} sum(user_ranking,editor_ranking) frange: function range
  • 22. • acts like a "switch/case" statement • Example: • fq={!switch case.all='*:*' case.yes='inStock:true' case.no='inStock:false' v=$in_stock} • &in_stock=yes • Solr 4.2+ switch query parser
  • 23. • Query's implementing PostFilter interface consulted after query and all other filters have narrowed documents for consideration • Queries supporting PostFilter • frange, geofilt, bbox • Enabled by setting cache=false and cost >= 100 • Example: • fq={!frange l=5 cache=false cost=200}div(log(popularity),sqrt(geodist())) • More info: • Advanced filter caching • http://searchhub.org/2012/02/10/advanced-filter-caching-in-solr/ • Custom security filtering • http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/ PostFilter
  • 24. • Users tend to expect loose matching • but with "more exact" matches ranked higher • Various mechanisms for loosening matching: • Phonetic sounds-like: cat/kat, similar/similer • Stemming: search/searches/searched/searching • Synonyms: cat/feline, dog/canine • Distinguish ranking between exact and looser matching: • copyField original to a new (unstored, yet indexed) field with desired looser matching analysis • query across original field and looser field, with higher boosting for original field • /select?q=ApatchyCon&defType=dismax&qf=name^5 name_phonetic Phonetic, Stem, Synonym
  • 25. • Model It AsYou Need It • Leverage Lucene's Document/Field/Query/score & sort & highlight • Example 1: Selling automobile parts • Exact year/make/model is needed to pick the right parts • Suggest a vehicle as user types • from the main parts index: tricky, requires lots of special fields and analysis tricks and even then you're suggesting fields from "parts" • Another (better?) approach: model vehicles as a separate core, "search" when suggesting, return documents, not field terms • Example 2:Technical Conferences • /select?q=Con&wt=csv&fl=name • Lucene EuroCon • ApacheCon Suggest things, not strings
  • 26. • The query is the formula that determines each document's score • Tuning is about what your application needs • Build tests using your corpus and real-world queries and ranking expectations • Re-run tests frequently/continuously as query parameters are tweaked • Tooling, currently, is mostly in-house custom • but that's changing, stay tuned! Query parsing and relevancy
  • 27. • Analysis • /analysis/field • ?analysis.fieldname=name • &analysis.fieldvalue=NA ApacheCon 2013 • &q=apachecon • &analysis.showmatch=true • Also /analysis/document • admin UI analysis tool • Query Parsing • &debug=query • Relevancy • &debug=results • shows scoring explanations Development/ troubleshooting
  • 28. • JSON query parser • XML query parser • PayloadTermQuery parser • Bitwise qparser Future of Solr query parsing
  • 29. • https://issues.apache.org/jira/browse/SOLR-4351 • Current patch enables these: • {'term':{'id':'13'}} • {'field':{'text':'ApacheCon'}} • {'frange':{'v':'mul(rating,2)', 'l':20,'u':24}}} • {'join':{'from':'book_id', 'to':'id', 'v':{'term': {'text':'search'}}}} JSON query parser
  • 30. • Will allow a rich query "tree" • Parameters will fill in variables in a server- side XSLT query tree definition, or can provide full query tree • Useful for "advanced" query, multi-valued, input • https://issues.apache.org/jira/browse/ SOLR-839 XML query parser
  • 31. • Solr supports indexing payload data on terms using DelimitedPayloadTokenFilter, but currently no support for querying with payloads • Requires custom Similarity implementation to provide score factor for payload data • Allows index-time weighting of terms • e.g. <b>bold words</b> weighted higher • https://issues.apache.org/jira/browse/SOLR-1485 Payload term query parser
  • 32. • https://issues.apache.org/jira/browse/SOLR-3076 • Lucene provides a way to index a hierarchical "block" of documents and query it using ToParentBlockJoinQuery and ToChildBlockJoinQuery • Indexing a block is not yet supported by Solr • Example use case:What books greater than 100 pages have paragraphs containing "information retrieval"? BlockJoinQuery
  • 33. Bitwise query parser • https://issues.apache.org/jira/browse/ SOLR-1913 • {!bitwise field=user_permissions op=AND source=3 negate=true}state:FL • op one of AND, OR, or XOR
  • 34. • http://www.lucidworks.com • Slides: http://www.slideshare.net/erikhatcher • Twitter: @erikhatcher • erik . hatcher @ lucidworks . com