Designing a generic Python Search Engine API - BarCampLondon 8
1. Designing a generic Python search
engine API
Richard Boulton
@rboulton
richard@cnav.co.uk
2. Lots of search engines
● Lucene, Xapian, Sphinx, Solr, ElasticSearch,
Whoosh, Riak Search, Terrier, Lemur/Indri
● Also MySQL, PostgreSQL Full Text
components
● Also client-side engines using Redis, Mongo,
etc.
3. Generic API?
● Don't know what search features you need in
advance
● So, don't want to be stuck with an early choice.
● Also, don't want to learn a new API just to try
out a new engine.
7. Updates
● Sphinx: no updates (in progress)
● Do updates happen synchronously?
● Do updates return docids, or do docids need to
be supplied by the client?
● Can docids be set by the client?
● When do updates become live?
10. Queries
● Many different features available
● Most engines support arbitrary booleans
● Some have XOR!
● Some only permit sets of filters
● Weighting schemes
● Need to expose native backend query parsers
11. Facets
● Information about result set
● Can be emulated (slow)
● Some backends approximate
● Some backends give stats, histograms
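As a rough illustration of why emulated facets are slow, the sketch below counts field values by walking every matching document on the client side. The function name `emulate_facets` and the sample documents are assumptions made for this example; they do not come from any backend.

```python
from collections import Counter


def emulate_facets(documents, field):
    """Count how often each value of `field` occurs in the matching documents.

    Hypothetical helper: it touches every document in the result set,
    which is why emulation is slow compared with native facet support.
    """
    counts = Counter()
    for doc in documents:
        value = doc.get(field)
        if value is None:
            continue
        # A field may hold a single value or a list of values.
        values = value if isinstance(value, (list, tuple)) else [value]
        counts.update(values)
    return counts


# Illustrative result set (made-up data).
results = [
    {"title": "Intro to Xapian", "tag": ["search", "c++"]},
    {"title": "Whoosh basics", "tag": ["search", "python"]},
    {"title": "Untagged"},
]
facets = emulate_facets(results, "tag")
```

A backend with native facets would return these counts without the client fetching the full result set.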
12. Other features
● Spelling correction
● Numeric and Date range searches
● Geospatial searches (box, geohash, distance)
13. Proposed design
● SearchClient class for each backend.
● A definition of standard behaviours that all
backends should provide.
● A definition of optional behaviours when more
than one backend provides them.
14. Proposed design
● Test suite to ensure that all backends support
common features.
● Programmatic way of checking which features a
backend supports? (Or just raise exception)
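One way the proposed design could look is sketched below: a base class with the standard behaviours, a set of optional feature names, and both styles of feature checking (ask first, or raise an exception). Apart from `SearchClient`, every name here (`UnsupportedFeature`, `supports`, `require`, `InMemoryClient`, the feature strings) is a hypothetical choice for illustration, not the API of the code mentioned in this talk.

```python
class UnsupportedFeature(Exception):
    """Raised when a backend is asked for a feature it does not provide."""


class SearchClient:
    """Hypothetical base class: one subclass per backend."""

    # Optional features this backend provides; names are illustrative.
    features = frozenset()

    def supports(self, feature):
        """Programmatic check for an optional feature."""
        return feature in self.features

    def require(self, feature):
        """Alternative style: raise if the feature is missing."""
        if not self.supports(feature):
            raise UnsupportedFeature(
                f"{type(self).__name__} does not support {feature!r}")

    # Standard behaviours every backend would have to implement.
    def update(self, doc, docid=None):
        raise NotImplementedError

    def search(self, query):
        raise NotImplementedError


class InMemoryClient(SearchClient):
    """Toy backend used only to exercise the interface."""

    features = frozenset({"client_docids"})

    def __init__(self):
        self._docs = {}
        self._next_id = 0

    def update(self, doc, docid=None):
        # Allocate a docid when the caller does not supply one.
        if docid is None:
            docid = str(self._next_id)
            self._next_id += 1
        self._docs[docid] = doc
        return docid

    def search(self, query):
        # Naive substring match over all string field values.
        return [docid for docid, doc in self._docs.items()
                if any(query in v for v in doc.values() if isinstance(v, str))]


client = InMemoryClient()
first_id = client.update({"title": "Xapian notes"})
client.update({"title": "Whoosh notes"}, docid="whoosh")
```

A shared test suite could then run the same assertions against every `SearchClient` subclass, skipping tests for features a backend does not list.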
16. Documents
● Must support dictionary of fields
● Unicode values
● List(unicode) values
● May support arbitrary other field types, or
different data structures, if backend wants to.
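The minimal document shape above can be checked with a few lines of code. This sketch assumes Python 3, where `str` is the Unicode string type; the function name `is_portable_document` is made up for the example.

```python
def is_portable_document(doc):
    """Return True if `doc` uses only the field values every backend must accept.

    Hypothetical check: a dictionary whose keys are field names and whose
    values are either Unicode strings or lists of Unicode strings.
    """
    if not isinstance(doc, dict):
        return False
    for name, value in doc.items():
        if not isinstance(name, str):
            return False
        if isinstance(value, str):
            continue
        if isinstance(value, list) and all(isinstance(v, str) for v in value):
            continue
        # Anything else is an optional, backend-specific field type.
        return False
    return True


portable = is_portable_document({"title": "Caf\u00e9 search", "tags": ["a", "b"]})
backend_specific = is_portable_document({"title": "x", "price": 9.99})
```

A document that fails this check is not necessarily invalid; it just relies on field types that only some backends may support.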
17. Schemas
● Fields have types
● Automatic type “guessing” (client or server side)
● Some standard minimal set of analysers
● Text in a language
● Untokenised values
● Don't define exact output; just intent of standard
analysers.
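Client-side type guessing might look something like the sketch below. The type names (`"text"`, `"exact"`, `"numeric"`, `"date"`) and the whitespace heuristic are assumptions made purely for illustration; a real implementation would need better rules.

```python
import datetime


def guess_field_type(value):
    """Guess a schema type from a sample value (illustrative heuristics only)."""
    # bool is a subclass of int, so test it before the numeric case.
    if isinstance(value, bool):
        return "exact"
    if isinstance(value, (int, float)):
        return "numeric"
    # datetime.datetime is a subclass of datetime.date.
    if isinstance(value, datetime.date):
        return "date"
    if isinstance(value, str):
        # Assumption: values containing whitespace are free text to be
        # analysed; single tokens are kept as untokenised values.
        return "text" if " " in value.strip() else "exact"
    if isinstance(value, list) and value:
        # Guess from the first element of a multi-valued field.
        return guess_field_type(value[0])
    raise TypeError(f"cannot guess a field type for {value!r}")


guessed = {
    name: guess_field_type(value)
    for name, value in {
        "body": "full text search in Python",
        "country": "GB",
        "year": 2010,
        "published": datetime.date(2010, 11, 13),
    }.items()
}
```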
18. Search representation
● Abstract query representation
● Tree of Python objects.
● Overloaded operators for boolean operations.
● Chainable methods.
● (have actually written this)
● SearchClient.my_query_type()
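To make the idea concrete, here is a minimal sketch of a query tree with overloaded boolean operators and one chainable method. It is not the representation from the code referred to in this talk; the class names `Query`, `Term`, and `Bool`, the `filter` method, and the string rendering are all assumptions for the example.

```python
class Query:
    """Base node; subclasses form a tree that a backend could translate."""

    def __and__(self, other):
        return Bool("AND", self, other)

    def __or__(self, other):
        return Bool("OR", self, other)

    def __xor__(self, other):
        return Bool("XOR", self, other)

    def filter(self, other):
        """Chainable method: wrap this query in a FILTER node."""
        return Bool("FILTER", self, other)


class Term(Query):
    """Leaf node: a single value in a single field."""

    def __init__(self, field, value):
        self.field, self.value = field, value

    def __repr__(self):
        return f"{self.field}:{self.value}"


class Bool(Query):
    """Interior node: an operator applied to child queries."""

    def __init__(self, op, *children):
        self.op, self.children = op, children

    def __repr__(self):
        return "(" + f" {self.op} ".join(map(repr, self.children)) + ")"


# Build a tree with operators, then chain a filter onto it.
q = (Term("title", "python") | Term("title", "search")).filter(Term("lang", "en"))
```

Because the tree is plain Python objects, each backend can walk it and emit its own native query, or raise an error for operators such as XOR that it does not support.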
19. Code
● Such as it is, on
http://github.com/rboulton/multisearch
● Suggestions for a better name appreciated
● Query representation is pretty good, rest is
pretty rough.