Designing a generic Python search
engine API
Richard Boulton
@rboulton
richard@cnav.co.uk
Lots of search engines
● Lucene, Xapian, Sphinx, Solr, ElasticSearch,
Whoosh, Riak Search, Terrier, Lemur/Indri
● Also MySQL, PostgreSQL Full Text
components
● Also client-side engines using Redis, Mongo,
etc.
Generic API?
● Don't know what search features you need in
advance
● So, don't want to be stuck with an early choice.
● Also, don't want to learn a new API just to try
out a new engine.
I Need Input
Philosophy
● Expose common features in standard way
● Emulate missing features
● Don't get in the way
● Build useful features on top
Backend variation
● Fields / Schemas: not in Xapian, Lucene
● Schema modification: Sphinx, Whoosh can't
Updates
● Sphinx: no updates (in progress)
● Does update happen synchronously?
● Do updates return docids, or do docids need to
be supplied by client?
● Can docids be set by client?
● When do updates become live?
Scaling
● Multiple database searches?
● Searching remote databases?
● Replication?
Analysers
● Vary wildly between backends
● Stemming, splitting, n-grams
● Language detection
● Fuzzy searches
● Soundex (well, Metaphone)
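
As an illustration of one of the analysis steps listed above, here is a minimal sketch of character n-gram splitting. The function name `char_ngrams` and the lowercasing step are assumptions made for this example; real backends differ in how they normalise and split text.

```python
def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased string.

    Hypothetical helper for illustration only; it does not mirror any
    particular backend's analyser.
    """
    text = text.lower()
    if len(text) < n:
        # Too short to split: keep the whole string as a single gram.
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("Search"))
```

Indexing these grams instead of whole words is one common way to get substring and approximate matching from an engine that only matches exact terms.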
Queries
● Many different features available
● Most engines support arbitrary booleans
● Some have XOR!
● Some only permit sets of filters
● Weighting schemes
● Need to expose native backend query parsers
Facets
● Information about result set
● Can be emulated (slow)
● Some backends approximate
● Some backends give stats, histograms
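
A rough sketch of what "can be emulated (slow)" might look like: the client walks every matching document and counts field values itself. The name `emulate_facets` and the shape of the result documents (plain dicts) are assumptions for this example, not part of any backend.

```python
from collections import Counter


def emulate_facets(results, field):
    """Count the values of `field` across an iterable of result documents.

    Hypothetical client-side fallback: it must visit every matching
    document, which is why emulation is slow on large result sets.
    """
    counts = Counter()
    for doc in results:
        values = doc.get(field, [])
        if isinstance(values, str):
            # Treat a single string as a one-element list of values.
            values = [values]
        counts.update(values)
    # Most frequent values first.
    return counts.most_common()


results = [{"lang": "en"}, {"lang": ["en", "fr"]}, {"title": "no lang field"}]
print(emulate_facets(results, "lang"))
```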
Other features
● Spelling correction
● Numeric and Date range searches
● Geospatial searches (box, geohash, distance)
Proposed design
● SearchClient class for each backend.
● A definition of standard behaviours that all
backends should provide.
● A definition of optional behaviours when more
than one backend provides them.
Proposed design
● Test suite to ensure that all backends support
common features.
● Programmatic way of checking which features a
backend supports? (Or just raise exception)
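
Both options in the last bullet can coexist. The sketch below is one possible shape, with every name (`BaseSearchClient`, `supports`, `require`, `UnsupportedFeature`, `ExampleClient`) invented for illustration rather than taken from the multisearch code.

```python
class UnsupportedFeature(Exception):
    """Raised when a backend is asked for a feature it does not provide."""


class BaseSearchClient:
    # Hypothetical: each backend declares the optional features it offers.
    features = frozenset()

    def supports(self, feature):
        """Programmatic check: does this backend offer `feature`?"""
        return feature in self.features

    def require(self, feature):
        """Alternative style: just raise if the feature is missing."""
        if not self.supports(feature):
            raise UnsupportedFeature(
                "%s does not support %r" % (type(self).__name__, feature))


class ExampleClient(BaseSearchClient):
    # An imaginary backend offering two optional features.
    features = frozenset({"facets", "spelling"})


client = ExampleClient()
print(client.supports("facets"), client.supports("geo"))
```

A shared test suite could then skip tests for features where `supports()` returns False and insist that everything else passes on every backend.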
Proposed design
● Convenience SearchClient factory function:
c = multisearch.SearchClient('xapian', path=dbpath)
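
One way such a factory could be built is a small registry keyed by backend name. This is a sketch under assumptions: `register_backend`, `_BACKENDS` and `DummyClient` are hypothetical, and only the call shape `SearchClient(name, **kwargs)` comes from the slide.

```python
# Hypothetical registry mapping backend names to client classes.
_BACKENDS = {}


def register_backend(name):
    """Class decorator that records a client class under `name`."""
    def decorator(cls):
        _BACKENDS[name] = cls
        return cls
    return decorator


@register_backend("dummy")
class DummyClient:
    """Stand-in backend used only for this example."""

    def __init__(self, path=None):
        self.path = path


def SearchClient(backend, **kwargs):
    """Factory: look up the backend class and pass the options through."""
    try:
        cls = _BACKENDS[backend]
    except KeyError:
        raise ValueError("unknown backend: %r" % (backend,))
    return cls(**kwargs)


c = SearchClient("dummy", path="/tmp/db")
print(type(c).__name__, c.path)
```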
Documents
● Must support dictionary of fields
● Unicode values
● List(unicode) values
● May support arbitrary other field types, or
different data structures, if backend wants to.
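
The minimum document contract above could be checked with something like the following sketch. `check_document` is a hypothetical helper, and it assumes Python 3, where `str` is the unicode type; a backend that accepts richer field types would relax it.

```python
def check_document(doc):
    """Verify that `doc` is a dict whose values are str or list of str.

    Hypothetical validator for the minimum contract only.
    """
    if not isinstance(doc, dict):
        raise TypeError("document must be a dictionary of fields")
    for name, value in doc.items():
        if isinstance(value, str):
            continue
        if isinstance(value, list) and all(isinstance(v, str) for v in value):
            continue
        raise TypeError("field %r must be text or a list of text" % (name,))
    return doc


check_document({"title": "Generic search", "tags": ["python", "search"]})
```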
Schemas
● Fields have types
● Automatic type “guessing” (client or server side)
● Some standard minimal set of analysers
● Text in a language
● Untokenised values
● Don't define exact output; just intent of standard
analysers.
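
Client-side type "guessing" might be as simple as inspecting the first value seen for a field. The rules and type names below ("text", "exact", "number", "date", "boolean") are assumptions made up for this sketch; the slides do not specify them.

```python
import datetime


def guess_field_type(value):
    """Guess a schema type name from a sample field value.

    Hypothetical heuristic: strings containing whitespace are treated as
    language text, other strings as untokenised values.
    """
    if isinstance(value, bool):
        # Checked before numbers because bool is a subclass of int.
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, (datetime.date, datetime.datetime)):
        return "date"
    if isinstance(value, str):
        has_space = any(ch.isspace() for ch in value.strip())
        return "text" if has_space else "exact"
    raise TypeError("cannot guess a type for %r" % (value,))


print(guess_field_type("a generic search API"), guess_field_type("ISBN-123"))
```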
Search representation
● Abstract query representation
● Tree of Python objects.
● Overloaded operators for boolean.
● Chainable methods.
● (have actually written this)
● SearchClient.my_query_type()
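
To make "tree of Python objects" with "overloaded operators for boolean" concrete, here is a small self-contained sketch. The classes `Query`, `Term`, `Bool` and `Not` are invented for illustration and are not the multisearch implementation.

```python
class Query:
    """Base node; the operators build a tree instead of evaluating anything."""

    def __and__(self, other):
        return Bool("AND", [self, other])

    def __or__(self, other):
        return Bool("OR", [self, other])

    def __invert__(self):
        return Not(self)


class Term(Query):
    """Leaf node: a value to match in a field."""

    def __init__(self, field, value):
        self.field, self.value = field, value

    def __repr__(self):
        return "%s:%s" % (self.field, self.value)


class Bool(Query):
    """Inner node joining child queries with a boolean operator."""

    def __init__(self, op, children):
        self.op, self.children = op, children

    def __repr__(self):
        return "(" + (" %s " % self.op).join(map(repr, self.children)) + ")"


class Not(Query):
    """Inner node negating a single child query."""

    def __init__(self, child):
        self.child = child

    def __repr__(self):
        return "NOT " + repr(self.child)


q = (Term("title", "python") | Term("title", "search")) & ~Term("lang", "fr")
print(q)
```

Because the tree is backend-neutral, each client can walk it and emit its own native query, or raise if it meets a node type it cannot express.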
Code
● Such as it is, on
http://github.com/rboulton/multisearch
● Suggestions for a better name appreciated
● Query representation is pretty good, rest is
pretty rough.

Designing a generic Python Search Engine API - BarCampLondon 8
