SlideShare uma empresa Scribd logo
1 de 51
Baixar para ler offline
PgConf EU 2014 presents 
Javier Ramirez 
* in * 
PostgreSQL 
Full-text search 
demystified 
@supercoco9 
https://teowaki.com
The problem
our architecture
One does not simply 
SELECT * from stuff where 
content ilike '%postgresql%'
Basic search features 
* stemmers (run, runner, running) 
* unaccented (josé, jose) 
* results highlighting 
* rank results by relevance
Nice to have features 
* partial searches 
* search operators (OR, AND...) 
* synonyms (postgres, postgresql, pgsql) 
* thesaurus (OS=Operating System) 
* fast, and space-efficient 
* debugging
Good News: 
PostgreSQL supports all 
the requested features
Bad News: 
unless you already know about search 
engines, the official docs are not obvious
How a search engine works 
* An indexing phase 
* A search phase
The indexing phase 
Convert the input text to tokens
The search phase 
Match the search terms to 
the indexed tokens
indexing in depth 
* choose an index format 
* tokenize the words 
* apply token analysis/filters 
* discard unwanted tokens
the index format 
* r-tree (GIST in PostgreSQL) 
* inverse indexes (GIN in PostgreSQL) 
* dynamic/distributed indexes
dynamic indexes: segmentation 
* sometimes the token index is 
segmented to allow faster updates 
* consolidate segments to speed-up 
search and account for deletions
tokenizing 
* parse/strip/convert format 
* normalize terms (unaccent, ascii, 
charsets, case folding, number precision..)
token analysis/filters 
* find synonyms 
* expand thesaurus 
* stem (maybe in different languages)
more token analysis/filters 
* eliminate stopwords 
* store word distance/frequency 
* store the full contents of some fields 
* store some fields as attributes/facets
“the index file” is really 
* a token file, probably segmented/distributed 
* some dictionary files: synonyms, thesaurus, 
stopwords, stems/lexems (in different languages) 
* word distance/frequency info 
* attributes/original field files 
* optional geospatial index 
* auxiliary files: word/sentence boundaries, meta-info, 
parser definitions, datasource definitions...
the hardest 
part is now 
over
searching in depth 
* tokenize/analyse 
* prepare operators 
* retrieve information 
* rank the results 
* highlight the matched parts
searching in depth: tokenize 
normalize, tokenize, and analyse 
the original search term 
the result would be a tokenized, stemmed, 
“synonymised” term, without stopwords
searching in depth: operators 
* partial search 
* logical/geospatial/range operators 
* in-sentence/in-paragraph/word distance 
* faceting/grouping
searching in depth: retrieval 
Go through the token index files, use the 
attributes and geospatial files if necessary 
for operators and/or grouping 
You might need to do this in a distributed way
searching in depth: ranking 
algorithm to sort the most relevant results: 
* field weights 
* word frequency/density 
* geospatial or timestamp ranking 
* ad-hoc ranking strategies
searching in depth: highlighting 
Mark the matching parts of the results 
It can be tricky/slow if you are not storing the full contents 
in your indexes
PostgreSQL as a 
full-text 
search engine
search features 
* index format configuration 
* partial search 
* word boundaries parser (not configurable) 
* stemmers/synonyms/thesaurus/stopwords 
* full-text logical operators 
* attributes/geo/timestamp/range (using SQL) 
* ranking strategies 
* highlighting 
* debugging/testing commands
indexing in postgresql 
you don't actually need an index to use full-text search in PostgreSQL 
but unless your db is very small, you want to have one 
Choose GIST or GIN (faster search, slower indexing, 
larger index size) 
CREATE INDEX pgweb_idx ON pgweb USING 
gin(to_tsvector(config_name, body));
Two new things 
CREATE INDEX ... USING gin(to_tsvector (config_name, body)); 
* to_tsvector: postgresql way of saying “tokenize” 
* config_name: tokenizing/analysis rule set
Configuration 
CREATE TEXT SEARCH CONFIGURATION 
public.teowaki ( COPY = pg_catalog.english );
Configuration 
CREATE TEXT SEARCH DICTIONARY english_ispell ( 
TEMPLATE = ispell, 
DictFile = en_us, 
AffFile = en_us, 
StopWords = spanglish 
); 
CREATE TEXT SEARCH DICTIONARY spanish_ispell ( 
TEMPLATE = ispell, 
DictFile = es_any, 
AffFile = es_any, 
StopWords = spanish 
);
Configuration 
CREATE TEXT SEARCH DICTIONARY english_stem ( 
TEMPLATE = snowball, 
Language = english, 
StopWords = english 
); 
CREATE TEXT SEARCH DICTIONARY spanish_stem ( 
TEMPLATE= snowball, 
Language = spanish, 
Stopwords = spanish 
);
Configuration 
Parser. 
Word boundaries
Configuration 
Assign dictionaries (in specific to generic order) 
ALTER TEXT SEARCH CONFIGURATION teowaki 
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, 
hword_part 
WITH english_ispell, spanish_ispell, spanish_stem, unaccent, english_stem; 
ALTER TEXT SEARCH CONFIGURATION teowaki 
DROP MAPPING FOR email, url, url_path, sfloat, float;
debugging 
select * from ts_debug('teowaki', 'I am searching unas 
b squedas ú con postgresql database'); 
also ts_lexize and ts_parser
tokenizing 
tokens + position (stopwords are removed, tokens are folded)
searching 
SELECT guid, description from wakis where 
to_tsvector('teowaki',description) 
@@ to_tsquery('teowaki','postgres');
searching 
SELECT guid, description from wakis where 
to_tsvector('teowaki',description) 
@@ to_tsquery('teowaki','postgres:*');
operators 
SELECT guid, description from wakis where 
to_tsvector('teowaki',description) 
@@ to_tsquery('teowaki','postgres | mysql');
ranking weights 
SELECT setweight(to_tsvector(coalesce(name,'')),'A') || 
setweight(to_tsvector(coalesce(description,'')),'B') 
from wakis limit 1;
search by weight
ranking 
SELECT name, ts_rank(to_tsvector(name), query) rank 
from wakis, to_tsquery('postgres | indexes') query 
where to_tsvector(name) @@ query order by rank DESC; 
also ts_rank_cd
highlighting 
SELECT ts_headline(name, query) from wakis, 
to_tsquery('teowaki', 'game|play') query 
where to_tsvector('teowaki', name) @@ query;
USE POSTGRESQL 
FOR EVERYTHING
When PostgreSQL is not good 
* You need to index files (PDF, Odx...) 
* Your index is very big (slow reindex) 
* You need a distributed index 
* You need complex tokenizers 
* You need advanced rankers
When PostgreSQL is not good 
* You want a REST API 
* You want sentence/ proximity/ range/ 
more complex operators 
* You want search auto completion 
* You want advanced features (alerts...)
But it has been 
perfect for us so far. 
Our users don't care 
which search engine 
we use, as long as 
it works.
PgConf EU 2014 presents 
Javier Ramirez 
* in * 
PostgreSQL 
Full-text search 
demystified 
@supercoco9 
https://teowaki.com

Mais conteúdo relacionado

Mais procurados

[2D1]Elasticsearch 성능 최적화
[2D1]Elasticsearch 성능 최적화[2D1]Elasticsearch 성능 최적화
[2D1]Elasticsearch 성능 최적화NAVER D2
 
[2 d1] elasticsearch 성능 최적화
[2 d1] elasticsearch 성능 최적화[2 d1] elasticsearch 성능 최적화
[2 d1] elasticsearch 성능 최적화Henry Jeong
 
Getting started with Elasticsearch and .NET
Getting started with Elasticsearch and .NETGetting started with Elasticsearch and .NET
Getting started with Elasticsearch and .NETTomas Jansson
 
Morphia: Simplifying Persistence for Java and MongoDB
Morphia:  Simplifying Persistence for Java and MongoDBMorphia:  Simplifying Persistence for Java and MongoDB
Morphia: Simplifying Persistence for Java and MongoDBJeff Yemin
 
Новые возможности полнотекстового поиска в PostgreSQL / Олег Бартунов (Postgr...
Новые возможности полнотекстового поиска в PostgreSQL / Олег Бартунов (Postgr...Новые возможности полнотекстового поиска в PostgreSQL / Олег Бартунов (Postgr...
Новые возможности полнотекстового поиска в PostgreSQL / Олег Бартунов (Postgr...Ontico
 
Fast querying indexing for performance (4)
Fast querying   indexing for performance (4)Fast querying   indexing for performance (4)
Fast querying indexing for performance (4)MongoDB
 
Indexing and Query Optimizer (Mongo Austin)
Indexing and Query Optimizer (Mongo Austin)Indexing and Query Optimizer (Mongo Austin)
Indexing and Query Optimizer (Mongo Austin)MongoDB
 
[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자
[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자
[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자Donghyeok Kang
 
Webinar: Index Tuning and Evaluation
Webinar: Index Tuning and EvaluationWebinar: Index Tuning and Evaluation
Webinar: Index Tuning and EvaluationMongoDB
 
Non-Relational Postgres / Bruce Momjian (EnterpriseDB)
Non-Relational Postgres / Bruce Momjian (EnterpriseDB)Non-Relational Postgres / Bruce Momjian (EnterpriseDB)
Non-Relational Postgres / Bruce Momjian (EnterpriseDB)Ontico
 
Indexing and Query Optimization
Indexing and Query OptimizationIndexing and Query Optimization
Indexing and Query OptimizationMongoDB
 
Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Holden Karau
 
Indexing & Query Optimization
Indexing & Query OptimizationIndexing & Query Optimization
Indexing & Query OptimizationMongoDB
 
Ts archiving
Ts   archivingTs   archiving
Ts archivingConfiz
 
MongoDB World 2016: Deciphering .explain() Output
MongoDB World 2016: Deciphering .explain() OutputMongoDB World 2016: Deciphering .explain() Output
MongoDB World 2016: Deciphering .explain() OutputMongoDB
 
Reactive Access to MongoDB from Java 8
Reactive Access to MongoDB from Java 8Reactive Access to MongoDB from Java 8
Reactive Access to MongoDB from Java 8Hermann Hueck
 
[2C6]SQLite DB 의 입출력 특성분석 : Android 와 Tizen 사례
[2C6]SQLite DB 의 입출력 특성분석 : Android 와 Tizen 사례[2C6]SQLite DB 의 입출력 특성분석 : Android 와 Tizen 사례
[2C6]SQLite DB 의 입출력 특성분석 : Android 와 Tizen 사례NAVER D2
 

Mais procurados (20)

[2D1]Elasticsearch 성능 최적화
[2D1]Elasticsearch 성능 최적화[2D1]Elasticsearch 성능 최적화
[2D1]Elasticsearch 성능 최적화
 
MongoDB-SESSION03
MongoDB-SESSION03MongoDB-SESSION03
MongoDB-SESSION03
 
[2 d1] elasticsearch 성능 최적화
[2 d1] elasticsearch 성능 최적화[2 d1] elasticsearch 성능 최적화
[2 d1] elasticsearch 성능 최적화
 
Getting started with Elasticsearch and .NET
Getting started with Elasticsearch and .NETGetting started with Elasticsearch and .NET
Getting started with Elasticsearch and .NET
 
Morphia: Simplifying Persistence for Java and MongoDB
Morphia:  Simplifying Persistence for Java and MongoDBMorphia:  Simplifying Persistence for Java and MongoDB
Morphia: Simplifying Persistence for Java and MongoDB
 
Новые возможности полнотекстового поиска в PostgreSQL / Олег Бартунов (Postgr...
Новые возможности полнотекстового поиска в PostgreSQL / Олег Бартунов (Postgr...Новые возможности полнотекстового поиска в PostgreSQL / Олег Бартунов (Postgr...
Новые возможности полнотекстового поиска в PostgreSQL / Олег Бартунов (Postgr...
 
Fast querying indexing for performance (4)
Fast querying   indexing for performance (4)Fast querying   indexing for performance (4)
Fast querying indexing for performance (4)
 
Indexing and Query Optimizer (Mongo Austin)
Indexing and Query Optimizer (Mongo Austin)Indexing and Query Optimizer (Mongo Austin)
Indexing and Query Optimizer (Mongo Austin)
 
[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자
[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자
[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자
 
How to Use JSON in MySQL Wrong
How to Use JSON in MySQL WrongHow to Use JSON in MySQL Wrong
How to Use JSON in MySQL Wrong
 
Webinar: Index Tuning and Evaluation
Webinar: Index Tuning and EvaluationWebinar: Index Tuning and Evaluation
Webinar: Index Tuning and Evaluation
 
Elastic search 검색
Elastic search 검색Elastic search 검색
Elastic search 검색
 
Non-Relational Postgres / Bruce Momjian (EnterpriseDB)
Non-Relational Postgres / Bruce Momjian (EnterpriseDB)Non-Relational Postgres / Bruce Momjian (EnterpriseDB)
Non-Relational Postgres / Bruce Momjian (EnterpriseDB)
 
Indexing and Query Optimization
Indexing and Query OptimizationIndexing and Query Optimization
Indexing and Query Optimization
 
Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014
 
Indexing & Query Optimization
Indexing & Query OptimizationIndexing & Query Optimization
Indexing & Query Optimization
 
Ts archiving
Ts   archivingTs   archiving
Ts archiving
 
MongoDB World 2016: Deciphering .explain() Output
MongoDB World 2016: Deciphering .explain() OutputMongoDB World 2016: Deciphering .explain() Output
MongoDB World 2016: Deciphering .explain() Output
 
Reactive Access to MongoDB from Java 8
Reactive Access to MongoDB from Java 8Reactive Access to MongoDB from Java 8
Reactive Access to MongoDB from Java 8
 
[2C6]SQLite DB 의 입출력 특성분석 : Android 와 Tizen 사례
[2C6]SQLite DB 의 입출력 특성분석 : Android 와 Tizen 사례[2C6]SQLite DB 의 입출력 특성분석 : Android 와 Tizen 사례
[2C6]SQLite DB 의 입출력 특성분석 : Android 와 Tizen 사례
 

Semelhante a Postgresql search demystified

Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to ElasticsearchSperasoft
 
Sumo Logic "How to" Webinar: Advanced Analytics
Sumo Logic "How to" Webinar: Advanced AnalyticsSumo Logic "How to" Webinar: Advanced Analytics
Sumo Logic "How to" Webinar: Advanced AnalyticsSumo Logic
 
Tutorial on developing a Solr search component plugin
Tutorial on developing a Solr search component pluginTutorial on developing a Solr search component plugin
Tutorial on developing a Solr search component pluginsearchbox-com
 
10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQL10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQLSatoshi Nagayasu
 
Building node.js applications with Database Jones
Building node.js applications with Database JonesBuilding node.js applications with Database Jones
Building node.js applications with Database JonesJohn David Duncan
 
Examiness hints and tips from the trenches
Examiness hints and tips from the trenchesExaminess hints and tips from the trenches
Examiness hints and tips from the trenchesIsmail Mayat
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introductionotisg
 
Compass Framework
Compass FrameworkCompass Framework
Compass FrameworkLukas Vlcek
 
Declarative Development Using Annotations In PHP
Declarative Development Using Annotations In PHPDeclarative Development Using Annotations In PHP
Declarative Development Using Annotations In PHPStephan Schmidt
 
Declarative Development Using Annotations In PHP
Declarative Development Using Annotations In PHPDeclarative Development Using Annotations In PHP
Declarative Development Using Annotations In PHPstubbles
 
Postgresql Database Administration Basic - Day2
Postgresql  Database Administration Basic  - Day2Postgresql  Database Administration Basic  - Day2
Postgresql Database Administration Basic - Day2PoguttuezhiniVP
 
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 TaipeiPostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 TaipeiSatoshi Nagayasu
 
Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...Sameer Tiwari
 
Pxb For Yapc2008
Pxb For Yapc2008Pxb For Yapc2008
Pxb For Yapc2008maximgrp
 
Simplifying Persistence for Java and MongoDB with Morphia
Simplifying Persistence for Java and MongoDB with MorphiaSimplifying Persistence for Java and MongoDB with Morphia
Simplifying Persistence for Java and MongoDB with MorphiaMongoDB
 
Hands On Spring Data
Hands On Spring DataHands On Spring Data
Hands On Spring DataEric Bottard
 
Introducing Struts 2
Introducing Struts 2Introducing Struts 2
Introducing Struts 2wiradikusuma
 
Softshake - Offline applications
Softshake - Offline applicationsSoftshake - Offline applications
Softshake - Offline applicationsjeromevdl
 

Semelhante a Postgresql search demystified (20)

Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to Elasticsearch
 
Sumo Logic "How to" Webinar: Advanced Analytics
Sumo Logic "How to" Webinar: Advanced AnalyticsSumo Logic "How to" Webinar: Advanced Analytics
Sumo Logic "How to" Webinar: Advanced Analytics
 
Tutorial on developing a Solr search component plugin
Tutorial on developing a Solr search component pluginTutorial on developing a Solr search component plugin
Tutorial on developing a Solr search component plugin
 
10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQL10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQL
 
Building node.js applications with Database Jones
Building node.js applications with Database JonesBuilding node.js applications with Database Jones
Building node.js applications with Database Jones
 
Get to know PostgreSQL!
Get to know PostgreSQL!Get to know PostgreSQL!
Get to know PostgreSQL!
 
Examiness hints and tips from the trenches
Examiness hints and tips from the trenchesExaminess hints and tips from the trenches
Examiness hints and tips from the trenches
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
 
Compass Framework
Compass FrameworkCompass Framework
Compass Framework
 
Declarative Development Using Annotations In PHP
Declarative Development Using Annotations In PHPDeclarative Development Using Annotations In PHP
Declarative Development Using Annotations In PHP
 
Declarative Development Using Annotations In PHP
Declarative Development Using Annotations In PHPDeclarative Development Using Annotations In PHP
Declarative Development Using Annotations In PHP
 
Postgresql Database Administration Basic - Day2
Postgresql  Database Administration Basic  - Day2Postgresql  Database Administration Basic  - Day2
Postgresql Database Administration Basic - Day2
 
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 TaipeiPostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
 
Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...
 
Pxb For Yapc2008
Pxb For Yapc2008Pxb For Yapc2008
Pxb For Yapc2008
 
Simplifying Persistence for Java and MongoDB with Morphia
Simplifying Persistence for Java and MongoDB with MorphiaSimplifying Persistence for Java and MongoDB with Morphia
Simplifying Persistence for Java and MongoDB with Morphia
 
Hands On Spring Data
Hands On Spring DataHands On Spring Data
Hands On Spring Data
 
ERRest and Dojo
ERRest and DojoERRest and Dojo
ERRest and Dojo
 
Introducing Struts 2
Introducing Struts 2Introducing Struts 2
Introducing Struts 2
 
Softshake - Offline applications
Softshake - Offline applicationsSoftshake - Offline applications
Softshake - Offline applications
 

Mais de javier ramirez

¿Se puede vivir del open source? T3chfest
¿Se puede vivir del open source? T3chfest¿Se puede vivir del open source? T3chfest
¿Se puede vivir del open source? T3chfestjavier ramirez
 
QuestDB: The building blocks of a fast open-source time-series database
QuestDB: The building blocks of a fast open-source time-series databaseQuestDB: The building blocks of a fast open-source time-series database
QuestDB: The building blocks of a fast open-source time-series databasejavier ramirez
 
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...javier ramirez
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...javier ramirez
 
Deduplicating and analysing time-series data with Apache Beam and QuestDB
Deduplicating and analysing time-series data with Apache Beam and QuestDBDeduplicating and analysing time-series data with Apache Beam and QuestDB
Deduplicating and analysing time-series data with Apache Beam and QuestDBjavier ramirez
 
Your Database Cannot Do this (well)
Your Database Cannot Do this (well)Your Database Cannot Do this (well)
Your Database Cannot Do this (well)javier ramirez
 
Your Timestamps Deserve Better than a Generic Database
Your Timestamps Deserve Better than a Generic DatabaseYour Timestamps Deserve Better than a Generic Database
Your Timestamps Deserve Better than a Generic Databasejavier ramirez
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...javier ramirez
 
QuestDB-Community-Call-20220728
QuestDB-Community-Call-20220728QuestDB-Community-Call-20220728
QuestDB-Community-Call-20220728javier ramirez
 
Processing and analysing streaming data with Python. Pycon Italy 2022
Processing and analysing streaming  data with Python. Pycon Italy 2022Processing and analysing streaming  data with Python. Pycon Italy 2022
Processing and analysing streaming data with Python. Pycon Italy 2022javier ramirez
 
QuestDB: ingesting a million time series per second on a single instance. Big...
QuestDB: ingesting a million time series per second on a single instance. Big...QuestDB: ingesting a million time series per second on a single instance. Big...
QuestDB: ingesting a million time series per second on a single instance. Big...javier ramirez
 
Servicios e infraestructura de AWS y la próxima región en Aragón
Servicios e infraestructura de AWS y la próxima región en AragónServicios e infraestructura de AWS y la próxima región en Aragón
Servicios e infraestructura de AWS y la próxima región en Aragónjavier ramirez
 
Primeros pasos en desarrollo serverless
Primeros pasos en desarrollo serverlessPrimeros pasos en desarrollo serverless
Primeros pasos en desarrollo serverlessjavier ramirez
 
How AWS is reinventing the cloud
How AWS is reinventing the cloudHow AWS is reinventing the cloud
How AWS is reinventing the cloudjavier ramirez
 
Analitica de datos en tiempo real con Apache Flink y Apache BEAM
Analitica de datos en tiempo real con Apache Flink y Apache BEAMAnalitica de datos en tiempo real con Apache Flink y Apache BEAM
Analitica de datos en tiempo real con Apache Flink y Apache BEAMjavier ramirez
 
Getting started with streaming analytics
Getting started with streaming analyticsGetting started with streaming analytics
Getting started with streaming analyticsjavier ramirez
 
Getting started with streaming analytics: Setting up a pipeline
Getting started with streaming analytics: Setting up a pipelineGetting started with streaming analytics: Setting up a pipeline
Getting started with streaming analytics: Setting up a pipelinejavier ramirez
 
Getting started with streaming analytics: Deep Dive
Getting started with streaming analytics: Deep DiveGetting started with streaming analytics: Deep Dive
Getting started with streaming analytics: Deep Divejavier ramirez
 
Getting started with streaming analytics: streaming basics (1 of 3)
Getting started with streaming analytics: streaming basics (1 of 3)Getting started with streaming analytics: streaming basics (1 of 3)
Getting started with streaming analytics: streaming basics (1 of 3)javier ramirez
 
Monitorización de seguridad y detección de amenazas con AWS
Monitorización de seguridad y detección de amenazas con AWSMonitorización de seguridad y detección de amenazas con AWS
Monitorización de seguridad y detección de amenazas con AWSjavier ramirez
 

Mais de javier ramirez (20)

¿Se puede vivir del open source? T3chfest
¿Se puede vivir del open source? T3chfest¿Se puede vivir del open source? T3chfest
¿Se puede vivir del open source? T3chfest
 
QuestDB: The building blocks of a fast open-source time-series database
QuestDB: The building blocks of a fast open-source time-series databaseQuestDB: The building blocks of a fast open-source time-series database
QuestDB: The building blocks of a fast open-source time-series database
 
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
 
Deduplicating and analysing time-series data with Apache Beam and QuestDB
Deduplicating and analysing time-series data with Apache Beam and QuestDBDeduplicating and analysing time-series data with Apache Beam and QuestDB
Deduplicating and analysing time-series data with Apache Beam and QuestDB
 
Your Database Cannot Do this (well)
Your Database Cannot Do this (well)Your Database Cannot Do this (well)
Your Database Cannot Do this (well)
 
Your Timestamps Deserve Better than a Generic Database
Your Timestamps Deserve Better than a Generic DatabaseYour Timestamps Deserve Better than a Generic Database
Your Timestamps Deserve Better than a Generic Database
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
 
QuestDB-Community-Call-20220728
QuestDB-Community-Call-20220728QuestDB-Community-Call-20220728
QuestDB-Community-Call-20220728
 
Processing and analysing streaming data with Python. Pycon Italy 2022
Processing and analysing streaming  data with Python. Pycon Italy 2022Processing and analysing streaming  data with Python. Pycon Italy 2022
Processing and analysing streaming data with Python. Pycon Italy 2022
 
QuestDB: ingesting a million time series per second on a single instance. Big...
QuestDB: ingesting a million time series per second on a single instance. Big...QuestDB: ingesting a million time series per second on a single instance. Big...
QuestDB: ingesting a million time series per second on a single instance. Big...
 
Servicios e infraestructura de AWS y la próxima región en Aragón
Servicios e infraestructura de AWS y la próxima región en AragónServicios e infraestructura de AWS y la próxima región en Aragón
Servicios e infraestructura de AWS y la próxima región en Aragón
 
Primeros pasos en desarrollo serverless
Primeros pasos en desarrollo serverlessPrimeros pasos en desarrollo serverless
Primeros pasos en desarrollo serverless
 
How AWS is reinventing the cloud
How AWS is reinventing the cloudHow AWS is reinventing the cloud
How AWS is reinventing the cloud
 
Analitica de datos en tiempo real con Apache Flink y Apache BEAM
Analitica de datos en tiempo real con Apache Flink y Apache BEAMAnalitica de datos en tiempo real con Apache Flink y Apache BEAM
Analitica de datos en tiempo real con Apache Flink y Apache BEAM
 
Getting started with streaming analytics
Getting started with streaming analyticsGetting started with streaming analytics
Getting started with streaming analytics
 
Getting started with streaming analytics: Setting up a pipeline
Getting started with streaming analytics: Setting up a pipelineGetting started with streaming analytics: Setting up a pipeline
Getting started with streaming analytics: Setting up a pipeline
 
Getting started with streaming analytics: Deep Dive
Getting started with streaming analytics: Deep DiveGetting started with streaming analytics: Deep Dive
Getting started with streaming analytics: Deep Dive
 
Getting started with streaming analytics: streaming basics (1 of 3)
Getting started with streaming analytics: streaming basics (1 of 3)Getting started with streaming analytics: streaming basics (1 of 3)
Getting started with streaming analytics: streaming basics (1 of 3)
 
Monitorización de seguridad y detección de amenazas con AWS
Monitorización de seguridad y detección de amenazas con AWSMonitorización de seguridad y detección de amenazas con AWS
Monitorización de seguridad y detección de amenazas con AWS
 

Último

Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
SoftTeco - Software Development Company Profile
SoftTeco - Software Development Company ProfileSoftTeco - Software Development Company Profile
SoftTeco - Software Development Company Profileakrivarotava
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingShane Coughlan
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldRoberto Pérez Alcolea
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptxVinzoCenzo
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...OnePlan Solutions
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingShane Coughlan
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogueitservices996
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxRTS corp
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Rob Geurden
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfRTS corp
 

Último (20)

Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
SoftTeco - Software Development Company Profile
SoftTeco - Software Development Company ProfileSoftTeco - Software Development Company Profile
SoftTeco - Software Development Company Profile
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository world
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
 

Postgresql search demystified

  • 1. PgConf EU 2014 presents Javier Ramirez * in * PostgreSQL Full-text search demystified @supercoco9 https://teowaki.com
  • 4.
  • 5. One does not simply SELECT * from stuff where content ilike '%postgresql%'
  • 6.
  • 7.
  • 8. Basic search features * stemmers (run, runner, running) * unaccented (josé, jose) * results highlighting * rank results by relevance
  • 9. Nice to have features * partial searches * search operators (OR, AND...) * synonyms (postgres, postgresql, pgsql) * thesaurus (OS=Operating System) * fast, and space-efficient * debugging
  • 10. Good News: PostgreSQL supports all the requested features
  • 11. Bad News: unless you already know about search engines, the official docs are not obvious
  • 12. How a search engine works * An indexing phase * A search phase
  • 13. The indexing phase Convert the input text to tokens
  • 14. The search phase Match the search terms to the indexed tokens
  • 15. indexing in depth * choose an index format * tokenize the words * apply token analysis/filters * discard unwanted tokens
  • 16. the index format * r-tree (GIST in PostgreSQL) * inverse indexes (GIN in PostgreSQL) * dynamic/distributed indexes
  • 17. dynamic indexes: segmentation * sometimes the token index is segmented to allow faster updates * consolidate segments to speed-up search and account for deletions
  • 18. tokenizing * parse/strip/convert format * normalize terms (unaccent, ascii, charsets, case folding, number precision..)
  • 19. token analysis/filters * find synonyms * expand thesaurus * stem (maybe in different languages)
  • 20. more token analysis/filters * eliminate stopwords * store word distance/frequency * store the full contents of some fields * store some fields as attributes/facets
  • 21. “the index file” is really * a token file, probably segmented/distributed * some dictionary files: synonyms, thesaurus, stopwords, stems/lexems (in different languages) * word distance/frequency info * attributes/original field files * optional geospatial index * auxiliary files: word/sentence boundaries, meta-info, parser definitions, datasource definitions...
  • 22. the hardest part is now over
  • 23. searching in depth * tokenize/analyse * prepare operators * retrieve information * rank the results * highlight the matched parts
  • 24. searching in depth: tokenize normalize, tokenize, and analyse the original search term the result would be a tokenized, stemmed, “synonymised” term, without stopwords
  • 25. searching in depth: operators * partial search * logical/geospatial/range operators * in-sentence/in-paragraph/word distance * faceting/grouping
  • 26. searching in depth: retrieval Go through the token index files, use the attributes and geospatial files if necessary for operators and/or grouping You might need to do this in a distributed way
  • 27. searching in depth: ranking algorithm to sort the most relevant results: * field weights * word frequency/density * geospatial or timestamp ranking * ad-hoc ranking strategies
  • 28. searching in depth: highlighting Mark the matching parts of the results It can be tricky/slow if you are not storing the full contents in your indexes
  • 29. PostgreSQL as a full-text search engine
  • 30. search features * index format configuration * partial search * word boundaries parser (not configurable) * stemmers/synonyms/thesaurus/stopwords * full-text logical operators * attributes/geo/timestamp/range (using SQL) * ranking strategies * highlighting * debugging/testing commands
  • 31. indexing in postgresql you don't actually need an index to use full-text search in PostgreSQL but unless your db is very small, you want to have one Choose GIST or GIN (faster search, slower indexing, larger index size) CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(config_name, body));
  • 32. Two new things CREATE INDEX ... USING gin(to_tsvector (config_name, body)); * to_tsvector: postgresql way of saying “tokenize” * config_name: tokenizing/analysis rule set
  • 33. Configuration CREATE TEXT SEARCH CONFIGURATION public.teowaki ( COPY = pg_catalog.english );
  • 34. Configuration CREATE TEXT SEARCH DICTIONARY english_ispell ( TEMPLATE = ispell, DictFile = en_us, AffFile = en_us, StopWords = spanglish ); CREATE TEXT SEARCH DICTIONARY spanish_ispell ( TEMPLATE = ispell, DictFile = es_any, AffFile = es_any, StopWords = spanish );
  • 35. Configuration CREATE TEXT SEARCH DICTIONARY english_stem ( TEMPLATE = snowball, Language = english, StopWords = english ); CREATE TEXT SEARCH DICTIONARY spanish_stem ( TEMPLATE= snowball, Language = spanish, Stopwords = spanish );
  • 37. Configuration Assign dictionaries (in specific to generic order) ALTER TEXT SEARCH CONFIGURATION teowaki ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part WITH english_ispell, spanish_ispell, spanish_stem, unaccent, english_stem; ALTER TEXT SEARCH CONFIGURATION teowaki DROP MAPPING FOR email, url, url_path, sfloat, float;
  • 38. debugging select * from ts_debug('teowaki', 'I am searching unas b squedas ú con postgresql database'); also ts_lexize and ts_parser
  • 39. tokenizing tokens + position (stopwords are removed, tokens are folded)
  • 40. searching SELECT guid, description from wakis where to_tsvector('teowaki',description) @@ to_tsquery('teowaki','postgres');
  • 41. searching SELECT guid, description from wakis where to_tsvector('teowaki',description) @@ to_tsquery('teowaki','postgres:*');
  • 42. operators SELECT guid, description from wakis where to_tsvector('teowaki',description) @@ to_tsquery('teowaki','postgres | mysql');
  • 43. ranking weights SELECT setweight(to_tsvector(coalesce(name,'')),'A') || setweight(to_tsvector(coalesce(description,'')),'B') from wakis limit 1;
  • 45. ranking SELECT name, ts_rank(to_tsvector(name), query) rank from wakis, to_tsquery('postgres | indexes') query where to_tsvector(name) @@ query order by rank DESC; also ts_rank_cd
  • 46. highlighting SELECT ts_headline(name, query) from wakis, to_tsquery('teowaki', 'game|play') query where to_tsvector('teowaki', name) @@ query;
  • 47. USE POSTGRESQL FOR EVERYTHING
  • 48. When PostgreSQL is not good * You need to index files (PDF, Odx...) * Your index is very big (slow reindex) * You need a distributed index * You need complex tokenizers * You need advanced rankers
  • 49. When PostgreSQL is not good * You want a REST API * You want sentence/ proximity/ range/ more complex operators * You want search auto completion * You want advanced features (alerts...)
  • 50. But it has been perfect for us so far. Our users don't care which search engine we use, as long as it works.
  • 51. PgConf EU 2014 presents Javier Ramirez * in * PostgreSQL Full-text search demystified @supercoco9 https://teowaki.com