How does a full-text search engine work? How is the index built and searched? Can I use PostgreSQL as a full-text search engine or should I go for a more specialised solution? How does one configure and use PostgreSQL search?
This presentation covers all those aspects, based on the work we did to index teowaki.com. It was presented at PgConf EU 2014 in Madrid.
15. indexing in depth
* choose an index format
* tokenize the words
* apply token analysis/filters
* discard unwanted tokens
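In PostgreSQL terms, the last three steps above are roughly what a single `to_tsvector` call does. A minimal illustration, assuming the built-in `english` configuration (the sample sentence is the one used in the PostgreSQL documentation, not one from the talk):

```sql
-- Tokenize, stem, and drop stopwords in one call.
-- The numbers after each lexeme are word positions in the original text.
SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
-- Result: 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
```

Note how "a", "on" and "it" are gone (stopwords) and "rats" has been reduced to "rat" (stemming).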
16. the index format
* R-tree (GiST in PostgreSQL)
* inverted indexes (GIN in PostgreSQL)
* dynamic/distributed indexes
17. dynamic indexes: segmentation
* sometimes the token index is segmented to allow faster updates
* consolidate segments to speed up search and account for deletions
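PostgreSQL's GIN index has a loosely comparable mechanism: with `fastupdate` enabled, new entries go to a pending list first and are merged into the main index structure later. The sketch below is only an analogy to the segment idea above; the `pgweb` table definition and the index name `pgweb_fast_idx` are assumptions made for illustration.

```sql
-- Hypothetical table; only the index options matter here.
CREATE TABLE IF NOT EXISTS pgweb (id serial PRIMARY KEY, title text, body text);

-- fastupdate = on buffers new entries in a pending list instead of
-- updating the main GIN structure on every insert.
CREATE INDEX pgweb_fast_idx ON pgweb
  USING gin (to_tsvector('english', body))
  WITH (fastupdate = on);

-- VACUUM moves the pending entries into the main index structure
-- (run it outside a transaction block).
VACUUM pgweb;
```

Searches have to scan the pending list as well as the main index, so a long pending list slows queries down until it is merged.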
20. more token analysis/filters
* eliminate stopwords
* store word distance/frequency
* store the full contents of some fields
* store some fields as attributes/facets
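As a small illustration of the first two points, the same sentence can be analysed with two built-in PostgreSQL configurations. This is a sketch, assuming the stock `simple` and `english` configurations; the sentence is made up for the example.

```sql
-- 'simple' only lowercases tokens; 'english' also removes stopwords and stems.
-- Both results keep word positions, which is what distance-based
-- ranking relies on.
SELECT to_tsvector('simple',  'the cats of the house') AS keeps_stopwords,
       to_tsvector('english', 'the cats of the house') AS drops_stopwords;

-- strip() discards the positions when only the lexemes are needed.
SELECT strip(to_tsvector('english', 'the cats of the house')) AS lexemes_only;
```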
21. “the index file” is really
* a token file, probably segmented/distributed
* some dictionary files: synonyms, thesaurus, stopwords, stems/lexemes (in different languages)
* word distance/frequency info
* attributes/original field files
* optional geospatial index
* auxiliary files: word/sentence boundaries, meta-info, parser definitions, datasource definitions...
23. searching in depth
* tokenize/analyse
* prepare operators
* retrieve information
* rank the results
* highlight the matched parts
24. searching in depth: tokenize
normalize, tokenize, and analyse the original search term
the result would be a tokenized, stemmed, “synonymised” term, without stopwords
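In PostgreSQL this step is done by `to_tsquery` or `plainto_tsquery`. A minimal example, assuming the built-in `english` configuration (both calls appear in the PostgreSQL documentation):

```sql
-- The search term goes through the same analysis as the indexed text:
-- lowercased, stopword "The" removed, "Rats" stemmed to "rat".
SELECT to_tsquery('english', 'The & Fat & Rats');     -- 'fat' & 'rat'

-- plainto_tsquery accepts plain text and joins the terms with AND.
SELECT plainto_tsquery('english', 'The Fat Rats');    -- 'fat' & 'rat'
```

Using the same configuration for indexing and for searching is what makes the stemmed query match the stemmed index.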
26. searching in depth: retrieval
Go through the token index files, use the attributes and geospatial files if necessary for operators and/or grouping
You might need to do this in a distributed way
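When the search engine is PostgreSQL, attributes and grouping are ordinary SQL next to the full-text match. A self-contained sketch; the `demo_docs` table, its columns and its rows are invented for illustration:

```sql
-- Hypothetical table with two attributes (category, created) and a text field.
CREATE TEMP TABLE demo_docs (
  id       serial PRIMARY KEY,
  category text,
  created  date,
  body     text
);

INSERT INTO demo_docs (category, created, body) VALUES
  ('talks', '2014-10-21', 'PostgreSQL full-text search demystified'),
  ('talks', '2013-05-01', 'Searching with external engines'),
  ('notes', '2014-09-15', 'Notes on PostgreSQL search indexes');

-- Full-text match, attribute filter and grouping in a single query.
SELECT category, count(*) AS hits
FROM demo_docs
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'postgresql & search')
  AND created >= DATE '2014-01-01'
GROUP BY category
ORDER BY category;
-- Returns one hit for 'notes' and one for 'talks'.
```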
27. searching in depth: ranking
algorithm to sort the most relevant results:
* field weights
* word frequency/density
* geospatial or timestamp ranking
* ad-hoc ranking strategies
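Field weights can be sketched in PostgreSQL with `setweight`, which labels lexemes A to D so that `ts_rank` scores a title match higher than a body match. The two sample rows are invented for the example, and the default weight values of `ts_rank` are assumed.

```sql
-- Title lexemes get weight A (highest), body lexemes weight D (lowest).
SELECT title,
       ts_rank(
         setweight(to_tsvector('english', title), 'A') ||
         setweight(to_tsvector('english', body),  'D'),
         query
       ) AS rank
FROM (VALUES
        ('PostgreSQL search',    'A talk about indexes'),
        ('A talk about indexes', 'It mentions PostgreSQL search once')
     ) AS docs(title, body),
     to_tsquery('english', 'postgresql & search') AS query
ORDER BY rank DESC;
-- The row whose title contains both terms is listed first.
```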
28. searching in depth: highlighting
Mark the matching parts of the results
It can be tricky/slow if you are not storing the full contents in your indexes
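PostgreSQL's version of this step is `ts_headline`. It works on the original document text rather than on the `tsvector`, which is why the full contents must be available. A minimal sketch, assuming the `english` configuration; the `<mark>` tags and the word limits are arbitrary choices for the example.

```sql
SELECT ts_headline(
  'english',
  'The most common type of search is to find all documents containing given query terms and return them in order of their similarity to the query.',
  to_tsquery('english', 'query & similarity'),
  'StartSel = <mark>, StopSel = </mark>, MaxWords = 10, MinWords = 5'
);
```

Because the text is re-parsed for every row, it is usually worth calling `ts_headline` only on the rows that will actually be displayed.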
30. search features
* index format configuration
* partial search
* word boundaries parser (not configurable)
* stemmers/synonyms/thesaurus/stopwords
* full-text logical operators
* attributes/geo/timestamp/range (using SQL)
* ranking strategies
* highlighting
* debugging/testing commands
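Two of the features above, logical operators and partial (prefix) search, can be tried directly in `psql`. A small sketch, assuming the `english` configuration; the sample text and search terms are made up.

```sql
-- & is AND, | is OR, ! is NOT; a trailing :* turns a term into a prefix match.
SELECT to_tsvector('english', 'PostgreSQL full-text search')
         @@ to_tsquery('english', 'postgres:* & !mysql') AS prefix_and_not,
       to_tsvector('english', 'PostgreSQL full-text search')
         @@ to_tsquery('english', 'sphinx | search')     AS either_term;
-- Both columns are true.
```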
31. indexing in postgresql
you don't actually need an index to use full-text search in PostgreSQL
but unless your db is very small, you want to have one
Choose GiST or GIN (GIN gives faster search, but slower indexing and a larger index size)
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(config_name, body));
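For an expression index like this one, the planner can only use the index when the query repeats the same expression, configuration included. A minimal sketch with a fixed `english` configuration; the `pgweb` table definition and the index name `pgweb_english_idx` are assumptions for the example.

```sql
-- Hypothetical table definition for illustration.
CREATE TABLE IF NOT EXISTS pgweb (id serial PRIMARY KEY, title text, body text);

CREATE INDEX IF NOT EXISTS pgweb_english_idx ON pgweb
  USING gin (to_tsvector('english', body));

-- Same two-argument to_tsvector call as in the index definition,
-- so this query is able to use pgweb_english_idx.
SELECT title
FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');
```

A query written as `to_tsvector(body) @@ ...` (one argument) would not match the index expression and would fall back to scanning the table.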
32. Two new things
CREATE INDEX ... USING gin(to_tsvector(config_name, body));
* to_tsvector: the PostgreSQL way of saying “tokenize”
* config_name: tokenizing/analysis rule set
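To see which rule sets are available on a given server, the system catalog can be queried. A short sketch using the standard `pg_ts_config` catalog and the `default_text_search_config` setting:

```sql
-- Configurations installed on this server (english, spanish, simple, ...).
SELECT cfgname FROM pg_ts_config ORDER BY cfgname;

-- The configuration used when to_tsvector is called with one argument.
SHOW default_text_search_config;
```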
37. Configuration
Assign dictionaries (in specific to generic order)
ALTER TEXT SEARCH CONFIGURATION teowaki
  ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
  WITH english_ispell, spanish_ispell, spanish_stem, unaccent, english_stem;
ALTER TEXT SEARCH CONFIGURATION teowaki
  DROP MAPPING FOR email, url, url_path, sfloat, float;
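The statements above assume the configuration and its dictionaries already exist. A reduced sketch of the setup, using only dictionaries that ship with PostgreSQL: `demo_cfg` is a hypothetical name, and the ispell dictionaries from the slide are left out because they need dictionary files installed on the server.

```sql
-- unaccent is a contrib extension and has to be installed once per database.
CREATE EXTENSION IF NOT EXISTS unaccent;

-- Start from a built-in configuration and adjust it.
CREATE TEXT SEARCH CONFIGURATION demo_cfg (COPY = spanish);

-- unaccent is a filtering dictionary: it strips accents and passes the
-- token on to the next dictionary in the list.
ALTER TEXT SEARCH CONFIGURATION demo_cfg
  ALTER MAPPING FOR hword, hword_part, word
  WITH unaccent, spanish_stem;

-- Accented and unaccented spellings now produce the same lexeme.
SELECT to_tsvector('demo_cfg', 'búsquedas') @@ to_tsquery('demo_cfg', 'busquedas');
```

One design point worth knowing: a Snowball stemmer such as `spanish_stem` recognizes every token, so dictionaries listed after it are never consulted. That is why it comes last in this sketch.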
38. debugging
select * from ts_debug('teowaki', 'I am searching unas búsquedas con postgresql database');
also ts_lexize and ts_parse
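The other two functions test one piece of the pipeline at a time: `ts_lexize` checks a single dictionary and `ts_parse` shows the raw tokens before any dictionary is applied. A short sketch using the built-in `english_stem` dictionary and `default` parser (the inputs are the ones from the PostgreSQL documentation):

```sql
-- What does one dictionary do with one word?
SELECT ts_lexize('english_stem', 'stars');   -- {star}
SELECT ts_lexize('english_stem', 'a');       -- {} (recognized as a stopword)

-- Raw tokens and their numeric token types, before any dictionary runs.
SELECT * FROM ts_parse('default', '123 - a number');

-- Names and descriptions of the token types the parser can emit.
SELECT alias, description FROM ts_token_type('default');
```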
45. ranking
SELECT name, ts_rank(to_tsvector(name), query) rank
from wakis, to_tsquery('postgres | indexes') query
where to_tsvector(name) @@ query order by rank DESC;
also ts_rank_cd
48. When PostgreSQL is not good
* You need to index files (PDF, Odx...)
* Your index is very big (slow reindex)
* You need a distributed index
* You need complex tokenizers
* You need advanced rankers
49. When PostgreSQL is not good
* You want a REST API
* You want sentence/proximity/range/more complex operators
* You want search auto completion
* You want advanced features (alerts...)
50. But it has been perfect for us so far.
Our users don't care which search engine we use, as long as it works.
51. PgConf EU 2014 presents
Javier Ramirez in PostgreSQL Full-text search demystified
@supercoco9
https://teowaki.com