Rank Your Results: Using Full Text Search with Natural
Language Queries in PostgreSQL
to get Ranked Results
Jamey Hanson
jhanson@freedomconsultinggroup.com
jamesphanson@yahoo.com
Freedom Consulting Group
http://www.freedomconsultinggroup.com
PGConf US, NYC
March 26, 2015
Full Text Searching (or just text search) provides the
capability to identify natural-language documents that satisfy
a query, and optionally to sort them by relevance to the
query.
}  PostgreSQL 9.4 documentation, section 12.1
What is PostgreSQL Full Text Search?
}  Focus on semantics, rather than syntax
}  Keep documents within PostgreSQL
}  Apache Solr, Lucene, Sphinx etc. require
their own copy of the data
}  Simple to keep indexes up to date
}  ~20 * faster than SQL search (LIKE,
ILIKE, etc.) with default configuration
}  Fast enough for nearly all applications
What makes Full Text Search so useful?
}  Traditional search reveals document existence,
ranked results indicate relevance
What makes Full Text Search so useful?
}  Customer expectations are that all searches should
include rank
}  FTS includes full suite of PG query tools such as SQL,
regex, LIKE/ILIKE, wildcards and function-based
indexes
}  FTS parser, stop word list, synonym, thesaurus and
language are all customizable at the statement level
}  FTS is extensible with 3rd party dictionaries and more
}  Infrastructure and data wrangling
}  Creating FTS tables with maintenance triggers
}  Compare FTS with traditional SQL searches and run FTSs
on documents from early American History
}  Rank search results on documents from Data Science
}  Generate HTML-tagged fragments with matching terms
}  Customize the stop-word dictionary
}  Suggest spelling options for query terms
}  Re-write queries at run-time
I will move between slides, demonstrations and SQL scripts. We will
not review every slide in the file.
Agenda
Jamey Hanson
jhanson@freedomconsultinggroup.com
jamesphanson@yahoo.com
Manage a team for Freedom Consulting Group migrating
applications from Oracle to Postgres Plus Advanced Server and
PostgreSQL in the government space. We are subcontracting to
EnterpriseDB
Overly certified: PMP, CISSP, CSEP, OCP in 5 versions of Oracle,
Cloudera developer & admin. Used to be NetApp admin and
MCSE. I teach PMP and CISSP at the Univ. MD training center
Alumnus of multiple schools and was C-130 aircrew
About the author
•  PostgreSQL 9.4.1 EnterpriseDB free package
•  CentOS 7.0 VM with 2GB RAM and 2 CPU cores
•  2 sets of documents to search …
•  Primary documents in American History: The American
Revolution and the New Nation
http://www.loc.gov/rr/program/bib/ourdocs/NewNation.html
(Library of Congress)
•  Text from Data Science Girl’s August 15, 2014 blog post
“38 Seminal Articles Every Data Scientist Should Read”
http://www.datasciencecentral.com/profiles/blogs/30-seminal-articles-every-data-scientist-should-read
Presentation infrastructure …
}  Used pgAdmin3 for SQL
and administration
}  A few Linux shell commands
to manage the files
}  American history documents
were cut & pasted from Web into MS Notepad
}  Data Science .pdf files were downloaded, converted to
text with pdftotext and manually divided into abstract
and body files
… Presentation infrastructure
}  FTS is built on lexemes, which are (essentially) word roots
without tense, possessive, plurality or other ending.
“It is a basic unit of meaning, and the headwords of a
dictionary are all lexemes”
The Cambridge Encyclopedia of the English Language
}  For example ...
}  The lexemes of jump, jumps, jumped and jumping are all ‘jump’
}  Excite, excites, exciting and excited are all ‘excit’
}  Lexemes are stored in lower case
(i.e. case insensitive)
How does FTS work?
}  lexemes are organized into TSVECTORs, which are sorted
arrays of lexemes with associated position and (optionally)
weight. Documents are stored as TSVECTORs
}  Query against TSVECTORs using TSQUERYs, which are
arrays of lexemes with BOOLEAN operators but without
position or weight
}  Match a TSQUERY to a TSVECTOR with the @@ operator
How does FTS work?
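A minimal sketch of the three pieces working together, assuming the built-in english configuration. The sample sentence and the column aliases are illustrative only.

```sql
-- Build a tsvector from a document and a tsquery from search terms,
-- then test the match with the @@ operator, which returns a boolean.
SELECT to_tsvector('english', 'We hold these truths to be self evident')
       @@ to_tsquery('english', 'truth & evident')  AS both_terms_match,
       to_tsvector('english', 'We hold these truths to be self evident')
       @@ to_tsquery('english', 'truth & !evident') AS truth_without_evident;
```

The first column should be true because both lexemes are in the document; the second should be false because the query excludes a lexeme that is present.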
1.  Parses text document into tokens using white space, non
printing characters and punctuation
2.  Assigns a class (i.e. type) to each token. 23 classes include
word, email, number, URL, etc.
3.  ‘Word’ tokens are normalized into lexemes using dictionaries
4.  Lexemes are processed to …
a.  Remove stop words (common words such as ‘and’, ‘or’, ‘the’)
b.  Add synonyms
c.  Add phrase matching
5.  Lexemes are assembled into TSVECTORs by noting the
position, recording weight and removing duplicates
This process is controlled by a TEXT SEARCH CONFIGURATION,
which ties the parser to a list of TEXT SEARCH DICTIONARYs
How does TO_TSVECTOR work?
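One way to watch these steps happen is with the built-in ts_token_type and ts_debug functions. This is a sketch that assumes the default parser and the english configuration; the sample sentence is made up for illustration.

```sql
-- List the token classes known to the default parser.
SELECT alias, description
FROM ts_token_type('default');

-- Show, token by token, which class was assigned, which dictionaries
-- were consulted and which lexemes (if any) came out.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('english',
              'Email jhanson@freedomconsultinggroup.com about the 23 token classes');
```

Rows whose lexemes column is an empty array are stop words; rows with a NULL dictionaries column (such as whitespace) are token classes the configuration does not index.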
}  TSVECTORs are compared to TSQUERYs with the @@
operator
}  TSQUERYs are built with the TO_TSQUERY or
PLAINTO_TSQUERY functions …
Never mind … let’s jump to some examples, which are
much easier to understand.
How does FTS match documents?
}  Explore TSVECTORs and TSQUERYs
00_FTS_explore_tsvector_tsquery_v10.sql
GOTO …
}  -- What do lexemes look like?
SELECT TO_TSVECTOR('enumerate') AS enumerate,
TO_TSVECTOR('enumerated') AS enumerated,
TO_TSVECTOR('enumerates') AS enumerates,
TO_TSVECTOR('enumerating') AS enumerating,
TO_TSVECTOR('enumeration') AS enumeration;
-- all forms of the word have the same lexeme, 'enumer'
-- Example tsvector
SELECT TO_TSVECTOR('We hold these truths to be self evident');
-- 'evid':8 'hold':2 'self':7 'truth':4
-- tsvectors are sorted arrays of lexemes with position and
-- (optionally) weight
-- notice that common words, a.k.a. stop words, like 'to' and
-- 'be' are not included
TSVECTOR and TSQUERY
-- tsquery_s are compared with tsvector_s to find matching documents
-- they are composed of lexemes and logical operators
SELECT TO_TSQUERY('with & liberty & and & justice & for & all');
-- 'liberti' & 'justic'
-- Notice that stop words are not included in tsquery_s either
-- can also use PLAINTO_TSQUERY with plain(ish) text
SELECT PLAINTO_TSQUERY('With liberty and justice for all');
-- 'liberti' & 'justic'
TSVECTOR and TSQUERY
}  Explore TSVECTORs and TSQUERYs
00_FTS_explore_tsvector_tsquery_v10.sql
RETURN from …
}  Created at run-time with TO_TSVECTOR
+ simple, does not require storage
- slower queries than pre-computing
}  Created ahead of time with TO_TSVECTOR
+ fast queries, flexible, does not slow ingestion, less CPU work
- can leave TEXT and TSVECTOR out of sync, may not get done
}  Create ahead of time with a trigger
+ fast queries,TSVECTOR always up to date
-  slows ingestion, UPDATE trigger fires even on small changes
}  Two trigger functions are included
tsvector_update_trigger & …_column
How do we create TSVECTORS?
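A minimal sketch of the trigger option, using the built-in tsvector_update_trigger function. The table demo_doc, its columns and the trigger name are hypothetical and are not the presentation's DDL.

```sql
-- Hypothetical table: names are illustrative only.
CREATE TABLE demo_doc (
    id           serial PRIMARY KEY,
    title        text,
    body         text,
    tsv_document tsvector
);

-- Built-in trigger function: rebuilds tsv_document from title and body
-- on every INSERT or UPDATE, using the named configuration.
CREATE TRIGGER demo_doc_tsv_biu
    BEFORE INSERT OR UPDATE ON demo_doc
    FOR EACH ROW
    EXECUTE PROCEDURE tsvector_update_trigger(
        tsv_document, 'pg_catalog.english', title, body);

-- The tsvector column is filled in without any application code.
INSERT INTO demo_doc (title, body)
VALUES ('Declaration', 'We hold these truths to be self evident');

SELECT tsv_document FROM demo_doc;
```

The configuration name is schema-qualified so that the trigger does not change behavior if search_path changes.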
}  GIN (Generalized
Inverted iNdex)
}  GiST (Generalized
Search Tree)
How to make FTS
wickedly fast?
Attribute        GIN            GiST
Speed            3 * faster     Slower
Size             3 * bigger     Smaller
Weighted TSV     Unsupported    Supported
Build speed      Slower         3 * faster
Best practice    Static data    Updated data
See Tomas Vondra's 17-Mar-15 Planet PostgreSQL post on FTS performance for details
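As a sketch, either index type is created on the tsvector column. This assumes the fts_amer_hist table with its title and tsv_document columns used elsewhere in these slides; the index names are hypothetical.

```sql
-- GIN index on the pre-computed tsvector column.
CREATE INDEX idx_fts_amer_hist_tsv_gin
    ON fts_amer_hist USING GIN (tsv_document);

-- GiST alternative, commented out because one index type is normally enough.
-- CREATE INDEX idx_fts_amer_hist_tsv_gist
--     ON fts_amer_hist USING GIST (tsv_document);

-- Check whether the planner chooses the index for an @@ search.
EXPLAIN
SELECT title
FROM fts_amer_hist
WHERE tsv_document @@ to_tsquery('english', 'liberty & justice');
```

On a very small table the planner may still prefer a sequential scan, so the EXPLAIN output is only meaningful once a realistic number of documents is loaded.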
Let’s build our FTS tables
}  Build our FTS tables using 20_FTS_DDL_v10.sql
GOTO …
}  Build our FTS tables using 20_FTS_DDL_v10.sql
RETURN from …
}  Load text documents from the database host using
pg_read_file
}  pg_read_binary_file for .pdf files
}  Files must be in $PGDATA, but symbolic links work
}  Syntax is:
(SELECT * FROM pg_read_file('path/from/$PGDATA/file.txt'))
}  Weighted searches require that the document is divided
into sections. Details forthcoming
}  Can dynamically generate SQL load scripts using
pg_ls_dir or run a script from psql
Loading documents for FTS
}  Load our FTS tables using 30_FTS_Load_v10.sql
}  Update title, author and URL fields with
32_FTS_Update_Titles_v10.sql
GOTO …
}  Load .txt and .pdf files from within $PGDATA
}  We divided the Data Science documents into abstract and
body so that they can be weighted for weighted rank
queries
}  TSVECTORs are created by the BEFORE INSERT OR UPDATE (BIU) trigger
}  Manually updated fts_data_sci.tsv_document
just to show how it is done
}  The update script populates title, author and URL fields.
Load text and .pdf documents
-- Create dynamic SQL to load fts_amer_hist
WITH list_of_files AS (
SELECT pg_ls_dir('Dropbox/FTS/AmHistory/') AS file_name
)
SELECT
'INSERT INTO fts.fts_amer_hist (document, filename) VALUES (
(SELECT * FROM pg_read_file(''Dropbox/FTS/AmHistory/' || file_name || ''')),
''' || file_name || '''); '
FROM list_of_files
ORDER BY file_name;
-- example of a generated statement (here for fts_data_sci) --
INSERT INTO fts.fts_data_sci (
abstract, body, document, pdf_file, pdf_filename) VALUES (
(SELECT * FROM
pg_read_file('Dropbox/FTS/DataScience/WhatMapReduceCanDo_AB.txt')),
(SELECT * FROM
pg_read_file('Dropbox/FTS/DataScience/WhatMapReduceCanDo_BD.txt')),
(SELECT * FROM
pg_read_file('Dropbox/FTS/DataScience/WhatMapReduceCanDo.txt')),
(SELECT * FROM
pg_read_binary_file('Dropbox/FTS/DataScience/WhatMapReduceCanDo.pdf')),
'WhatMapReduceCanDo.pdf');
Dynamic SQL to load files
}  Load our FTS tables using 30_FTS_Load_v10.sql
}  Update details with
32_FTS_Update_Titles_v10.sql
RETURN from …
GOTO …
}  See FTS in action with
40_FTS_explore_fts_amer_hist_v10.sql
Enough with the setup … show me FTS!
}  Compare SQL ILIKE searches with FTS*
}  See how ILIKE misses documents with different word
forms, such as 'enumerate' vs. 'enumeration'
}  See how FTS is ~20 * faster than ILIKE
}  Demonstrate that FTS excludes stop words such as 'the',
'and', & 'or'
}  Demonstrate that FTS includes BOOLEAN logic with
simple syntax
}  *ILIKE is “case insensitive LIKE”
Explore fts_amer_hist
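The comparison can be sketched as follows, assuming the fts_amer_hist table with the title, document and tsv_document columns that appear in the later slides. The search terms in the last query are illustrative and are not taken from the demo script.

```sql
-- Pattern search: finds 'enumerate', 'enumerated' and 'enumerates',
-- but not 'enumeration' or 'enumerating'.
SELECT title
FROM fts_amer_hist
WHERE document ILIKE '%enumerate%';

-- Full text search: every form of the word shares the lexeme 'enumer',
-- so all of them match.
SELECT title
FROM fts_amer_hist
WHERE tsv_document @@ to_tsquery('english', 'enumerate');

-- Boolean logic: & is AND, | is OR, ! is NOT.
SELECT title
FROM fts_amer_hist
WHERE tsv_document @@ to_tsquery('english',
                                 'liberty & (justice | happiness) & !tax');
```

Prefixing each query with EXPLAIN ANALYZE is one way to compare run times on your own data.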
}  See FTS in action with
40_FTS_explore_amer_hist_v10.sql
RETURN from …
}  Result rank starts at 0 (not found) and grows with relevance; it is
calculated at run time based on a search of all documents.
}  That means a top-5-by-rank query (ORDER BY rank DESC LIMIT 5) is
slower than a plain LIMIT 5
}  Two rank functions are available, TS_RANK and
TS_RANK_CD
}  Both consider how often search terms appear
}  Both have an optional normalization parameter that can, for example,
divide the rank by the log of the document length or scale it
into the 0 to 1 range
}  TS_RANK_CD also considers the proximity of search
terms to each other
Ranking results
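A sketch of both rank functions side by side, with and without normalization. It assumes the fts_data_sci table with title and tsv_document columns from these slides; the search terms and column aliases are illustrative.

```sql
SELECT title,
       ts_rank(tsv_document, q)        AS rank_default,
       ts_rank(tsv_document, q, 1)     AS rank_len_norm,  -- divided by 1 + log(document length)
       ts_rank_cd(tsv_document, q)     AS rank_cover_density,
       ts_rank_cd(tsv_document, q, 32) AS rank_cd_scaled  -- rank / (rank + 1), always below 1
FROM fts_data_sci,
     plainto_tsquery('english', 'big data') AS q
WHERE tsv_document @@ q        -- rank only the documents that match
ORDER BY rank_cd_scaled DESC
LIMIT 5;
```

The normalization argument is a bit mask, so options can be combined with |, for example 1|32.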
}  Lexemes in tsvectors can be assigned a weight of
A(high) – D(low), with defaults of {1.0, 0.4, 0.2, 0.1}
}  Weighting does not affect which records are returned,
only their rank
}  Weighted tsvectors are typically built by document section
}  title=A, abstract=B, body=D in our example trigger
new.tsv_weight_document :=
SETWEIGHT(TO_TSVECTOR('pg_catalog.english',
COALESCE(new.title, '')), 'A') ||
SETWEIGHT(TO_TSVECTOR('pg_catalog.english',
COALESCE(new.abstract, '')), 'B') ||
TO_TSVECTOR('pg_catalog.english', COALESCE(new.body, ''));
Building weighted tsvectors
}  The example tsvectors were weighted at build-time by the
trigger
}  Can also build weighted tsvectors at query-time
}  More flexible because different queries can use different weights
}  Requires more code because weighting is done for every query
}  Slightly slower because the source tsvectors must be
concatenated
}  SETWEIGHT(TO_TSVECTOR(title), 'A') ||
SETWEIGHT(TO_TSVECTOR(abstract), 'B') ||
TO_TSVECTOR(body); -- default weight 'D'
Building weighted tsvectors at query-time
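Put into a complete query, query-time weighting might look like the sketch below. It assumes fts_data_sci has the title, abstract, body and tsv_document columns used in the trigger slide; the search terms are illustrative.

```sql
SELECT title,
       ts_rank(
           '{0.1, 0.2, 0.4, 1.0}'::float4[],   -- weights in D, C, B, A order (the defaults)
           setweight(to_tsvector('english', coalesce(title, '')),    'A') ||
           setweight(to_tsvector('english', coalesce(abstract, '')), 'B') ||
           to_tsvector('english', coalesce(body, '')),                -- unlabeled lexemes are weight D
           q) AS rank
FROM fts_data_sci,
     plainto_tsquery('english', 'big data') AS q
WHERE tsv_document @@ q          -- filter on the stored, indexable column
ORDER BY rank DESC
LIMIT 5;
```

Changing the array, for example raising the B value, is how a query could favor abstract matches without rebuilding any stored tsvector.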
What does all this get us?
}  Search for document relevance, not just
existence
}  Customers now expect, and demand, ranked
results
}  The data and the business logic are inside
PostgreSQL, available to any application
}  Generate weighted, ranked document searches with
50_FTS_weighted_ranked_results_v10.sql
GOTO …
}  Top-5 results syntax
SELECT title,
ts_rank(tsv_document, q) AS rank -- value between 0 and 1
FROM
fts_data_sci,
PLAINTO_TSQUERY('correlation big data') AS q
ORDER BY rank DESC
LIMIT 5;
}  Syntax for ts_rank_cd (ts rank with cover density) is the
same
Weighted, ranked document searches
}  Top-5 weighted results syntax
SELECT title,
ts_rank(tsv_weight_document, q) AS rank -- weighted column
FROM
fts_data_sci,
PLAINTO_TSQUERY('correlation big data') AS q
ORDER BY rank DESC
LIMIT 5;
}  The only difference is using the weighted tsvector
}  Could also have built a weighted tsvector at query time.
Weighted, ranked document searches
}  Generate weighted, ranked document searches with
50_FTS_weighted_ranked_results_v10.sql
RETURN from …
}  We have used English language with default parser
(tokenizer), stop-word list and dictionary.
}  The combination is a TEXT SEARCH CONFIGURATION
}  The default is pg_catalog.english
}  SHOW default_text_search_config; to see
}  We created tsvectors (weighted and unweighted) using
default and custom triggers plus manually
Pause … with all default configuration
We can highlight matches with ts_headline
}  TS_HEADLINE returns text fragment(s) surrounding
matching terms with HTML tags
}  Default is a single snippet with <b>matching_term</b>
}  Search for PLAINTO_TSQUERY('liberty justice happy')
Display fragments with matching terms
}  How many fragments? MaxFragments
}  What comes between fragments? FragmentDelimiter
}  How many surrounding words? MinWords / MaxWords
}  Which HTML tags highlight terms? StartSel / StopSel
SELECT TS_HEADLINE(document, q,
'StartSel="<font color=red><b>",
StopSel="</b></font>",
MaxFragments=10,
MinWords=5, MaxWords=10,
FragmentDelimiter=" ...<br>..."')
Configure TS_HEADLINE to improve display
Q: Which American history documents contain 'liberty',
'justice' and 'happy'?
SELECT
'Document title: <i>' || title || '</i><br><br>' ||
TS_HEADLINE(document, q,
'StartSel="<font color=red><b>",
StopSel="</b></font>",
MaxFragments=10,
MinWords=5, MaxWords=10,
FragmentDelimiter=" ...<br>..."')
FROM fts_amer_hist,
PLAINTO_TSQUERY('liberty justice happy') AS q
WHERE tsv_document @@ q
ORDER BY TS_RANK(tsv_document, q) DESC;
Well formatted ts_headline results
Very nice …
GOTO check out the 4 matching documents
Q: What has FTS gotten us right out of the box?
A: Directly loaded documents that are automatically indexed,
weighted and maintained in a form that supports fast natural
language(ish) queries with ranked results plus well-formatted
document fragments with highlighted matches.
Which is to say, a lot!
}  Create a custom TEXT SEARCH DICTIONARY
}  Customize the stop word list based on frequency counts
}  Modify queries at run-time to remove terms and/or use
synonyms with TS_REWRITE
}  Create a tool to suggest spelling corrections for query terms
Customizing FTS
}  Defines the language, stopwords, dictfile and other options
for TO_TSVECTOR and TSQUERY related functions
}  Custom dictionaries are based on a template
}  pg_catalog.english is the default configuration; it uses the
english_stem dictionary
}  SHOW default_text_search_config;
}  Uses files in $PGSHAREDIR/tsearch_data
$PGSHAREDIR=$PG_HOME/share/postgresql
}  Option STOPWORDS=english references
$PGSHAREDIR/tsearch_data/english.stop
}  NOTE: Must 'touch' a TS DICT after each file change with
ALTER TEXT SEARCH DICTIONARY
Custom TEXT SEARCH DICTIONARY
}  TS_STAT(sqlquery) returns, for each lexeme in the queried tsvector column,
}  ndoc the number of documents a lexeme appears in
}  nentry the number of times a lexeme appears
}  This is useful to identify candidate stop words that appear
too frequently to be effective discriminators
}  TS_LEXIZE('dictionary', 'word')
}  Useful to test if the custom dictionary is working as planned
FTS helpful utility functions
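As a quick illustration of TS_LEXIZE, the calls below use the built-in english_stem dictionary. Note that the first argument is a dictionary name, not a configuration name such as english.

```sql
-- A normal word comes back as an array holding its lexeme.
SELECT ts_lexize('english_stem', 'stars');   -- {star}

-- A stop word comes back as an empty array.
SELECT ts_lexize('english_stem', 'a');       -- {}
```

A NULL result would mean the dictionary does not recognize the token at all, which is how a configuration knows to try the next dictionary in its list.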
}  TEXT SEARCH DICTIONARYs change the tsvector,
TS_REWRITE changes the tsquery at SQL run-time
}  Look up tsquery substitution values in a table of:
term TSQUERY
alias TSQUERY
}  Used for aliases or stop words, by substituting ''
}  Ex. 'include' as an alias for 'contain'
'data' as a stop word
INSERT INTO fts_alias VALUES
('contain'::TSQUERY, 'contain | include'::TSQUERY),
('data'::TSQUERY, ''::TSQUERY);
Change the TSQUERY w/TS_REWRITE
}  Create custom dictionary and stop words plus query
rewrite
60_FTS_stop_words_custom_dictionary_and_query_v10.sql
GOTO …
}  Find words that appear frequently and that appear in multiple
documents
SELECT word,
nentry AS appears_N_times,
ndoc AS appears_in_N_docs
FROM TS_STAT(
'SELECT tsv_weight_document FROM fts_data_sci')
-- weighted and un weighted tsvector are equiv
ORDER BY
nentry DESC,
word;
Identify candidate stop words
}  Create custom TEXT SEARCH DICTIONARY
-- DROP TEXT SEARCH DICTIONARY IF EXISTS public.stopper_dict;
CREATE TEXT SEARCH DICTIONARY public.stopper_dict (
TEMPLATE = pg_catalog.simple,
STOPWORDS = english
);
}  Add stop words to
$PGSHAREDIR/tsearch_data/english.stop
}  'Touch' the dictionary after each change to take effect
ALTER TEXT SEARCH DICTIONARY public.stopper_dict (
STOPWORDS = english
);
Create TEXT SEARCH DICTIONARY
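A dictionary only takes effect once a text search configuration maps token types to it. The sketch below is one possible wiring, not the presentation's script: stopper_pass_dict and stopper_english are hypothetical names, and ACCEPT = false is an assumption added so that non-stop words are passed on to the stemmer instead of being accepted unstemmed.

```sql
-- Test the slide's dictionary directly.
SELECT ts_lexize('public.stopper_dict', 'The');    -- {} : filtered as a stop word
SELECT ts_lexize('public.stopper_dict', 'Stars');  -- {stars} : lower-cased, not stemmed

-- Hypothetical variant that hands non-stop words to the next dictionary.
CREATE TEXT SEARCH DICTIONARY public.stopper_pass_dict (
    TEMPLATE  = pg_catalog.simple,
    STOPWORDS = english,
    ACCEPT    = false
);

-- Hypothetical configuration: a copy of english with the custom stop word
-- filter placed in front of the stemmer for plain word tokens.
CREATE TEXT SEARCH CONFIGURATION public.stopper_english (
    COPY = pg_catalog.english
);

ALTER TEXT SEARCH CONFIGURATION public.stopper_english
    ALTER MAPPING FOR asciiword, word
    WITH public.stopper_pass_dict, pg_catalog.english_stem;

-- Words added to english.stop should now disappear from the result.
SELECT to_tsvector('public.stopper_english', 'The data contains many stars');
```

With the default ACCEPT = true, the simple template accepts every non-stop word itself, so a stemmer listed after it would never be consulted.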
CREATE TABLE fts_alias (
term TSQUERY PRIMARY KEY,
alias TSQUERY
);
}  Add term aliases and stop words
INSERT INTO fts_alias VALUES
('contain'::TSQUERY, 'contain | include'::TSQUERY),
('data'::TSQUERY, ''::TSQUERY);
}  Use the alias table with TS_REWRITE
SELECT TS_REWRITE(('data & information & contain')::TSQUERY,
'SELECT * FROM fts_alias');
-- 'information' & ( 'include' | 'contain' )
Create table for TS_REWRITE
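In a real search the rewritten query would feed straight into the match and rank. A sketch, assuming the fts_alias table above and the fts_data_sci table with title and tsv_document columns; the search terms are illustrative.

```sql
SELECT title,
       ts_rank(tsv_document, q) AS rank
FROM fts_data_sci,
     -- Rewrite the user's query with the alias table before matching.
     ts_rewrite(to_tsquery('english', 'data & information & contain'),
                'SELECT term, alias FROM fts_alias') AS q
WHERE tsv_document @@ q
ORDER BY rank DESC
LIMIT 5;
```

One caveat: a ::TSQUERY cast does not stem words, so alias entries are compared exactly as stored. Building the entries with TO_TSQUERY would store stemmed lexemes such as 'includ', which is the form that appears in the tsvectors.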
}  Create custom dictionary and stop words plus query
rewrite
60_FTS_stop_words_custom_dictionary_and_query_v10.sql
RETURN from…
}  Use trigrams (the pg_trgm extension) with a list of
words in the documents to identify words that are close
to input queries as suggestions for misspelled terms
}  Step 1: create a table of words
CREATE TABLE fts_word AS
(SELECT word
FROM TS_STAT(
'SELECT tsv_document FROM fts_amer_hist'))
UNION
(SELECT word
FROM TS_STAT(
'SELECT tsv_document FROM fts_data_sci')
);
Query-term spelling suggestions with TriGrams
}  Step 2: create an index (requires the pg_trgm extension)
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_fts_word ON fts_word USING
GIN(word gin_trgm_ops);
}  Step 3: query for 'close' terms that exist in the corpus
SELECT word, sml
FROM fts_word,
SIMILARITY(word, 'asymetric') AS sml
WHERE sml > 0.333
-- arbitrary value to filter results
Check it out in action …
Query-term spelling suggestions with TriGrams
}  SELECT word, sml
FROM fts_word,
SIMILARITY(word, 'asymetric') AS sml
-- 'asymmetric' is the correct spelling
WHERE sml > 0.333 -- arbitrary value to filter results
ORDER BY sml DESC, word;
word       sml
========== ========
asymetr    0.636364
asymmetri  0.538462
asymmetr   0.461538
metric     0.416667
Suggested spelling with TriGrams
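A variant of the same query, sketched with pg_trgm's % operator. Filtering on the result of SIMILARITY() cannot use the trigram index, whereas the % operator can; the threshold it applies is set per session with set_limit(). This assumes the fts_word table and index from the previous steps.

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Set the similarity threshold used by the % operator in this session.
SELECT set_limit(0.333);

-- % is true when similarity(word, 'asymetric') exceeds the threshold,
-- and it can be answered from the GIN trigram index.
SELECT word,
       similarity(word, 'asymetric') AS sml
FROM fts_word
WHERE word % 'asymetric'
ORDER BY sml DESC, word
LIMIT 5;
```

PostgreSQL 9.6 and later also expose the threshold as the pg_trgm.similarity_threshold setting.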
}  Infrastructure and data wrangling
}  Creating FTS tables with maintenance triggers
and load our data
}  Compare FTS with traditional SQL searches and run FTSs
on documents from early American History
}  Rank search results on documents from Data Science
}  Generate HTML-tagged fragments with matching terms
}  Customize the stop-word dictionary
}  Suggest spelling options for query terms
}  Re-write queries at run-time
All slides, data and scripts are on the PGConf web site
Summary and review
LIFE. LIBERTY. TECHNOLOGY.
Freedom Consulting Group is a talented, hard-working,
and committed partner, providing hardware, software
and database development and integration services
to a diverse set of clients.
POSTGRES innovation: thousands of developers, fast development cycles, low cost, no vendor lock-in, advanced features
ENTERPRISE reliability: 24/7 support, services & training, enterprise-class features, tools & compatibility, indemnification, product road-map, control
Enabling commercial adoption of Postgres
Are there any Questions or follow up?
freedomconsultinggroup.com/jhanson
Freedom Consulting Group
www.freedomconsultinggroup.com
Jamey Hanson
jhanson@freedomconsultinggroup.com
jamesphanson@yahoo.com

Mais conteúdo relacionado

Mais procurados

Secondary Index Search in InnoDB
Secondary Index Search in InnoDBSecondary Index Search in InnoDB
Secondary Index Search in InnoDBMIJIN AN
 
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEOTricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEOAltinity Ltd
 
LINE LIVE のチャットが
30,000+/min のコメント投稿を捌くようになるまで
LINE LIVE のチャットが
30,000+/min のコメント投稿を捌くようになるまでLINE LIVE のチャットが
30,000+/min のコメント投稿を捌くようになるまで
LINE LIVE のチャットが
30,000+/min のコメント投稿を捌くようになるまでLINE Corporation
 
Osc2015北海道 札幌my sql勉強会_波多野_r3
Osc2015北海道 札幌my sql勉強会_波多野_r3Osc2015北海道 札幌my sql勉強会_波多野_r3
Osc2015北海道 札幌my sql勉強会_波多野_r3Nobuhiro Hatano
 
[오픈소스컨설팅]Tomcat6&7 How To
[오픈소스컨설팅]Tomcat6&7 How To[오픈소스컨설팅]Tomcat6&7 How To
[오픈소스컨설팅]Tomcat6&7 How ToJi-Woong Choi
 
OSMC 2022 | Logstash, Beats, Elastic Agent, Open Telemetry — what’s the right...
OSMC 2022 | Logstash, Beats, Elastic Agent, Open Telemetry — what’s the right...OSMC 2022 | Logstash, Beats, Elastic Agent, Open Telemetry — what’s the right...
OSMC 2022 | Logstash, Beats, Elastic Agent, Open Telemetry — what’s the right...NETWAYS
 
MariaDB Columnstore 使いこなそう
MariaDB Columnstore 使いこなそうMariaDB Columnstore 使いこなそう
MariaDB Columnstore 使いこなそうKAWANO KAZUYUKI
 
Apache Airflow Introduction
Apache Airflow IntroductionApache Airflow Introduction
Apache Airflow IntroductionLiangjun Jiang
 
pg_walinspectについて調べてみた!(第37回PostgreSQLアンカンファレンス@オンライン 発表資料)
pg_walinspectについて調べてみた!(第37回PostgreSQLアンカンファレンス@オンライン 発表資料)pg_walinspectについて調べてみた!(第37回PostgreSQLアンカンファレンス@オンライン 発表資料)
pg_walinspectについて調べてみた!(第37回PostgreSQLアンカンファレンス@オンライン 発表資料)NTT DATA Technology & Innovation
 
AWS Lambdaによるデータ処理理の⾃自動化とコモディティ化
AWS Lambdaによるデータ処理理の⾃自動化とコモディティ化AWS Lambdaによるデータ処理理の⾃自動化とコモディティ化
AWS Lambdaによるデータ処理理の⾃自動化とコモディティ化Amazon Web Services Japan
 
cLoki: Like Loki but for ClickHouse
cLoki: Like Loki but for ClickHousecLoki: Like Loki but for ClickHouse
cLoki: Like Loki but for ClickHouseAltinity Ltd
 
Elk devops
Elk devopsElk devops
Elk devopsIdeato
 
PostgreSQL major version upgrade using built in Logical Replication
PostgreSQL major version upgrade using built in Logical ReplicationPostgreSQL major version upgrade using built in Logical Replication
PostgreSQL major version upgrade using built in Logical ReplicationAtsushi Torikoshi
 
PostgreSQLアーキテクチャ入門(INSIGHT OUT 2011)
PostgreSQLアーキテクチャ入門(INSIGHT OUT 2011)PostgreSQLアーキテクチャ入門(INSIGHT OUT 2011)
PostgreSQLアーキテクチャ入門(INSIGHT OUT 2011)Uptime Technologies LLC (JP)
 
S3をDB利用 ショッピングセンター向けポイントシステム概要
S3をDB利用 ショッピングセンター向けポイントシステム概要S3をDB利用 ショッピングセンター向けポイントシステム概要
S3をDB利用 ショッピングセンター向けポイントシステム概要一成 田部井
 
2023 COSCUP - Whats new in PostgreSQL 16
2023 COSCUP - Whats new in PostgreSQL 162023 COSCUP - Whats new in PostgreSQL 16
2023 COSCUP - Whats new in PostgreSQL 16José Lin
 
ストリーム処理におけるApache Avroの活用について(NTTデータ テクノロジーカンファレンス 2019 講演資料、2019/09/05)
ストリーム処理におけるApache Avroの活用について(NTTデータ テクノロジーカンファレンス 2019 講演資料、2019/09/05)ストリーム処理におけるApache Avroの活用について(NTTデータ テクノロジーカンファレンス 2019 講演資料、2019/09/05)
ストリーム処理におけるApache Avroの活用について(NTTデータ テクノロジーカンファレンス 2019 講演資料、2019/09/05)NTT DATA Technology & Innovation
 

Mais procurados (20)

Secondary Index Search in InnoDB
Secondary Index Search in InnoDBSecondary Index Search in InnoDB
Secondary Index Search in InnoDB
 
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEOTricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
Tricks every ClickHouse designer should know, by Robert Hodges, Altinity CEO
 
PostreSQL監査
PostreSQL監査PostreSQL監査
PostreSQL監査
 
LINE LIVE のチャットが
30,000+/min のコメント投稿を捌くようになるまで
LINE LIVE のチャットが
30,000+/min のコメント投稿を捌くようになるまでLINE LIVE のチャットが
30,000+/min のコメント投稿を捌くようになるまで
LINE LIVE のチャットが
30,000+/min のコメント投稿を捌くようになるまで
 
Osc2015北海道 札幌my sql勉強会_波多野_r3
Osc2015北海道 札幌my sql勉強会_波多野_r3Osc2015北海道 札幌my sql勉強会_波多野_r3
Osc2015北海道 札幌my sql勉強会_波多野_r3
 
[오픈소스컨설팅]Tomcat6&7 How To
[오픈소스컨설팅]Tomcat6&7 How To[오픈소스컨설팅]Tomcat6&7 How To
[오픈소스컨설팅]Tomcat6&7 How To
 
OSMC 2022 | Logstash, Beats, Elastic Agent, Open Telemetry — what’s the right...
OSMC 2022 | Logstash, Beats, Elastic Agent, Open Telemetry — what’s the right...OSMC 2022 | Logstash, Beats, Elastic Agent, Open Telemetry — what’s the right...
OSMC 2022 | Logstash, Beats, Elastic Agent, Open Telemetry — what’s the right...
 
MariaDB Columnstore 使いこなそう
MariaDB Columnstore 使いこなそうMariaDB Columnstore 使いこなそう
MariaDB Columnstore 使いこなそう
 
Apache Airflow Introduction
Apache Airflow IntroductionApache Airflow Introduction
Apache Airflow Introduction
 
Elk stack
Elk stackElk stack
Elk stack
 
pg_walinspectについて調べてみた!(第37回PostgreSQLアンカンファレンス@オンライン 発表資料)
pg_walinspectについて調べてみた!(第37回PostgreSQLアンカンファレンス@オンライン 発表資料)pg_walinspectについて調べてみた!(第37回PostgreSQLアンカンファレンス@オンライン 発表資料)
pg_walinspectについて調べてみた!(第37回PostgreSQLアンカンファレンス@オンライン 発表資料)
 
ClickHouse Keeper
ClickHouse KeeperClickHouse Keeper
ClickHouse Keeper
 
AWS Lambdaによるデータ処理理の⾃自動化とコモディティ化
AWS Lambdaによるデータ処理理の⾃自動化とコモディティ化AWS Lambdaによるデータ処理理の⾃自動化とコモディティ化
AWS Lambdaによるデータ処理理の⾃自動化とコモディティ化
 
cLoki: Like Loki but for ClickHouse
cLoki: Like Loki but for ClickHousecLoki: Like Loki but for ClickHouse
cLoki: Like Loki but for ClickHouse
 
Elk devops
Elk devopsElk devops
Elk devops
 
PostgreSQL major version upgrade using built in Logical Replication
PostgreSQL major version upgrade using built in Logical ReplicationPostgreSQL major version upgrade using built in Logical Replication
PostgreSQL major version upgrade using built in Logical Replication
 
PostgreSQLアーキテクチャ入門(INSIGHT OUT 2011)
PostgreSQLアーキテクチャ入門(INSIGHT OUT 2011)PostgreSQLアーキテクチャ入門(INSIGHT OUT 2011)
PostgreSQLアーキテクチャ入門(INSIGHT OUT 2011)
 
S3をDB利用 ショッピングセンター向けポイントシステム概要
S3をDB利用 ショッピングセンター向けポイントシステム概要S3をDB利用 ショッピングセンター向けポイントシステム概要
S3をDB利用 ショッピングセンター向けポイントシステム概要
 
2023 COSCUP - Whats new in PostgreSQL 16
2023 COSCUP - Whats new in PostgreSQL 162023 COSCUP - Whats new in PostgreSQL 16
2023 COSCUP - Whats new in PostgreSQL 16
 
ストリーム処理におけるApache Avroの活用について(NTTデータ テクノロジーカンファレンス 2019 講演資料、2019/09/05)
ストリーム処理におけるApache Avroの活用について(NTTデータ テクノロジーカンファレンス 2019 講演資料、2019/09/05)ストリーム処理におけるApache Avroの活用について(NTTデータ テクノロジーカンファレンス 2019 講演資料、2019/09/05)
ストリーム処理におけるApache Avroの活用について(NTTデータ テクノロジーカンファレンス 2019 講演資料、2019/09/05)
 

Semelhante a Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

Extractive Document Summarization - An Unsupervised Approach
Extractive Document Summarization - An Unsupervised ApproachExtractive Document Summarization - An Unsupervised Approach
Extractive Document Summarization - An Unsupervised ApproachFindwise
 
The life of a query (oracle edition)
The life of a query (oracle edition)The life of a query (oracle edition)
The life of a query (oracle edition)maclean liu
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAsad Abbas
 
ELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdf
ELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdfELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdf
ELK-Stack-Essential-Concepts-TheELKStack-LunchandLearn.pdfcadejaumafiq
 
Defining Viewpoints for Ontology-Based DSLs
Defining Viewpoints for Ontology-Based DSLsDefining Viewpoints for Ontology-Based DSLs
Defining Viewpoints for Ontology-Based DSLsObeo
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to ElasticsearchIsmaeel Enjreny
 
Data Structure.pptx
Data Structure.pptxData Structure.pptx
Data Structure.pptxSajalFayyaz
 
Overview of query evaluation
Overview of query evaluationOverview of query evaluation
Overview of query evaluationavniS
 
MIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome MeasuresMIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome MeasuresSteven Johnson
 
Search engine. Elasticsearch
Search engine. ElasticsearchSearch engine. Elasticsearch
Search engine. ElasticsearchSelecto
 
將 Open Data 放上 Open Source Platforms: 開源資料入口平台 CKAN 開發經驗分享
將 Open Data 放上 Open Source Platforms: 開源資料入口平台 CKAN 開發經驗分享將 Open Data 放上 Open Source Platforms: 開源資料入口平台 CKAN 開發經驗分享
將 Open Data 放上 Open Source Platforms: 開源資料入口平台 CKAN 開發經驗分享Chengjen Lee
 
A Review of Data Access Optimization Techniques in a Distributed Database Man...
A Review of Data Access Optimization Techniques in a Distributed Database Man...A Review of Data Access Optimization Techniques in a Distributed Database Man...
A Review of Data Access Optimization Techniques in a Distributed Database Man...Editor IJCATR
 
A Review of Data Access Optimization Techniques in a Distributed Database Man...
A Review of Data Access Optimization Techniques in a Distributed Database Man...A Review of Data Access Optimization Techniques in a Distributed Database Man...
A Review of Data Access Optimization Techniques in a Distributed Database Man...Editor IJCATR
 

Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)

  • 1. Rank Your Results: Using Full Text Search with Natural Language Queries in PostgreSQL to get Ranked Results Jamey Hanson jhanson@freedomconsultinggroup.com jamesphanson@yahoo.com Freedom Consulting Group http://www.freedomconsultinggroup.com PGConf US, NYC March 26, 2015
  • 2. What is PostgreSQL Full Text Search?
       "Full Text Searching (or just text search) provides the capability to identify natural-language documents that satisfy a query, and optionally to sort them by relevance to the query."
       - PostgreSQL 9.4 documentation, section 12.1
  • 3. What makes Full Text Search so useful?
       - Focus on semantics, rather than syntax
       - Keep documents within PostgreSQL
         - Apache Solr, Lucene, Sphinx etc. require their own copy of the data
       - Simple to keep indexes up to date
       - ~20x faster than SQL search (LIKE, ILIKE, etc.) with default configuration
       - Fast enough for nearly all applications
  • 4. What makes Full Text Search so useful?
       - Traditional search reveals document existence, ranked results indicate relevance
       - Customers expect all searches to include rank
       - FTS includes the full suite of PG query tools such as SQL, regex, LIKE/ILIKE, wildcards and function-based indexes
       - FTS parser, stop word list, synonym, thesaurus and language are all customizable at the statement level
       - FTS is extensible with 3rd party dictionaries and more
  • 5. Agenda
       - Infrastructure and data wrangling
       - Creating FTS tables with maintenance triggers
       - Compare FTS with traditional SQL searches and run FTSs on documents from early American History
       - Rank search results on documents from Data Science
       - Generate HTML-tagged fragments with matching terms
       - Customize the stop-word dictionary
       - Suggest spelling options for query terms
       - Re-write queries at run-time
       I will move between slides, demonstrations and SQL scripts. We will not review every slide in the file.
  • 6. About the author
       Jamey Hanson
       jhanson@freedomconsultinggroup.com
       jamesphanson@yahoo.com
       - Manage a team for Freedom Consulting Group migrating applications from Oracle to Postgres Plus Advanced Server and PostgreSQL in the government space. We are subcontracting to EnterpriseDB
       - Overly certified: PMP, CISSP, CSEP, OCP in 5 versions of Oracle, Cloudera developer & admin. Used to be NetApp admin and MCSE. I teach PMP and CISSP at the Univ. MD training center
       - Alumnus of multiple schools and was C-130 aircrew
  • 7. Presentation infrastructure …
       - PostgreSQL 9.4.1 EnterpriseDB free package
       - CentOS 7.0 VM with 2GB RAM and 2 CPU cores
       - 2 sets of documents to search …
         - Primary documents in American History: The American Revolution and the New Nation http://www.loc.gov/rr/program/bib/ourdocs/NewNation.html (Library of Congress)
         - Text from Data Science Girl's August 15, 2014 blog post "38 Seminal Articles Every Data Scientist Should Read" http://www.datasciencecentral.com/profiles/blogs/30-seminal-articles-every-data-scientist-should-read
  • 8. … Presentation infrastructure
       - Used pgAdmin3 for SQL and administration
       - A few Linux shell commands to manage the files
       - American history documents were cut & pasted from the Web into MS Notepad
       - Data Science .pdf files were downloaded, converted to text with pdftotext and manually divided into abstract and body files
  • 9. How does FTS work?
       - FTS is built on lexemes, which are (essentially) word roots without tense, possessive, plurality or other ending.
         "It is a basic unit of meaning, and the headwords of a dictionary are all lexemes" (The Cambridge Encyclopedia of the English Language)
       - For example ...
         - The lexemes of jump, jumps, jumped and jumping are all 'jump'
         - Excite, excites, exciting and excited are all 'excit'
       - Lexemes are stored in lower case (i.e. case insensitive)
  • 10. How does FTS work?
       - Lexemes are organized into TSVECTORs, which are sorted arrays of lexemes with associated position and (optionally) weight. Documents are stored as TSVECTORs
       - Query against TSVECTORs using TSQUERYs, which are arrays of lexemes with BOOLEAN operators but without position or weight
       - Match a TSQUERY to a TSVECTOR with the @@ operator
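The matching rule on this slide can be tried without any tables. A minimal sketch, assuming only the built-in `english` configuration; the sentence is the one used later in the deck and the column aliases are illustrative:

```sql
-- A document matches when the tsquery evaluates to true against the tsvector.
SELECT TO_TSVECTOR('english', 'We hold these truths to be self evident')
           @@ TO_TSQUERY('english', 'truth & evident') AS both_terms,    -- true
       TO_TSVECTOR('english', 'We hold these truths to be self evident')
           @@ TO_TSQUERY('english', 'truth & liberty') AS missing_term;  -- false
```

Both sides are reduced to lexemes before comparison, so 'evident' in the query matches 'evid' in the vector.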
  • 11. How does TO_TSVECTOR work?
       1. Parses the text document into tokens using white space, non-printing characters and punctuation
       2. Assigns a class (i.e. type) to each token. The 23 classes include word, email, number, URL, etc.
       3. 'Word' tokens are normalized into lexemes using dictionaries
       4. Lexemes are processed to …
          a. Remove stop words (common words such as 'and', 'or', 'the')
          b. Add synonyms
          c. Add phrase matching
       5. Lexemes are assembled into TSVECTORs by noting the position, recording weight and removing duplicates
       This process is controlled by the TEXT SEARCH CONFIGURATION and its TEXT SEARCH DICTIONARYs
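The pipeline above can be inspected step by step with two built-in functions. This is a sketch for exploration, assuming the default parser and the `english` configuration; the sample sentence is made up:

```sql
-- Steps 1-2: the token classes known to the default parser
SELECT tokid, alias, description
FROM ts_token_type('default');

-- Steps 3-4: trace each token through the configuration. 'alias' is the token
-- class, 'dictionary' is the dictionary that recognized the token, and
-- 'lexemes' is empty for stop words.
SELECT alias, token, dictionary, lexemes
FROM ts_debug('english', 'The enumerated powers at http://www.loc.gov');
```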
  • 12. How does FTS match documents?
       - TSVECTORs are compared to TSQUERYs with the @@ operator
       - TSQUERYs are built with the TO_TSQUERY or PLAINTO_TSQUERY functions …
       Never mind … let's jump to some examples, which are much easier to understand.
  • 13. GOTO …
       - Explore TSVECTORs and TSQUERYs: 00_FTS_explore_tsvector_tsquery_v10.sql
  • 14. TSVECTOR and TSQUERY
       -- What do lexemes look like?
       SELECT TO_TSVECTOR('enumerate')   AS enumerate,
              TO_TSVECTOR('enumerated')  AS enumerated,
              TO_TSVECTOR('enumerates')  AS enumerates,
              TO_TSVECTOR('enumerating') AS enumerating,
              TO_TSVECTOR('enumeration') AS enumeration;
       -- all forms of the word have the same lexeme, 'enumer'

       -- Example tsvector
       SELECT TO_TSVECTOR('We hold these truths to be self evident');
       -- 'evid':8 'hold':2 'self':7 'truth':4
       -- tsvectors are sorted arrays of lexemes with position and (optionally) weight
       -- notice that common words, a.k.a. stop words, like 'to' and 'be' are not included
  • 15. TSVECTOR and TSQUERY
       -- tsquerys are compared with tsvectors to find matching documents
       -- they are composed of lexemes and logical operators
       SELECT TO_TSQUERY('with & liberty & and & justice & for & all');
       -- 'liberti' & 'justic'
       -- Notice that stop words are not included in tsquerys either

       -- can also use PLAINTO_TSQUERY with plain(ish) text
       SELECT PLAINTO_TSQUERY('With liberty and justice for all');
       -- 'liberti' & 'justic'
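Beyond `&`, TO_TSQUERY also accepts `|` (OR), `!` (NOT), parentheses and prefix matching. A short sketch, assuming the `english` configuration; the search terms are illustrative:

```sql
-- AND, OR and NOT can be combined and grouped with parentheses
SELECT TO_TSQUERY('english', 'liberty & (justice | freedom) & !tyranny');

-- ':*' asks for prefix matching: any lexeme that starts with 'enumer'
SELECT TO_TSQUERY('english', 'enumer:*');

-- PLAINTO_TSQUERY never interprets operators; it simply ANDs the remaining words
SELECT PLAINTO_TSQUERY('english', 'liberty or justice');
```

TO_TSQUERY raises a syntax error when given words with no operator between them, which is why PLAINTO_TSQUERY is usually the safer choice for raw user input.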
  • 16. RETURN from …
       - Explore TSVECTORs and TSQUERYs: 00_FTS_explore_tsvector_tsquery_v10.sql
  • 17. How do we create TSVECTORS?
       - Created at run-time with TO_TSVECTOR
         + simple, does not require storage
         - slower queries than pre-computing
       - Created ahead of time with TO_TSVECTOR
         + fast queries, flexible, does not slow ingestion, less CPU work
         - can leave TEXT and TSVECTOR out of sync, may not get done
       - Created ahead of time with a trigger
         + fast queries, TSVECTOR always up to date
         - slows ingestion, UPDATE trigger fires even on small changes
       - Two trigger functions are included: tsvector_update_trigger & …_column
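The trigger option can be sketched with the built-in `tsvector_update_trigger` function. The table `fts_trigger_demo` and the trigger name are hypothetical, chosen so the example does not touch the talk's own tables:

```sql
-- Hypothetical demo table: raw text plus a pre-computed tsvector
CREATE TABLE fts_trigger_demo (
    id           serial PRIMARY KEY,
    document     text,
    tsv_document tsvector
);

-- Arguments: target tsvector column, configuration name, then source text column(s)
CREATE TRIGGER trg_fts_trigger_demo_tsv
    BEFORE INSERT OR UPDATE ON fts_trigger_demo
    FOR EACH ROW
    EXECUTE PROCEDURE tsvector_update_trigger(tsv_document, 'pg_catalog.english', document);

INSERT INTO fts_trigger_demo (document)
VALUES ('We hold these truths to be self evident');

SELECT tsv_document FROM fts_trigger_demo;
-- 'evid':8 'hold':2 'self':7 'truth':4
```

The configuration name must be schema-qualified here so that the trigger's behavior does not change with the session's search_path.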
  • 18. How to make FTS wickedly fast?
       - GIN (Generalized Inverted iNdex)
       - GiST (Generalized Search Tree)

                        GIN            GiST
       Speed            3x faster      Slower
       Size             3x bigger      Smaller
       Weighted TSV     Unsupported    Supported
       Build speed      Slower         3x faster
       Best practice    Static data    Updated data

       See Tomas Vondra's 17-Mar-15 Planet PostgreSQL post on FTS performance for details
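Either index type is created directly on the tsvector column. A minimal sketch; the table `fts_index_demo` and the index names are hypothetical:

```sql
-- Hypothetical demo table with a pre-computed tsvector column
CREATE TABLE fts_index_demo (
    id           serial PRIMARY KEY,
    document     text,
    tsv_document tsvector
);

-- GIN: faster lookups, larger and slower to build
CREATE INDEX idx_fts_index_demo_gin ON fts_index_demo USING GIN (tsv_document);

-- GiST alternative: smaller and faster to build, slower to search
-- CREATE INDEX idx_fts_index_demo_gist ON fts_index_demo USING GiST (tsv_document);

-- The @@ operator is what lets the planner consider the index
EXPLAIN
SELECT id FROM fts_index_demo
WHERE tsv_document @@ TO_TSQUERY('english', 'liberty');
```

On a table this small the planner may still choose a sequential scan; the index pays off as the corpus grows.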
  • 19. Let's build our FTS tables: GOTO …
       - Build our FTS tables using 20_FTS_DDL_v10.sql
  • 20. RETURN from …
       - Build our FTS tables using 20_FTS_DDL_v10.sql
  • 21. Loading documents for FTS
       - Load text documents from the database host using pg_read_file
         - pg_read_binary_file for .pdf files
         - Files must be in $PGDATA, but symbolic links work
         - Syntax is: (SELECT * FROM pg_read_file('path/from/$PGDATA/file.txt'))
       - Weighted searches require that the document is divided into sections. Details forthcoming
       - Can dynamically generate SQL load scripts using pg_ls_dir or run a script from psql
  • 22. GOTO …
       - Load our FTS tables using 30_FTS_Load_v10.sql
       - Update title, author and URL fields with 32_FTS_Update_Titles_v10.sql
  • 23. Load text and .pdf documents
       - Load .txt and .pdf files from within $PGDATA
       - We divided the Data Science documents into abstract and body so that they can be weighted for weighted rank queries
       - TSVECTORs are created by the BIU (before insert or update) trigger
       - Manually updated fts_data_sci.tsv_document just to show how it is done
       - The update script populates title, author and URL fields.
  • 24. Dynamic SQL to load files
       -- Create dynamic SQL to load fts_amer_hist
       WITH list_of_files AS (
         SELECT pg_ls_dir('Dropbox/FTS/AmHistory/') AS file_name
       )
       SELECT 'INSERT INTO fts.fts_amer_hist (document, filename) VALUES ( (SELECT * FROM pg_read_file(''Dropbox/FTS/AmHistory/' || file_name || ''')), ''' || file_name || '''); '
       FROM list_of_files
       ORDER BY file_name;

       -- generates (example shown for fts_data_sci)
       INSERT INTO fts.fts_data_sci (abstract, body, document, pdf_file, pdf_filename)
       VALUES (
         (SELECT * FROM pg_read_file('Dropbox/FTS/DataScience/WhatMapReduceCanDo_AB.txt')),
         (SELECT * FROM pg_read_file('Dropbox/FTS/DataScience/WhatMapReduceCanDo_BD.txt')),
         (SELECT * FROM pg_read_file('Dropbox/FTS/DataScience/WhatMapReduceCanDo.txt')),
         (SELECT * FROM pg_read_binary_file('Dropbox/FTS/DataScience/WhatMapReduceCanDo.pdf')),
         'WhatMapReduceCanDo.pdf');
  • 25. RETURN from …
       - Load our FTS tables using 30_FTS_Load_v10.sql
       - Update details with 32_FTS_Update_Titles_v10.sql
  • 26. Enough with the setup … show me FTS! GOTO …
       - See FTS in action with 40_FTS_explore_fts_amer_hist_v10.sql
  • 27. Explore fts_amer_hist
       - Compare SQL ILIKE searches with FTS*
         - See how ILIKE misses documents with different word forms, such as 'enumerate' vs. 'enumeration'
         - See how FTS is ~20x faster than ILIKE
       - Demonstrate that FTS excludes stop words such as 'the', 'and', & 'or'
       - Demonstrate that FTS includes BOOLEAN logic with simple syntax
       * ILIKE is "case insensitive LIKE"
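The comparison might look like the following sketch. It assumes the talk's `fts_amer_hist` table with the `title`, `document` and `tsv_document` columns that appear elsewhere in the deck; the actual demo script may differ:

```sql
-- Substring search: matches 'enumerate', 'enumerated' and 'enumerates',
-- but not 'enumeration' or 'enumerating', and cannot use a tsvector index
SELECT title
FROM fts_amer_hist
WHERE document ILIKE '%enumerate%';

-- Full text search: every word form reduces to the lexeme 'enumer'
SELECT title
FROM fts_amer_hist
WHERE tsv_document @@ TO_TSQUERY('english', 'enumerate');

-- Boolean logic uses the same operator
SELECT title
FROM fts_amer_hist
WHERE tsv_document @@ TO_TSQUERY('english', 'enumerate & !tax');
```

Running each statement under `EXPLAIN ANALYZE` (or with `\timing` in psql) is one way to reproduce the speed comparison on your own data.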
  • 28. RETURN from …
       - See FTS in action with 40_FTS_explore_amer_hist_v10.sql
  • 29. Ranking results
       - Result rank is 0 (not found) to 1 (perfect match), calculated at run time based on a search of all documents.
         - That means a top-5 query ordered by rank is slower than a plain LIMIT 5, because every matching document must be ranked before the sort
       - Two rank functions are available, TS_RANK and TS_RANK_CD
         - Both consider how often search terms appear
         - Both have an optional normalization parameter that weights the rank by the log of the size of the document
         - TS_RANK_CD also considers the proximity of search terms to each other
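The two rank functions and the normalization parameter can be compared side by side. A sketch assuming the talk's `fts_data_sci` table and its `title` and `tsv_document` columns; the query text is illustrative:

```sql
SELECT title,
       TS_RANK(tsv_document, q)       AS rank_raw,
       TS_RANK(tsv_document, q, 1)    AS rank_len_norm,  -- 1 = divide by 1 + log(document length)
       TS_RANK_CD(tsv_document, q, 1) AS rank_cover_density
FROM fts_data_sci,
     PLAINTO_TSQUERY('english', 'big data') AS q
WHERE tsv_document @@ q          -- rank only the documents that match
ORDER BY rank_len_norm DESC
LIMIT 5;
```

The `WHERE` clause lets an index narrow the candidate set first, so only matching rows pay the ranking cost.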
  • 30. Building weighted tsvectors
       - Lexemes in tsvectors can be assigned a weight of A (high) to D (low), with defaults of {1.0, 0.4, 0.2, 0.1}
       - Weighting does not affect which records are returned, only their rank
       - Weighted tsvectors are typically built by document section
         - title=A, abstract=B, body=D in our example trigger
       new.tsv_weight_document :=
         SETWEIGHT(TO_TSVECTOR('pg_catalog.english', COALESCE(new.title, '')), 'A') ||
         SETWEIGHT(TO_TSVECTOR('pg_catalog.english', COALESCE(new.abstract, '')), 'B') ||
         TO_TSVECTOR('pg_catalog.english', new.body);
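The assignment on this slide is only the body of a trigger function. A complete version might look like the sketch below; the function and trigger names are hypothetical, and the table and column names are taken from the slides:

```sql
CREATE OR REPLACE FUNCTION fts_data_sci_weight_tsv() RETURNS trigger AS $$
BEGIN
    NEW.tsv_weight_document :=
        SETWEIGHT(TO_TSVECTOR('pg_catalog.english', COALESCE(NEW.title, '')), 'A') ||
        SETWEIGHT(TO_TSVECTOR('pg_catalog.english', COALESCE(NEW.abstract, '')), 'B') ||
        -- unlabeled lexemes are treated as weight D when ranking
        TO_TSVECTOR('pg_catalog.english', COALESCE(NEW.body, ''));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_fts_data_sci_weight_tsv
    BEFORE INSERT OR UPDATE ON fts_data_sci
    FOR EACH ROW
    EXECUTE PROCEDURE fts_data_sci_weight_tsv();
```

Unlike the slide, this sketch also wraps `body` in COALESCE: concatenating a NULL tsvector makes the whole result NULL, which would silently hide the row from searches.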
  • 31. Building weighted tsvectors at query-time
       - The example tsvectors were weighted at build-time by the trigger
       - Can also build weighted tsvectors at query-time
         - More flexible because different queries can use different weights
         - Requires more code because weighting is done for every query
         - Slightly slower because the source tsvectors must be concatenated
       SETWEIGHT(TO_TSVECTOR(title), 'A') ||
       SETWEIGHT(TO_TSVECTOR(abstract), 'B') ||
       TO_TSVECTOR(body); -- default weight 'D'
  • 32. What does all this get us?
       - Search for document relevance, not just existence
       - Customers now expect, even demand, ranked results
       - The data and the business logic are inside PostgreSQL, available to any application
  • 33. GOTO …
       - Generate weighted, ranked document searches with 50_FTS_weighted_ranked_results_v10.sql
  • 34. Weighted, ranked document searches
       - Top-5 results syntax
       SELECT title,
              ts_rank(tsv_document, q) AS rank -- value between 0 and 1
       FROM fts_data_sci,
            PLAINTO_TSQUERY('corelation big data') AS q
       ORDER BY rank DESC
       LIMIT 5;
       - Syntax for ts_rank_cd (ts rank with cover density) is the same
  • 35. Weighted, ranked document searches
       - Top-5 weighted results syntax
       SELECT title,
              ts_rank(tsv_weight_document, q) AS rank -- weighted column
       FROM fts_data_sci,
            PLAINTO_TSQUERY('corelation big data') AS q
       ORDER BY rank DESC
       LIMIT 5;
       - The only difference is using the weighted tsvector
       - Could also have built a weighted tsvector at query time.
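The numeric value of each weight label can also be overridden per query by passing a weights array as the first argument to TS_RANK. A sketch assuming the talk's `fts_data_sci` table; the second array and the query text are made-up illustrations:

```sql
-- The array is ordered {D, C, B, A}
SELECT title,
       TS_RANK('{0.1, 0.2, 0.4, 1.0}'::float4[], tsv_weight_document, q) AS rank_default_weights,
       TS_RANK('{0.1, 0.2, 1.0, 1.0}'::float4[], tsv_weight_document, q) AS rank_abstract_boost
FROM fts_data_sci,
     PLAINTO_TSQUERY('english', 'big data') AS q
WHERE tsv_weight_document @@ q
ORDER BY rank_abstract_boost DESC
LIMIT 5;
```

This keeps the stored tsvector unchanged while letting different screens or users emphasize different document sections.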
  • 36. RETURN from …
       - Generate weighted, ranked document searches with 50_FTS_weighted_ranked_results_v10.sql
  • 37. Pause … with all default configuration
       - We have used the English language with the default parser (tokenizer), stop-word list and dictionary.
         - The combination is a TEXT SEARCH CONFIGURATION
         - The default is pg_catalog.english
         - SHOW default_text_search_config; to see
       - We created tsvectors (weighted and unweighted) using default and custom triggers, plus manually
  • 38. We can highlight matches with ts_headline
  • 39. Display fragments with matching terms
       - TS_HEADLINE returns text fragment(s) surrounding matching terms with HTML tags
       - Default is a single snippet with <b>matching_term</b>
       - Search for PLAINTO_TSQUERY('liberty justice happy')
  • 40. Configure TS_HEADLINE to improve display
       - How many fragments? MaxFragments
       - What comes between fragments? FragmentDelimiter
       - How many surrounding words? MinWords / MaxWords
       - Which HTML tags highlight terms? StartSel / StopSel
       SELECT TS_HEADLINE(document, q,
         'StartSel="<font color=red><b>",
          StopSel="</b></font>",
          MaxFragments=10,
          MinWords=5, MaxWords=10,
          FragmentDelimiter=" ...<br>..."')
  • 41. Well formatted ts_headline results
       Q: Which American history documents contain 'liberty', 'justice' and 'happy'?
       SELECT 'Document title: <i>' || title || '</i><br><br>' ||
              TS_HEADLINE(document, q,
                'StartSel="<font color=red><b>",
                 StopSel="</b></font>",
                 MaxFragments=10,
                 MinWords=5, MaxWords=10,
                 FragmentDelimiter=" ...<br>..."')
       FROM fts_amer_hist,
            PLAINTO_TSQUERY('liberty justice happy') AS q
       WHERE tsv_document @@ q
       ORDER BY TS_RANK(tsv_document, q) DESC;
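TS_HEADLINE re-parses the original text rather than using the stored tsvector, so it is comparatively expensive. On larger result sets, one common pattern is to rank and limit in a subquery and generate headlines only for the rows that will be displayed. A sketch using the talk's `fts_amer_hist` table; the alias `top_hits` and the option values are illustrative:

```sql
SELECT title,
       TS_HEADLINE('english', document, q,
                   'MaxFragments=3, MinWords=5, MaxWords=10') AS snippet
FROM (
    -- rank and trim first, using only the pre-computed tsvector
    SELECT title, document, q, TS_RANK(tsv_document, q) AS rank
    FROM fts_amer_hist,
         PLAINTO_TSQUERY('english', 'liberty justice happy') AS q
    WHERE tsv_document @@ q
    ORDER BY rank DESC
    LIMIT 5
) AS top_hits
ORDER BY rank DESC;
```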
  • 42. Very nice …
       GOTO: check out the 4 matching documents
  • 43. Q: What has FTS gotten us right out of the box?
       A: Directly loaded documents that are automatically indexed, weighted and maintained in a form that supports fast natural language(ish) queries with ranked results, plus well-formatted document fragments with highlighted matches.
       Which is to say, a lot!
  • 44. Customizing FTS
       - Create a custom TEXT SEARCH DICTIONARY
       - Customize the stop word list based on frequency counts
       - Modify queries at run-time to remove terms and/or use synonyms with TS_REWRITE
       - Create a tool to suggest spelling corrections for query terms
  • 45. Custom TEXT SEARCH DICTIONARY
       - Defines the language, stopwords, dictfile and other options for TO_TSVECTOR and TSQUERY related functions
       - Custom dictionaries are based on a template
       - pg_catalog.english is the default text search configuration
         - SHOW default_text_search_config;
       - Uses files in $PGSHAREDIR/tsearch_data, where $PGSHAREDIR=$PG_HOME/share/postgresql
       - Option STOPWORDS=english references $PGSHAREDIR/tsearch_data/english.stop
       - NOTE: Must 'touch' a TS DICT after each file change with ALTER TEXT SEARCH DICTIONARY
  • 46. FTS helpful utility functions
       - TS_STAT(tsvector) returns
         - ndoc: the number of documents a lexeme appears in
         - nentry: the number of times a lexeme appears
         - This is useful to identify candidate stop words that appear too frequently to be effective discriminators
       - TS_LEXIZE('dictionary', 'word')
         - Useful to test whether the custom dictionary is working as planned
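TS_LEXIZE can be tried against the built-in `english_stem` dictionary before testing a custom one. These two calls follow the examples in the PostgreSQL documentation:

```sql
-- A normal word is reduced to its lexeme
SELECT TS_LEXIZE('english_stem', 'stars');   -- {star}

-- A stop word is recognized but produces an empty array
SELECT TS_LEXIZE('english_stem', 'a');       -- {}
```

A NULL result (as opposed to an empty array) means the dictionary did not recognize the word at all and it would be passed on to the next dictionary in the configuration.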
  • 47. Change the TSQUERY w/TS_REWRITE
       - TEXT SEARCH DICTIONARYs change the tsvector, TS_REWRITE changes the tsquery at SQL run-time
       - Look up tsquery substitution values in a table of: term TSQUERY, alias TSQUERY
       - Used for aliases or stop words (by substituting '')
       - Ex. 'include' as an alias for 'contain', 'data' as a stop word
       INSERT INTO fts_alias VALUES
         ('contain'::TSQUERY, 'contain | include'::TSQUERY),
         ('data'::TSQUERY, ''::TSQUERY);
  • 48. GOTO …
       - Create custom dictionary and stop words plus query rewrite: 60_FTS_stop_words_custom_dictionary_and_query_v10.sql
  • 49. Identify candidate stop words
       - Find words that appear frequently and that appear in multiple documents
       SELECT word,
              nentry AS appears_N_times,
              ndoc AS appears_in_N_docs
       FROM TS_STAT(
         'SELECT tsv_weight_document FROM fts_data_sci')
         -- weighted and unweighted tsvectors are equivalent here
       ORDER BY nentry DESC, word;
  • 50. Create TEXT SEARCH DICTIONARY
       - Create custom TEXT SEARCH DICTIONARY
       -- DROP TEXT SEARCH DICTIONARY IF EXISTS public.stopper_dict;
       CREATE TEXT SEARCH DICTIONARY public.stopper_dict (
         TEMPLATE = pg_catalog.simple,
         STOPWORDS = english
       );
       - Add stop words to $PGSHAREDIR/tsearch_data/english.stop
       - 'Touch' the dictionary after each change for it to take effect
       ALTER TEXT SEARCH DICTIONARY public.stopper_dict (
         STOPWORDS = english
       );
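A dictionary only affects TO_TSVECTOR once a text search configuration maps token types to it. The slides do not show that step, so the following is a sketch under assumptions: `public.demo_stop_dict` and `public.demo_stop_cfg` are hypothetical names, and mapping only `asciiword` and `word` is one possible choice:

```sql
-- Hypothetical dictionary, built the same way as the slide's stopper_dict
CREATE TEXT SEARCH DICTIONARY public.demo_stop_dict (
    TEMPLATE  = pg_catalog.simple,
    STOPWORDS = english
);

-- Start from a copy of the built-in English configuration
CREATE TEXT SEARCH CONFIGURATION public.demo_stop_cfg (COPY = pg_catalog.english);

-- Route plain words through the custom dictionary
ALTER TEXT SEARCH CONFIGURATION public.demo_stop_cfg
    ALTER MAPPING FOR asciiword, word
    WITH public.demo_stop_dict;

-- Check one word against the dictionary, then a sentence against the configuration
SELECT TS_LEXIZE('public.demo_stop_dict', 'the');   -- {} : a stop word
SELECT TO_TSVECTOR('public.demo_stop_cfg', 'The data and the models');
```

The `simple` template lower-cases and removes stop words but does not stem, so 'models' stays 'models' under this configuration. To keep stemming, one option is to create the dictionary with `ACCEPT = false` and list `english_stem` after it in the mapping.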
  • 51. Create table for TS_REWRITE
       CREATE TABLE fts_alias (
         term TSQUERY PRIMARY KEY,
         alias TSQUERY
       );
       - Add term aliases and stop words
       INSERT INTO fts_alias VALUES
         ('contain'::TSQUERY, 'contain | include'::TSQUERY),
         ('data'::TSQUERY, ''::TSQUERY);
       - Use the alias table with TS_REWRITE
       SELECT TS_REWRITE(('data & information & contain')::TSQUERY,
                         'SELECT * FROM fts_alias');
       -- 'information' & ( 'include' | 'contain' )
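One caveat with the `::TSQUERY` casts above: a cast does not stem its input, while tsvectors built with the `english` configuration contain stemmed lexemes, so an unstemmed alias such as 'include' may never match. A hedged variant builds both columns with TO_TSQUERY instead; `fts_alias_demo` is a hypothetical table name used to keep the sketch separate from the talk's table:

```sql
CREATE TABLE fts_alias_demo (
    term  tsquery PRIMARY KEY,
    alias tsquery
);

-- TO_TSQUERY normalizes the terms the same way the documents were normalized
INSERT INTO fts_alias_demo VALUES
    (TO_TSQUERY('english', 'contain'), TO_TSQUERY('english', 'contain | include'));

-- Rewrite a user query before it is matched against the tsvectors
SELECT TS_REWRITE(PLAINTO_TSQUERY('english', 'information contained'),
                  'SELECT term, alias FROM fts_alias_demo');
```

The rewritten tsquery can then be used on the right-hand side of `@@` exactly like any other query.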
  • 52. RETURN from …
       - Create custom dictionary and stop words plus query rewrite: 60_FTS_stop_words_custom_dictionary_and_query_v10.sql
  • 53. Query-term spelling suggestions with TriGrams
       - Use TriGrams, the pg_trgm extension, with a list of words in the documents to identify words that are close to input queries as suggestions for misspelled terms
       - Step 1: create a table of words
       CREATE TABLE fts_word AS
         (SELECT word
          FROM TS_STAT('SELECT tsv_document FROM fts_amer_hist'))
         UNION
         (SELECT word
          FROM TS_STAT('SELECT tsv_document FROM fts_data_sci'));
  • 54. Query-term spelling suggestions with TriGrams
       - Step 2: create an index
       CREATE INDEX idx_fts_word ON fts_word
         USING GIN(word gin_trgm_ops);
       - Step 3: query for 'close' terms that exist in the corpus
       SELECT word, sml
       FROM fts_word,
            SIMILARITY(word, 'asymetric') AS sml
       WHERE sml > 0.333; -- arbitrary value to filter results
       Check it out in action …
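Two details are worth sketching here, assuming the `fts_word` table from Step 1 exists. First, the extension has to be installed before `gin_trgm_ops` is available. Second, a filter on the computed `sml` value cannot use the trigram index, whereas the `%` similarity operator can:

```sql
-- Required once per database before creating the trigram index
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- The threshold used by the % operator (0.3 unless changed with SET_LIMIT)
SELECT SHOW_LIMIT();

-- Index-assisted variant of Step 3: % filters, SIMILARITY ranks
SELECT word, SIMILARITY(word, 'asymetric') AS sml
FROM fts_word
WHERE word % 'asymetric'
ORDER BY sml DESC, word
LIMIT 5;
```

Because `fts_word` holds stemmed lexemes rather than dictionary words, the suggestions are stems; a table of unstemmed words would give friendlier output.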
  • 55. Suggested spelling with TriGrams
       SELECT word, sml
       FROM fts_word,
            SIMILARITY(word, 'asymetric') AS sml
            -- 'asymmetric' is the correct spelling
       WHERE sml > 0.333 -- arbitrary value to filter results
       ORDER BY sml DESC, word;

       word        sml
       ==========  ========
       asymetr     0.636364
       asymmetri   0.538462
       asymmetr    0.461538
       metric      0.416667
  • 56. Summary and review
       - Infrastructure and data wrangling
       - Creating FTS tables with maintenance triggers and loading our data
       - Compare FTS with traditional SQL searches and run FTSs on documents from early American History
       - Rank search results on documents from Data Science
       - Generate HTML-tagged fragments with matching terms
       - Customize the stop-word dictionary
       - Suggest spelling options for query terms
       - Re-write queries at run-time
       All slides, data and scripts are on the PGConf web site
  • 57. LIFE. LIBERTY. TECHNOLOGY.
       "Freedom Consulting Group is a talented, hard-working, and committed partner, providing hardware, software and database development and integration services to a diverse set of clients."
  • 58. Enabling commercial adoption of Postgres
       - POSTGRES innovation: thousands of developers, fast development cycles, low cost, no vendor lock-in, advanced features
       - ENTERPRISE reliability: 24/7 support, services & training, enterprise-class features, tools & compatibility, indemnification, product road-map, control
  • 59. Are there any questions or follow up?
       freedomconsultinggroup.com/jhanson
  • 60. Freedom Consulting Group
       www.freedomconsultinggroup.com
       Jamey Hanson
       jhanson@freedomconsultinggroup.com
       jamesphanson@yahoo.com