Sphinx is a standalone full-text search daemon that provides advanced searching over large collections of text, whether stored in a database or as documents on a file system. Sphinx can scale to billions of documents while still returning sub-second results for boolean queries, wildcards and other advanced search features. I cover basic setup and building a simple index, and demonstrate how to use SQL queries to retrieve results through its API.
1. Search Server
Sphinx is an open source full text search server, designed from
the ground up with performance, relevance (a.k.a. search
quality), and integration simplicity in mind.
• Craigslist serves 200 million queries/day
• Used by Slashdot, Mozilla, Meetup
• Scales to billions of documents (distributed)
• Supports almost any data source (SQL, XML, etc.)
• Batch and real-time indexes
By Andrew Kandels
2. What is a Search Server?
Sphinx is like a database because…
• It has a schema
• It has field types (integer, boolean, strings, dates)
• It responds to queries (SQL, API):
SELECT * FROM Books WHERE MATCH('a rose by any other name')
3. Documents
Sphinx indexes data from just about any source.
SELECT
    CONCAT(a.first_name, ' ', a.last_name) AS full_name,
    COUNT(b.book_id) AS num_books,
    MIN(b.publish_date) AS first_published
FROM
    author a
    INNER JOIN book b
        ON a.author_id = b.author_id
GROUP BY
    a.author_id, a.first_name, a.last_name
<?xml version="1.0"?>
<author>
<id>1433</id>
<name>Mark Twain</name>
<books>
<book>A Connecticut Yankee in King Arthur’s Court</book>
</books>
</author>
4. How it Works
Sphinx parses plain text queries and answers with rows.
Search
@author_id 15 "Mark Twain" king << arthur
Results
1. document=1433, weight=1692, createdAt=Jan 1 1889
5. Relevance
Only the strongest will survive, but relevance is in the
eye of the beholder. Some factors include:
• How many times did our keywords match?
• How many times did they repeat in the query?
• How frequently do keywords appear?
• Do keywords in the document appear in the same order as
the query?
• Did we match exactly, or is it a stemmed match?
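As a rough illustration of two of the factors above (hit count and keyword order), here is a minimal Python sketch. The function name toy_score and the bonus weight are assumptions made up for this example; it is not how Sphinx's rankers compute weight.

```python
def toy_score(query, document, order_bonus=10):
    """Toy relevance score: one point per keyword hit, plus a bonus
    when the query keywords appear in the document in query order.
    The weights are hypothetical and chosen only for illustration."""
    q_words = query.lower().split()
    d_words = document.lower().split()

    # Factor 1: how many times did our keywords match?
    hits = sum(d_words.count(word) for word in q_words)

    # Factor 2: do the keywords appear in the same order as the query?
    # (checks that q_words is a subsequence of d_words)
    remaining = iter(d_words)
    in_order = all(word in remaining for word in q_words)

    return hits + (order_bonus if in_order and hits else 0)
```

Calling toy_score("mark twain", "books by mark twain") rewards the in-order match, while toy_score("twain mark", "books by mark twain") only counts the hits.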
6. B-Tree Index
User                                               Index (Last Name (4))
First Name  Last Name  City        State  Notes    Row #  Contents
Allison     Janney     Baltimore   MA     Cregg    1      Jann
John        Spencer    Des Moines  IA     McGarry  5      Molo
Bradley     Whitford   Newport     VA     Lyman    6      Schi
Martin      Sheen      Seattle     WA     Bartlett 4      Shee
Janel       Moloney    Hollywood   CA     Moss     2      Spen
Richard     Schiff     Lincoln     NE     Ziegler  3      Whit
A B-tree is a tree data structure that keeps data sorted and allows searches,
sequential access, insertions, and deletions in logarithmic time.
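The prefix index in the table can be sketched in Python. This sketch swaps the B-tree for a sorted list searched with bisect: lookups are still logarithmic, but inserts into a list are not, so treat it as a stand-in rather than a real B-tree. The row numbers are assumed to follow the table order, and lookup is a hypothetical helper name.

```python
import bisect

# Row number -> last name, following the order of the user table.
rows = {1: "Janney", 2: "Spencer", 3: "Whitford",
        4: "Sheen", 5: "Moloney", 6: "Schiff"}

# Index on the first 4 characters of the last name, kept sorted.
index = sorted((name[:4], row) for row, name in rows.items())
keys = [key for key, _ in index]

def lookup(last_name):
    """Return the row numbers whose 4-character prefix matches,
    located with a binary search over the sorted keys."""
    prefix = last_name[:4]
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_right(keys, prefix)
    return [row for _, row in index[lo:hi]]
```

Because the index stores only a prefix, a hit still has to be checked against the full value in the row it points to.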
7. Logical Queries
Logical conditions return a boolean result based on an
expression:
country = 'United States'
AND num_published >= 50
AND (author_id = 5 OR author_id = 8 OR author_id = 10)
Logical queries can be complex and typically evaluate against
the whole value of a column.
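The same condition can be written as a plain Python predicate, which makes the "whole value" point visible: each comparison looks at an entire column value, never at words inside it. The row dictionaries and the function name matches are assumptions for this sketch.

```python
def matches(row):
    """Evaluate the logical condition against one row (a dict).
    Every test compares a complete column value."""
    return (row["country"] == "United States"
            and row["num_published"] >= 50
            and row["author_id"] in (5, 8, 10))
```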
8. Stemming
Stemming (a.k.a. morphology) is the process of reducing inflected or derived
words to their stem, base or root form.
For example, “catching” and “catches” both reduce to “catch”. Synonyms are a
separate case: “dove” and “pigeon” are different words that can mean the same
thing, so they have to be mapped to each other explicitly.
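A toy suffix-stripping stemmer shows the idea of reducing inflected words to a common stem. The function name naive_stem and its short suffix list are assumptions for illustration; real stemmers such as the Porter algorithm apply far more rules.

```python
def naive_stem(word, suffixes=("ing", "es", "ed", "s")):
    """Strip the first matching suffix, as long as at least three
    characters of stem remain. A deliberately crude sketch."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

With this sketch, "catching", "catches" and "catch" all end up as the same keyword, so a query for one form can match documents containing another.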
9. Tokenizing
Sphinx breaks down documents into keywords. This is called tokenization.
Exceptions to the word-breaking characters keep keywords like AT&T, C++ or
T-Mobile intact.
Short words are ignored (words shorter than min_word_len, which defaults to 1, so
nothing is dropped until you raise it), but a placeholder is saved to support
proximity and phrase searching.
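A minimal Python sketch of this kind of tokenizer follows. The function name tokenize, the regular expression, and the threshold of 3 are assumptions for illustration, not Sphinx's tokenizer; the point is that skipped short words still consume a position.

```python
import re

def tokenize(text, min_word_len=3, exceptions=("AT&T", "C++", "T-Mobile")):
    """Return (position, keyword) pairs. Words shorter than
    min_word_len are dropped but still use up a position, so
    phrase and proximity distances stay correct."""
    # Try the exceptions first so they survive as whole tokens,
    # then fall back to plain runs of word characters.
    pattern = "|".join(re.escape(e) for e in exceptions) + r"|\w+"
    tokens = []
    for position, match in enumerate(re.finditer(pattern, text), start=1):
        word = match.group(0)
        if word in exceptions or len(word) >= min_word_len:
            tokens.append((position, word.lower()))
    return tokens
```

For "A man caught a fish" the kept keywords sit at positions 2, 3 and 5; positions 1 and 4 are the placeholders left by the short words.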
10. Full Text Index
Inversion
Document               Index (Full Text)
A man caught a fish    [spacer]
                       man, person, human, being
                       caught, catch, catcher, catching, catches
                       [spacer]
                       fish, fishing, fished, fisher
Metadata
man      2  1
caught   3  1
fish     5  1
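The inversion step can be sketched in Python as a mapping from keyword to the documents and positions where it occurs. The function name build_inverted_index, the whitespace splitting, and the length threshold are simplifying assumptions; Sphinx's on-disk format is not shown here.

```python
from collections import defaultdict

def build_inverted_index(documents, min_word_len=3):
    """Map keyword -> {doc_id: [positions]}. Short words are
    skipped, but the positions of later words are unaffected."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split(), start=1):
            if len(word) < min_word_len:
                continue  # acts like the [spacer] slot
            index[word].setdefault(doc_id, []).append(position)
    return dict(index)
```

Indexing {1: "A man caught a fish"} places "man" at position 2, "caught" at 3 and "fish" at 5, matching the positions in the metadata.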
11. Full Text Queries
Full-text queries search across multiple columns or within the contents of a
column. This is also known as keyword searching.
Boolean Search             fiction AND (Twain OR Dickens)
Phrase Search              "Mark Twain"
Field-Based Search         @author_id 15
Proximity Search           "fear itself"~2, fear << itself
Substring Search           @author[4] Mark
Quorum Search              "the world is a wonderful place"/3
Same Sentence/Paragraph    fear SENTENCE itself
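Two of these query types are easy to sketch in Python over a bag of words. Both helpers (boolean_match and quorum_match) are hypothetical names and only approximate the idea; they do not parse Sphinx query syntax.

```python
def boolean_match(document):
    """fiction AND (Twain OR Dickens), hard-coded for illustration."""
    words = set(document.lower().split())
    return "fiction" in words and ("twain" in words or "dickens" in words)

def quorum_match(query, document, threshold):
    """True when at least `threshold` distinct query words
    occur somewhere in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words) >= threshold
```

For example, quorum_match("the world is a wonderful place", "what a wonderful world", 3) succeeds because three of the six query words are present.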
13. Important Files and Binaries
A successful Sphinx installation provides the following:
searchd       The search daemon; answers queries
indexer       Collects documents and builds the index
search        Performs a search (useful for debugging)
sphinx.conf   Defines your data and configures your indexes and daemon
15. Sphinx.conf Blocks
The contents of sphinx.conf consist of several named blocks:
source    Defines your data source and queries
index     Defines which sources to index and how
indexer   Configures the indexer utility
searchd   Configures the search daemon
16. Source
Define the connection to your database and query in the source block.
source filmssource
{
    type     = mysql
    sql_host = localhost
    sql_user = root
    sql_pass =
    sql_db   = sakila

    # A multi-line value needs a trailing backslash on each continued line.
    # The first column (film_id) is the document ID.
    sql_query = \
        SELECT f.film_id, f.title, f.description, \
            f.release_year, f.rating, l.name AS language \
        FROM film f \
        INNER JOIN language l \
            ON l.language_id = f.language_id

    sql_attr_uint   = release_year
    sql_attr_string = rating
    sql_attr_string = language
}
17. Index
Define which sources to include and index parameters:
index films
{
    source         = filmssource
    charset_type   = utf-8
    path           = /home/andrew/sphinx/films
    stopwords      = /home/andrew/sphinx/stopwords.txt
    enable_star    = 1
    min_word_len   = 2
    min_prefix_len = 0
    min_infix_len  = 2
}
20. stopwords.txt
To generate stopwords from your data, use the indexer binary:
indexer --config /path/to/sphinx.conf \
    --buildstops /path/to/stopwords.txt 25
of
who
must
in
and
the
mad
an
Builds a stopwords.txt file with the 25 most commonly found words.
Use --buildfreqs to include counts.
Stopwords can dramatically reduce the index size and time-to-build, but it’s a
good idea to inspect the output before using it!
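The idea behind a frequency-based stopword list can be sketched in a few lines of Python. The function name build_stopwords and the whitespace splitting are assumptions; this is not the indexer's implementation.

```python
from collections import Counter

def build_stopwords(documents, top_n=25):
    """Return the top_n most frequent words across all documents.
    Frequent words carry little meaning for search, which makes
    them candidates for the stopword list."""
    counts = Counter(
        word for text in documents for word in text.lower().split()
    )
    return [word for word, _ in counts.most_common(top_n)]
```

A word like "mad" in the sample output above is the reason to review the list by hand: it is frequent in that data set, but you may still want it to be searchable.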
21. Build your Index
To generate your index, use the indexer binary:
indexer --config /path/to/sphinx.conf --all --rotate
Sphinx 2.0.4-release (r3135)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file 'sphinx.conf'...
indexing index 'films'...
collected 1000 docs, 0.1 MB
sorted 0.3 Mhits, 100.0% done
total 1000 docs, 108077 bytes
total 0.148 sec, 727012 bytes/sec, 6726.80 docs/sec
total 3 reads, 0.003 sec, 675.6 kb/call avg, 1.1 msec/call avg
total 11 writes, 0.004 sec, 331.8 kb/call avg, 0.4 msec/call avg
22. Start the Server
Start the server by executing the searchd binary:
searchd --config /path/to/sphinx.conf
Sphinx 2.0.4-release (r3135)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file 'sphinx.conf'...
listening on 127.0.0.1:9312
listening on 127.0.0.1:9306
precaching index 'films'
precached 1 indexes in 0.001 sec
23. Run a Search
Test your index by running a search:
search --limit 3 robot
Sphinx 2.0.4-release (r3135)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file './sphinx.conf'...
index 'films': query 'robot ': returned 77 matches of 77 total in 0.000 sec
displaying matches:
1. document=138, weight=1612, release_year=2006, rating=R, language=English
2. document=920, weight=1612, release_year=2006, rating=G, language=English
3. document=6, weight=1581, release_year=2006, rating=PG, language=English
words:
1. 'robot': 77 documents, 79 hits
24. MySQL Interface
You can query Sphinx using the MySQL protocol:
mysql -h127.0.0.1 -P 9306
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 2.0.4-release (r3135)
Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved.
This software comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to modify and redistribute it under the GPL v2 license
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql>
25. MySQL Interface
Queries are written in SphinxQL, which is much like SQL:
mysql> SELECT *
FROM films
WHERE MATCH('robot')
ORDER BY release_year DESC
LIMIT 5;
+------+--------+--------------+--------+----------+
| id   | weight | release_year | rating | language |
+------+--------+--------------+--------+----------+
|    6 |   1581 |         2006 | PG     | English  |
|   16 |   1581 |         2006 | NC-17  | English  |
|   25 |   1581 |         2006 | G      | English  |
|   42 |   1581 |         2006 | NC-17  | English  |
|   61 |   1581 |         2006 | G      | English  |
+------+--------+--------------+--------+----------+
5 rows in set (0.00 sec)
26. MySQL Interface
Additional metrics can also be retrieved:
mysql> SHOW META;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total         | 77    |
| total_found   | 77    |
| time          | 0.000 |
| keyword[0]    | robot |
| docs[0]       | 77    |
| hits[0]       | 79    |
+---------------+-------+
6 rows in set (0.00 sec)
27. MySQL Interface
You can even do grouping:
mysql> SELECT rating, COUNT(*) AS num_movies,
MIN(release_year) AS first_year
FROM films
GROUP BY rating
ORDER BY num_movies DESC;
+------+--------+--------------+--------+------------+--------+
| id   | weight | release_year | rating | first_year | @count |
+------+--------+--------------+--------+------------+--------+
|    7 |      1 |         2006 | PG-13  |       2006 |    223 |
|    3 |      1 |         2006 | NC-17  |       2006 |    210 |
|    8 |      1 |         2006 | R      |       2006 |    195 |
|    1 |      1 |         2006 | PG     |       2006 |    194 |
|    2 |      1 |         2006 | G      |       2006 |    178 |
+------+--------+--------------+--------+------------+--------+
5 rows in set (0.00 sec)
28. Other Applications
Sphinx does more than just full text search. It has other practical
applications as well:
• Metrics and Reporting
• Data Warehouse
• Materialized Views
• Operational Data Store
• Offloading Queries
29. Quick and Dirty PHP
Integrate Sphinx by using any MySQL driver (like PDO):
30. SphinxAPI
Or use a native extension like SphinxClient for PHP:
Download it here: http://pecl.php.net/sphinx
32. Main+delta Batch Indexes
Disk indexes often use the main+delta(s) strategy:
• One or more delta indexes collect new data as often as every minute.
• Larger batch indexes rebuild daily, weekly or even less frequently.
Disk indexes have the following benefits:
• They can be re-indexed online without interruption (--rotate)
• They can be distributed over filesystems and hardware
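Searching a main+delta pair can be pictured with a small Python sketch. Each index is modeled as a dictionary of document ID to text, and the function name search_main_delta is hypothetical. Letting the delta copy win when both indexes hold the same ID is an assumption of this sketch, not a description of how Sphinx resolves duplicates.

```python
def search_main_delta(keyword, main_index, delta_index):
    """Search both indexes for a keyword and return matching IDs.
    The small delta holds recent changes; the large main index is
    rebuilt less often. Delta entries override main entries here."""
    merged = {**main_index, **delta_index}
    keyword = keyword.lower()
    return sorted(
        doc_id for doc_id, text in merged.items()
        if keyword in text.lower().split()
    )
```

Only the small delta has to be rebuilt every few minutes, which is what keeps frequent re-indexing cheap.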
33. The End
There’s a book! Andrew Kandels
Website: http://andrewkandels.com
Twitter: @andrewkandels
Facebook/G+: No thanks