3. Report Buyer product catalogue:
• Text fields: title, subtitle, summary, toc
• Product code and ISBN
• Supplier, category, type and availability
• Publication date and price
4. Enterprise class search engine
Scalable and based on Apache Lucene
RESTful API or PECL extension
Fast, transactional full-text indexing
Faceted and geospatial search
Rich document indexing
Comes with simple web interface
Built-in caching of queries and responses
Numerous plug-ins
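As a rough illustration of the RESTful API and faceting mentioned above, the sketch below builds a Solr select URL in PHP. The host, port, handler path and the `category` field are assumptions made for the example; only the `q`, `rows`, `wt`, `facet` and `facet.field` parameters are standard Solr query parameters.

```php
<?php
// Build a Solr select URL with one facet field.
// Base URL and field names are illustrative assumptions, not taken from the slides.
function buildSolrSelectUrl(string $baseUrl, string $keywords, string $facetField, int $rows = 10): string
{
    $params = [
        'q'           => $keywords,    // main query string
        'rows'        => $rows,        // number of results to return
        'wt'          => 'json',       // ask for a JSON response
        'facet'       => 'true',       // switch faceting on
        'facet.field' => $facetField,  // field to count facet values for
    ];
    // http_build_query() URL-encodes each value and joins the pairs with "&".
    return rtrim($baseUrl, '/') . '/select?' . http_build_query($params);
}

echo buildSolrSelectUrl('http://localhost:8983/solr', 'china pharmaceutical', 'category'), "\n";
```

The resulting URL can be fetched with any HTTP client; the PECL extension wraps the same requests in objects.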
5. Available as system packages
Uses Tomcat or Jetty
Requires a restart on configuration change
Packages install as a service
13. More features than other products
Responsive, busy mailing list
Large team of developers
Good PHP libraries for integration
Several books available
Fairly heavy footprint
14. Also built on Apache Lucene
JSON-based
Distributed, scalable server model
Easy to configure, or configuration free
Faceting and highlight support
Auto type detection
Multiple indexes
CouchDB integration
15. Download and unpack zip file
Run elasticsearch/bin/elasticsearch
16. No schema is required - almost
No configuration is required - almost
17. GET http://localhost:9200/ HTTP/1.0
{
"ok" : true,
"name" : "Test",
"version" : {
"number" : "0.18.7",
"snapshot_build" : false
},
"tagline" : "You Know, for Search",
"cover" : "DON'T PANIC",
"quote" : {
"book" : "The Hitchhiker's Guide to the Galaxy",
"chapter" : "Chapter 27",
"text1" : "\"Forty-two,\" said Deep Thought, with infinite majesty and calm.",
"text2" : "\"The Answer to the Great Question, of Life, the Universe and Everything\""
}
}
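Because the server speaks JSON over HTTP, a client mostly needs to build URLs and request bodies. The following is a minimal sketch, assuming the `/{index}/{type}/{id}` document URL layout of early Elasticsearch releases (later versions dropped types) and a `query_string` query. The index, type and helper names are hypothetical.

```php
<?php
// URL for a single document, in the /{index}/{type}/{id} form (assumed layout).
function esDocumentUrl(string $host, string $index, string $type, string $id): string
{
    return rtrim($host, '/') . '/'
        . rawurlencode($index) . '/'
        . rawurlencode($type) . '/'
        . rawurlencode($id);
}

// JSON body for a keyword search using a query_string query.
function esSearchBody(string $keywords, int $size = 10): string
{
    return json_encode([
        'query' => ['query_string' => ['query' => $keywords]],
        'size'  => $size,
    ]);
}

echo esDocumentUrl('http://localhost:9200', 'reports', 'report', 'cc74cb075aa37696198e87850f033398'), "\n";
echo esSearchBody('china pharmaceutical', 2), "\n";
```

Sending the document as the body of a PUT to the first URL would index it, and the second string would be sent to the index's `_search` endpoint.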
23. Nicolas Ruflin's Elastica
Raymond Julin's elasticsearch
Niranjan Uma Shankar's elasticsearch-php
24. Very fast indexing
Auto-scaling architecture
Elegant REST approach
Flexible zero configuration model
Poor documentation
No feature list, conceptual model or introduction
All data is stored, meaning large indices
25. Indexes MySQL, MSSQL, XML or ODBC
Querying through Sphinx PHP API
Searching through SQL queries or API
Scales to 6TB of data in 16bn documents and 2000 queries/sec
Used by Craigslist, Boardreader
Runs as a storage engine in MySQL
26. Install from system packages or source
Source tarball is needed to get PHP SphinxAPI
No other software needed
Runs as a service in Ubuntu
27. Plain index - fast search, slow update
Real-time index - fast update, less efficient
Distributed - combination of both methods
28. index rb_test
{
# index type
type = rt
path = /mnt/data_indexed/sphinx/rb_test
# define the fields we're indexing
rt_field = name
rt_field = subtitle
rt_field = summary
rt_field = toc
# define the fields we want to get back out
rt_attr_string = item_guid
rt_attr_string = supplier
rt_attr_string = product_code
rt_attr_string = isbn
rt_attr_string = category
rt_attr_uint = price
rt_attr_string = link
rt_attr_timestamp = publish_date
# morphology preprocessors to apply
morphology = stem_en
html_strip = 1
html_index_attrs = img=alt,title; a=title;
html_remove_elements = style, script
}
30. mysql --host=127.0.0.1 --port=9306
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 2.0.3-id64-release (r3043)
mysql> select item_guid, title, subtitle, price from rb_search where match('china pharmaceutical') and price > 100 and price < 300 limit 2\G
*************************** 1. row ***************************
id: 5228810066049016302
weight: 6671
price: 220
item_guid: cc74cb075aa37696198e87850f033398
title: North China Pharmaceutical Group Corp-Therapeutic Competitors Report
subtitle:
*************************** 2. row ***************************
id: 3548867347418583847
weight: 6662
price: 190
item_guid: 6ce04df0fb277aa3ff596c2ca00c81a9
title: China Pharmaceutical Industry Report
subtitle: 2006-2007
2 rows in set (0.01 sec)
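The same query can be assembled from PHP before it is sent over a MySQL-protocol connection to port 9306. This is only a sketch: the `buildSphinxQl` helper is hypothetical, and the index and column names are copied from the session above.

```php
<?php
// Assemble the SphinxQL statement from the session above.
// buildSphinxQl() is an illustrative helper, not part of any Sphinx library.
function buildSphinxQl(string $index, string $keywords, int $minPrice, int $maxPrice, int $limit): string
{
    // Index names cannot be quoted like values, so only allow a safe pattern.
    if (!preg_match('/^[A-Za-z_][A-Za-z0-9_]*$/', $index)) {
        throw new InvalidArgumentException('Invalid index name');
    }
    // Escape backslashes and single quotes inside the string literal.
    $escaped = strtr($keywords, ['\\' => '\\\\', "'" => "\\'"]);

    return sprintf(
        "SELECT item_guid, title, subtitle, price FROM %s WHERE MATCH('%s') AND price > %d AND price < %d LIMIT %d",
        $index,
        $escaped,
        $minPrice,
        $maxPrice,
        $limit
    );
}

echo buildSphinxQl('rb_search', 'china pharmaceutical', 100, 300, 2), "\n";
```

The string can then be run through an ordinary MySQL client library pointed at 127.0.0.1:9306, as the command-line session suggests.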
31. Fastest indexing of all engines
Really simple interface via SQL
Document IDs must be unsigned integers
No faceting support
Good support in forums
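Since document IDs must be unsigned integers, a catalogue keyed by hexadecimal GUIDs needs some mapping from GUID to integer. One possible approach, shown here purely as an assumption and not necessarily what was used for the index above, is to take the leading 60 bits of the GUID. It requires 64-bit PHP, and collisions are unlikely but not impossible.

```php
<?php
// Derive an integer document ID from a hexadecimal GUID.
// guidToDocId() is a hypothetical helper for illustration only.
function guidToDocId(string $guid): int
{
    if (strlen($guid) < 15 || !ctype_xdigit($guid)) {
        throw new InvalidArgumentException('Expected a hexadecimal GUID of at least 15 characters');
    }
    // 15 hex digits are 60 bits, which fits in a signed 64-bit PHP integer.
    return (int) hexdec(substr($guid, 0, 15));
}

echo guidToDocId('cc74cb075aa37696198e87850f033398'), "\n";
```

The original GUID can still be stored as a string attribute, as `item_guid` is in the index definition, so results map back to the catalogue.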
32. Deployed as a C++ library
Bindings provided to connect to PHP
Available in most package repositories
Bindings need to be compiled separately
Query Parser, similar to other engines
Stemming and faceted search
Server replication
33. Install from system packages
Compile PHP bindings from source
No other software needed
Runs on demand
34. No configuration required
Define-and-go schema
Documents
Terms
Values
Document data
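With no schema file, the "schema" is whatever the application code agrees on. The indexing and search scripts in this deck read two lookup arrays, `$xapian_term_weights` and `$xapian_value_slots`, that the slides do not define. The definitions below are guesses for illustration: the field names come from the catalogue description, while the weights and slot numbers are assumptions.

```php
<?php
// Hypothetical definitions; the actual weights and slot numbers are not shown in the slides.

// Text fields to index as terms, with the weight (wdf increment) given to each.
$xapian_term_weights = [
    'name'     => 50,
    'subtitle' => 25,
    'summary'  => 10,
    'toc'      => 5,
];

// Value slots: numbered per-document storage used for sorting, ranges and facets.
$xapian_value_slots = [
    'price'        => 0,
    'publish_date' => 1,
    'availability' => 2,
];

// Each value must live in its own slot, so check that the numbers are unique.
if (count(array_unique($xapian_value_slots)) !== count($xapian_value_slots)) {
    throw new LogicException('Value slot numbers must be unique');
}

echo implode(', ', array_keys($xapian_term_weights)), "\n";
```

Keeping these arrays in one shared file lets the indexer and the search page use the same slot numbers.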
35. <?php
$xapian_db = new XapianWritableDatabase($xapian, Xapian::DB_CREATE_OR_OVERWRITE);
$xapian_term_generator = new XapianTermGenerator();
$xapian_term_generator->set_stemmer(new XapianStem("english"));
while ($row = mysql_fetch_array($result, MYSQL_ASSOC)) {
$doc = new XapianDocument();
$xapian_term_generator->set_document($doc);
foreach ($xapian_term_weights as $field => $weight) {
$xapian_term_generator->index_text($row[$field], $weight);
}
$xapian_term_generator->index_text($row['name'], 75, 'S:');
$doc->add_boolean_term('CODE:' . $row['product_code']);
$doc->add_value($xapian_value_slots['price'], Xapian::sortable_serialise($row['price']));
$doc->add_value($xapian_value_slots['publish_date'], strftime("%Y%m%d", strtotime($row['publish_date'])));
// add in additional values that we're going to use for facets
$doc->add_value($xapian_value_slots['availability'], $row['availability']);
$doc->set_data(serialize($doc_data));
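$docid = 'Q'.$row['item_guid'];
// index the unique ID term so replace_document() can find this document on later runs
$doc->add_boolean_term($docid);
$xapian_db->replace_document($docid, $doc);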
}
?>
36. <?php
$xapian_db = new XapianDatabase($xapian);
$query_parser = new XapianQueryParser();
$query_parser->set_stemmer(new XapianStem("english"));
$query_parser->set_default_op(XapianQuery::OP_AND);
$dvrProcessor = new XapianDateValueRangeProcessor($xapian_value_slots['publish_date'], 'date:');
$query_parser->add_valuerangeprocessor($dvrProcessor);
$query_parser->add_prefix("code", "CODE:");
$query_parser->add_prefix("category", "CATEGORY:");
$query_parser->add_prefix("title", "S:");
$query = $query_parser->parse_query('"Medical devices" NEAR china NOT russian price:10..150 category:medical');
$enquire = new XapianEnquire($xapian_db);
$enquire->set_query($query);
$matches = $enquire->get_mset($offset, $pagesize);
// walk the result set using its begin/end iterators
$start = $matches->begin();
$end = $matches->end();
while (!($start->equals($end))) {
$doc = $start->get_document();
$price = Xapian::sortable_unserialise($doc->get_value($xapian_value_slots['price']));
$start->next();
}
?>
37. Only one option available from Xapian
Requires additional compilation due to licensing
Not very well documented API
38. Reasonably fast indexing
Very flexible implementation
Faceting and range searching
Good Quick Start guide
Responsive mailing list
Third-party paid support
39. Every project has different needs
Not one search product fits all
Fastest to index was Sphinx
Most feature-rich: Solr
The next steps are up to you