6. What is Solr Missing?
• Not a database
• Doesn’t cluster
• Not transparently sharded
• Requires ETL to ingest application data
• Doesn’t reindex
7. Confidential
Open Source Search Reference Architecture
[Diagram: your application talks to an OLTP database through a DB API for transactional workloads, and to a separate search cluster through a search API for search workloads; your own ETL pipeline copies data from the database into the search cluster.]
9. DSE Search Reference Architecture
Search + Cassandra
Your Application → CQL
• Easy CQL API
• All the goodness of the DataStax driver
• Distributed, replicated, always on
• Data locality and shared memory
• Automatic indexing on DB insert
• Higher ingestion throughput
• Distributed query optimization
Compared to open source search:
• No separate search cluster to manage
• Probably less total hardware required
• No “split brain” data inconsistencies
• No ETL or sync to build and maintain
• No app-level data management code
17. Filter queries: These are awesome because the result set gets cached in memory.
SELECT * FROM amazon.metadata WHERE solr_query='{"q":"title:Noir~", "fq":"categories:Books", "sort":"title asc"}' limit 10;
Faceting: Get counts of fields.
SELECT * FROM amazon.metadata WHERE solr_query='{"q":"title:Noir~", "facet":{"field":"categories"}}' limit 10;
Geospatial searches: Supports box and radius.
SELECT * FROM amazon.clicks WHERE solr_query='{"q":"asin:*", "fq":"+{!geofilt pt="37.7484,-122.4156" sfield=location d=1}"}' limit 10;
Joins: Not your relational joins. These queries 'borrow' indexes from other tables to add filter logic. These are fast!
SELECT * FROM amazon.metadata WHERE solr_query='{"q":"*:*", "fq":"{!join from=asin to=asin force=true fromIndex=amazon.clicks}area_code:415"}' limit 5;
Fun, all in one:
SELECT * FROM amazon.metadata WHERE solr_query='{"q":"*:*", "facet":{"field":"categories"}, "fq":"{!join from=asin to=asin force=true fromIndex=amazon.clicks}area_code:415"}' limit 5;
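The join filter above can be pictured with a toy model. This is an illustrative sketch only, with made-up sample rows; the real DSE implementation borrows Lucene index structures rather than scanning Python lists, but the filtering logic is the same: collect the `from` keys of rows matching the condition, then keep only documents whose `to` key appears in that set.

```python
# Toy model of a Solr-style cross-table join filter (illustration only;
# sample data and function names are hypothetical, not DSE internals).
metadata = [
    {"asin": "B001", "title": "Film Noir"},
    {"asin": "B002", "title": "Cooking 101"},
]
clicks = [
    {"asin": "B001", "area_code": "415"},
    {"asin": "B002", "area_code": "212"},
]

def join_filter(docs, from_docs, from_field, to_field, predicate):
    """Keep docs whose to_field value appears among from_docs matching predicate."""
    keys = {d[from_field] for d in from_docs if predicate(d)}
    return [d for d in docs if d[to_field] in keys]

# Mirrors: fq={!join from=asin to=asin fromIndex=amazon.clicks}area_code:415
result = join_filter(metadata, clicks, "asin", "asin",
                     lambda d: d["area_code"] == "415")
print(result)  # only the B001 metadata row survives the filter
```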
19.
1) Spin up a new C* cluster with search enabled using the DSE installer.
$ sudo service dse cassandra -s
2) Run your schema DDL to create the C* keyspace and tables.
3) Run dsetool on the videos table.*
$ dsetool create_core keyspace.table generateResources=true reindex=true
4) Write a CQL query with a Solr search in it.
SELECT * FROM keyspace.table
WHERE solr_query='column:*'
*This will create Lucene indexes on ALL the columns in your table.
20. Behind the scenes…
dsetool → schema.xml + solrconfig.xml → CQL query
$ dsetool create_core killrvideo.videos generateResources=true
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="autoSolrSchema" version="1.5">
<types>
…
<fields>
<field indexed="true" multiValued="false" name="added_date" stored="true" type="TrieDateField"/>
<field indexed="true" multiValued="false" name="location" stored="true" type="TextField"/>
<field indexed="true" multiValued="false" name="preview_image_location" stored="true" type="TextField"/>
<field indexed="true" multiValued="false" name="name" termVectors="true" stored="true" type="TextField"/>
<field indexed="true" multiValued="true" name="tags" termVectors="true" stored="true" type="TextField"/>
<field indexed="true" multiValued="false" name="userid" stored="true" type="UUIDField"/>
<field indexed="true" multiValued="false" name="videoid" stored="true" type="UUIDField"/>
<field indexed="true" multiValued="false" name="location_type" stored="true" type="TrieIntField"/>
<field indexed="true" multiValued="false" name="description" termVectors="true" stored="true" type="TextField"/>
</fields>
<uniqueKey>videoid</uniqueKey>
</schema>
<!--
=======
Copyright DataStax, Inc.
Please see the included license file for details.
-->
<!--
For more details about configurations options that may appear in
this file, see http://wiki.apache.org/solr/SolrConfigXml.
-->
<config>
<!-- In all configuration below, a prefix of "solr." for class names
is an alias that causes solr to search appropriate packages,
including org.apache.solr.(search|update|request|core|analysis)
You may also specify a fully qualified Java classname if you
have your own custom plugins.
-->
…
SELECT * FROM killrvideo.videos
WHERE solr_query='name:*'
Search is everywhere. Tell story about kids with search and recommendations.
Properly written queries are completely precise. A query gives you exactly the result you ask for: exactly what you asked for, which may or may not be exactly what you want. The hardest part of a query is asking the right question.
Search, on the other hand, is not precise. A dumb search just gives you a randomly ordered list of items in which your search term occurs. A smart search gives you a ranked list of items that are strongly related to the search terms you entered, even if they do not match them exactly. This means that even though some of the results will doubtless be unrelated, and sometimes absurd, there is a very good chance that usable data sits high up in those results, even if you didn’t ask quite the right question. And there may well be relevant data you didn’t even think to ask for.
Being smart in this way is a key benefit of search. Queries cannot be smart; they must always give you exactly what you asked for. There is no tolerance for serendipity in query results. Search can be smart, but query must be dumb and strictly obedient.
“Find me all webpages that contain ‘Cassandra’ and ‘Optimization’”
“Recommend artists who are ‘like’ Taylor Swift”
“Highlight the keywords ‘Awesome’, ‘Good’, and ‘Amazing’”
Find me all shoes with:
Size: 12
Price: $15 - $60
Brand: Nike
Lucene is the base of nearly every popular search engine out there, including Elasticsearch (which is more of a fried taco shell). It is an open source, fast, high-performance, scalable search/IR library, initially developed by Doug Cutting (also the author of Hadoop). It provides advanced search options like synonyms, stopwords, similarity, and proximity; faceted search (search by price, size, manufacturer, etc.); and geospatial search (combining location information, filtering by distance, and more).
Solr was created by Yonik Seeley at CNET as an enterprise search platform for Apache Lucene.
It is open source, highly reliable, scalable, and fault tolerant. It supports distributed indexing (SolrCloud), replication, and load-balanced querying. Solr “wraps” the Lucene open source information retrieval engine.
You customize Solr via configuration & plug-ins
Free-form text search includes wildcards and phrases
Query support includes filtering (ranges, geo-spatial) and sorting
You can run Solr as a webapp, or in stand-alone (embedded) mode
You make search requests using HTTP GET requests
Documents are added and deleted via HTTP POST requests
Lucene’s index structures are extremely efficient and fast, and they live on disk. Simple queries are typically fulfilled in < 1 ms; more complex ones in < 10 ms.
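The reason term lookups are so fast is the inverted index: instead of scanning documents for a term, the engine jumps straight from the term to a precomputed list of matching document ids. Here is a minimal sketch (Lucene’s on-disk structures are far more elaborate, but the lookup shape is the same):

```python
from collections import defaultdict

# Minimal inverted index: term -> sorted list of document ids.
docs = {
    0: "cassandra optimization tips",
    1: "taylor swift tour",
    2: "cassandra cluster tuning",
}

index = defaultdict(list)
for doc_id, text in docs.items():
    for term in set(text.split()):   # dedupe terms within a doc
        index[term].append(doc_id)

def search(term):
    """Direct jump from term to its posting list; no document scan."""
    return sorted(index.get(term, []))

print(search("cassandra"))  # -> [0, 2]
```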
Not a database: a user has to build an interface to use it like one; there is no built-in security and no SQL-like language.
Requires ETL: data needs to be moved from a DBMS or other system to be indexed, which increases operational complexity and risk and delays when the data becomes available.
No automatic indexing: there is no automatic reindex after a Lucene upgrade; to reindex, you need to re-ingest the data.
Must reindex data after upgrade.
Not transparently sharded: sharding is only available in SolrCloud, and it is not transparent.
Doesn’t cluster: no HA or failover built in; no load rebalancing when you need to scale out.
Cassandra has a lot of the things Solr is missing, so the great minds at DataStax engineering put them together to create all the good things in one easy-to-use package.
What is Cassandra? High-level overview: an OSS NoSQL database optimized for extreme OLTP workloads. I will proceed assuming you understand the basics of C* from here; if you don’t, please check out some of the other great intro videos on C* available on DSA.
You get a lot of the benefits of using Solr on top of C*: no ETL! Just load the data into Cassandra and it’s automatically indexed by Solr.
Data is automatically replicated from the Cassandra DC to the Solr DC.
No single point of failure.
No ETL.
Let’s look closer at how these two technologies are integrated. Cassandra stores the data, and Solr creates the Lucene indexes required to perform searches on that particular node’s data.
Cassandra and Solr reside together on a single DSE Search node, in the same JVM. Consequently, you will want to size these nodes a bit larger than a typical Cassandra node, especially when it comes to memory. This also makes for a very busy garbage collector, so it is recommended that you pay close attention to your JVM and garbage collection settings.
Review the C* write path. The Solr write path in DSE Search is very similar: the coordinator communicates with the shard router to determine which node(s) get the write.
• DSE figures out which shards to query, favoring local shards over remote shards.
• It reduces fan-out by using multiple shards on each node (this assumes replication factor > 1).
• A Cassandra-aware router implementation called ShardRouter handles this.
• DSE sends a distributed query to remote nodes, using random shuffle to pick nodes by default; other shard shuffle strategies are available.
• Each query carries an fq=id:[token range start TO token range end], which limits the request to a subset of the shards on a node.
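That token-range filter can be sketched as follows. This is a hypothetical illustration of the idea, not the actual ShardRouter code: given the token ranges a node owns, the coordinator builds one filter query per range so the node only searches the shards covering its own data.

```python
# Hypothetical sketch: build per-node Solr filter queries from owned
# token ranges, in the spirit of fq=id:[start TO end]. Function name
# and structure are illustrative assumptions, not DSE internals.
def token_range_fq(ranges):
    """One Solr range filter query per owned token range."""
    return ["fq=id:[{} TO {}]".format(start, end) for start, end in ranges]

# A node owning two slices of the Murmur3 token ring (example values):
node_ranges = [(-9223372036854775808, -1), (0, 9223372036854775807)]
for fq in token_range_fq(node_ranges):
    print(fq)
```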
The data is first written to the RAM buffer. From the RAM buffer, based on the soft commit threshold, the data is written to index segments that live on disk but are not fsynced. When the C* memtables flush to SSTables on disk, the in-memory index segments are fsynced and “hard committed” to disk.
In traditional OSS Solr, the RAM buffer is not searchable, so data is not available to search until it is soft committed. Soft commits are not durable, so if the JVM crashes before they are synced to disk, you may lose some of your indexes. This creates a delay before newly ingested data can be read, and the soft commit threshold needs to be tuned; the delay is typically 1-10 seconds.
DataStax has modified OSS Solr so that the RAM buffer is now searchable, a bit like Cassandra memtables. This means the data is available to be searched very shortly after being ingested by Cassandra, in some cases in less than one second. Your real-time application data is searchable without waiting for a soft commit, which not only improves your application’s responsiveness but also allows more data to be indexed in a shorter period of time.
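The visibility difference can be modeled with a toy index. This is purely illustrative (class and method names are made up); it only captures the one behavior discussed above: in stock Solr, documents in the RAM buffer are invisible until a soft commit, while in DSE’s modified Solr the RAM buffer itself is searchable.

```python
# Toy model of index visibility (illustration only, not Solr internals).
class ToyIndex:
    def __init__(self, realtime_searchable):
        self.ram_buffer = []              # newly ingested, not yet committed
        self.segments = []                # soft-committed, searchable segments
        self.realtime_searchable = realtime_searchable

    def add(self, doc):
        self.ram_buffer.append(doc)

    def soft_commit(self):
        # Move buffered docs into searchable (but not yet fsynced) segments.
        self.segments.extend(self.ram_buffer)
        self.ram_buffer = []

    def search(self, doc):
        visible = list(self.segments)
        if self.realtime_searchable:      # DSE's modification
            visible += self.ram_buffer
        return doc in visible

oss = ToyIndex(realtime_searchable=False)
oss.add("new video")
print(oss.search("new video"))   # False: invisible until soft commit

dse = ToyIndex(realtime_searchable=True)
dse.add("new video")
print(dse.search("new video"))   # True: RAM buffer is searchable
```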
This ONLY allows search on tags, and the “index” table needs to be maintained manually. That is fine if this is all you want to do, and it is very common.
Location_type, currently with only two values, would work with a secondary index (2i); name, on the other hand, would not. So how do you search by name?
Secondary indexes let you search on the whole field, but they aren’t efficient for higher-cardinality values, like email addresses.
• A secondary index creates additional data structures on each node, covering the table partitions stored on that node.
• Each “local index” indexes values in rows stored locally.
• A query on an indexed column alone requires accessing the “local indexes” on all nodes: expensive.
• A query on a partition key plus an indexed column requires accessing a “local index” on one (or a few) nodes: efficient.
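A back-of-envelope way to see the expensive/efficient split is to count nodes contacted. The model below is a deliberate simplification with hypothetical numbers (real routing depends on replication factor and topology): with the partition key, the query is routed to the replica(s) for that partition; without it, every node’s local index must be consulted.

```python
# Simplified cost model for secondary-index queries (illustration only).
def nodes_contacted(total_nodes, have_partition_key):
    # With the partition key, the coordinator routes to the owning
    # replica; without it, every node's local index must be consulted.
    return 1 if have_partition_key else total_nodes

print(nodes_contacted(12, have_partition_key=False))  # 12: expensive
print(nodes_contacted(12, have_partition_key=True))   # 1: efficient
```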
But say you want to search the text in description or title? What now?