NoSQL (Not Only SQL) is often described as a superset of relational SQL databases, or sometimes as a set that intersects with them. The concept is still taking shape, but one thing is already clear: NoSQL addresses the task of storing and retrieving large volumes of data in systems under high load. There is another important angle on the concept:
NoSQL systems can store and efficiently search unstructured or semi-structured data, such as completely raw or preprocessed documents. Using the world-class document retrieval system Apache SOLR (a performant HTTP wrapper around Apache Lucene) as a reference, we will look at its use cases, horizontal and vertical scalability, faceted search, distribution and load balancing, crawling, extensibility, linguistic support, integration with relational databases and much more.
Dmitry Kan will briefly touch upon the *hot* topic of cloud computing using the well-known Apache Hadoop project and will help the audience see whether SOLR shines through the cloud.
3. •The term NoSQL was coined in 1998 by Carlo Strozzi, who notes that the NoSQL
movement "departs from the relational model altogether; it should
therefore have been called more appropriately 'NoREL', or something to
that effect." (wikipedia)
•NoSQL = Not Only SQL
•Companies: Facebook, Twitter, Digg, Amazon, LinkedIn and Google
•Data storage: billion gigabytes (GB) of data
•Interconnected data: hyperlinks, blog pingbacks, social networks
•Complex data structures: hierarchical, nested data is stored easily
(it would take multiple relational tables in SQL)
•Performance: the more data an SQL database holds, the more likely its performance is to degrade
•NoSQL is not:
•… SQL, and not relational
•… a replacement for SQL, but a complement to it
•… bound to a fixed schema or to joins
•… built to "scale up" (vertical scaling, as with an RDBMS), but rather to "scale
out" (spreading the load over many commodity systems): horizontal
scaling
4. NoSQL Categories
•Key-value Stores: a big hash table with caching mechanisms
•Column Family Stores: keys point to multiple columns (Google’s BigTable)
•Document Databases: documents are collections of other key-value
collections
•Graph Databases: nodes, relationships between nodes, and node properties
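To make the four categories concrete, here is a minimal sketch that expresses one record in each data model using plain Python structures. All keys, field names and values are hypothetical sample data; no real database API is involved.

```python
# Hypothetical sample data: one blogger expressed in each of the four models.

# Key-value store: an opaque value looked up by a single key.
key_value = {"user:42": '{"name": "Jon Foobar", "city": "Helsinki"}'}

# Column family store: a row key points to named columns, grouped into families.
column_family = {
    "user:42": {
        "profile": {"name": "Jon Foobar", "city": "Helsinki"},
        "stats": {"posts": 17},
    }
}

# Document database: a document nests other key-value collections.
document = {
    "_id": "user:42",
    "name": "Jon Foobar",
    "blog": {"url": "http://example.org/jon", "tags": ["nosql", "solr"]},
}

# Graph database: nodes, relationships between nodes, and node properties.
nodes = {1: {"name": "Jon Foobar"}, 2: {"name": "Jane Roe"}}
relationships = [(1, "FOLLOWS", 2)]


def followed_by(node_id):
    """Names of the nodes that node_id has a FOLLOWS relationship to."""
    return [nodes[dst]["name"] for src, rel, dst in relationships
            if src == node_id and rel == "FOLLOWS"]


print(column_family["user:42"]["stats"]["posts"])
print(document["blog"]["tags"][0])
print(followed_by(1))
```

The point of the sketch is the shape of the data: the key-value store cannot look inside its value, while the other three models expose columns, nested fields or relationships to queries.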
Major NoSQL players
•Dynamo: Amazon.com, key-value, used in Amazon S3 (Simple Storage
Service)
•Cassandra: open-sourced by Facebook, column-oriented NoSQL DB
•BigTable: Google’s proprietary column-oriented DB (App Engine)
•CouchDB: open-source document-oriented NoSQL DB (as is MongoDB)
•Neo4j: open-source graph DB
Querying NoSQL DB:
•Data model specific
•RESTful interfaces or query APIs
•SPARQL: declarative query language for RDF graph data
5. SPARQL Protocol and RDF Query Language
(courtesy of about.com and IBM)
Example of retrieving the URL of a blogger
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?url
FROM <bloggers.rdf>
WHERE {
?contributor foaf:name "Jon Foobar" .
?contributor foaf:weblog ?url .
}
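The query above joins two triple patterns on the shared variable ?contributor. The following minimal sketch evaluates the same two patterns by hand over an in-memory list of triples. The triples are hypothetical stand-ins for the contents of bloggers.rdf, and the helper name weblogs_of is made up for illustration. A real SPARQL engine does far more than this.

```python
# Hypothetical RDF triples standing in for bloggers.rdf: (subject, predicate, object).
TRIPLES = [
    ("_:a", "foaf:name", "Jon Foobar"),
    ("_:a", "foaf:weblog", "http://example.org/jon"),
    ("_:b", "foaf:name", "Jane Roe"),
    ("_:b", "foaf:weblog", "http://example.org/jane"),
]


def weblogs_of(name, triples=TRIPLES):
    """Evaluate the two triple patterns of the query by hand."""
    # Pattern 1: ?contributor foaf:name "<name>"
    contributors = {s for s, p, o in triples if p == "foaf:name" and o == name}
    # Pattern 2: ?contributor foaf:weblog ?url (same ?contributor binding)
    return [o for s, p, o in triples if p == "foaf:weblog" and s in contributors]


print(weblogs_of("Jon Foobar"))
```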
stats!
6. Some stats from Information Week, via
about.com (2010):
•44% of business IT professionals had not heard of NoSQL
•1% see NoSQL as a strategic direction
•Some stats from NerdCamp (April 2011):
•10% had heard of and used NoSQL
•Many more people know about the cloud, which may
increasingly become a driving platform behind
NoSQL
Does the world of NoSQL have enough mass to
appeal to IT now?
7. “Solr is the popular, blazing
fast open source enterprise
search platform from the
Apache Lucene project.”
Created by Yonik Seeley at
CNET
Features:
•Full-text search
•Hit highlighting
•Faceted search (dynamic clustering)
•DB integration
•Rich doc handling
•Geospatial search
•Distributed search
•Replication
•REST-like HTTP/XML & JSON APIs
http://lucene.apache.org/solr/
http://lucene.apache.org/solr/tutorial.html
http://lucene.apache.org/java/docs/index.html
10. Overview of current state, April 2011
Current version: Apache Solr 3.1 (March 31, 2011)
License: ASL 2.0
Features:
•Faceted navigation
•Hit highlighting
•GEO search: filter and sort by distance
•Spellcheck and auto suggest
•Advanced ranking and sorting
•Distributed and replicated search
•Structured / unstructured search
•Rich plugin architecture, extensible
Operating system support
All with a Java VM, including: Linux (all versions), Windows (all versions), MacOS (all versions), Unix variants
App-server support
Apache Tomcat, Jetty, Resin, WebLogic™, WebSphere™, GlassFish, dmServer™, JBoss™ and many more
Java version requirement
Java JDK 1.5 or later
Client API support
Java, .NET, PHP, Python, Ruby (on Rails), C++, XML/HTTP, JSON/HTTP ++
11. Faceted search
•A technique for refining search results
•Concept composition:
• Article + in English + about nerdcamp
• Finnish rap + < 1 minute + released in 2001
•Types:
• Standard facets (list of facets with values)
• Hierarchical facet values (taxonomy of facet
values)
• Range / query facets: by date, by price, by
alphabet, by interval
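The idea of refining by composed concepts can be sketched in a few lines: filter the result set by the selected facet values, then count the values of the remaining facet fields. This is a conceptual illustration only. The documents and field names are hypothetical, and Solr computes its facet counts inside the index rather than this way.

```python
from collections import Counter

# Hypothetical documents; the field names are illustrative only.
DOCS = [
    {"type": "article", "lang": "en", "topic": "nerdcamp"},
    {"type": "article", "lang": "fi", "topic": "nerdcamp"},
    {"type": "video", "lang": "en", "topic": "nerdcamp"},
    {"type": "article", "lang": "en", "topic": "solr"},
]


def refine(docs, **filters):
    """Keep only the documents whose fields match every selected facet value."""
    return [d for d in docs if all(d.get(f) == v for f, v in filters.items())]


def facet_counts(docs, field):
    """Count how many documents carry each value of the given field."""
    return Counter(d[field] for d in docs if field in d)


# "Article + in English": two selected facet values narrow the result set.
hits = refine(DOCS, type="article", lang="en")
print(facet_counts(DOCS, "lang"))
print(facet_counts(hits, "topic"))
```

In Solr itself the counts come back with the search response when the request carries facet=true and one or more facet.field parameters, and a selected value is typically applied as a filter query (fq).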
12. Spatial Search
Combines location data with text data
•Represent spatial data in the index
•Filter by some spatial concept such as a bounding box or other shape
•Sort by distance
•Score/boost by distance
<field name="store">45.17614,-93.87341</field> <!-- Buffalo store -->
<field name="store">40.7143,-74.006</field> <!-- NYC store -->
<field name="store">37.7752,-122.4232</field> <!-- San Francisco store -->
•bbox: bounding box filter (bbox is a range of lats and lons that
encompasses the circle of radius d)
•geodist: the distance function
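The two concepts can be sketched outside Solr as follows: a bounding box that roughly encloses the circle of radius d, and a great-circle distance used for sorting. The store coordinates come from the field examples above. The query point, the haversine formula and the flat bounding-box approximation (which ignores the poles and the date line) are assumptions of this sketch, not a description of Solr's internals.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

# Store coordinates taken from the field examples above.
STORES = {
    "Buffalo": (45.17614, -93.87341),
    "NYC": (40.7143, -74.006),
    "San Francisco": (37.7752, -122.4232),
}


def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def bbox(center, d_km):
    """Approximate lat/lon ranges of a box around the circle of radius d_km."""
    lat, lon = center
    dlat = math.degrees(d_km / EARTH_RADIUS_KM)
    dlon = math.degrees(d_km / (EARTH_RADIUS_KM * math.cos(math.radians(lat))))
    return (lat - dlat, lat + dlat), (lon - dlon, lon + dlon)


def nearby(center, d_km):
    """Filter stores by bounding box, then sort the survivors by distance."""
    (lat_lo, lat_hi), (lon_lo, lon_hi) = bbox(center, d_km)
    inside = [(name, pt) for name, pt in STORES.items()
              if lat_lo <= pt[0] <= lat_hi and lon_lo <= pt[1] <= lon_hi]
    return sorted(((name, haversine_km(center, pt)) for name, pt in inside),
                  key=lambda pair: pair[1])


# Hypothetical query point, roughly between Philadelphia and New York.
for name, km in nearby((40.0, -75.0), 200):
    print(name, round(km), "km")
```

In Solr 3.1 the equivalent request uses the sfield, pt and d parameters together with fq={!bbox} for filtering and sort=geodist() asc for ordering, as described on the SpatialSearch wiki page [11].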
14. Spellcheck and autosuggest
Spellcheck:
•Query suggestion for a misspelled query term
http://localhost:8983/solr/spell?q=hell ultrashar&spellcheck=true&spellcheck.collate=true&spellcheck.build=true
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="hell">
      <int name="numFound">1</int>
      <int name="startOffset">0</int>
      <int name="endOffset">4</int>
      <arr name="suggestion">
        <str>dell</str>
      </arr>
    </lst>
    <lst name="ultrashar">
      <int name="numFound">1</int>
      <int name="startOffset">5</int>
      <int name="endOffset">14</int>
      <arr name="suggestion">
        <str>ultrasharp</str>
      </arr>
    </lst>
    <str name="collation">dell ultrasharp</str>
  </lst>
</lst>
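A client has to URL-encode the misspelled query and then pick the collated suggestion out of the XML response. The sketch below does both with the Python standard library. The host, port and /spell handler are the ones from the example request. The helper names and the shortened sample response are assumptions made for illustration, and no request is actually sent.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def spellcheck_url(query, base="http://localhost:8983/solr/spell", build=False):
    """Build a spellcheck request URL with a properly encoded query string."""
    params = {"q": query, "spellcheck": "true", "spellcheck.collate": "true"}
    if build:
        # Asks Solr to (re)build the spelling index; normally not sent on every request.
        params["spellcheck.build"] = "true"
    return base + "?" + urlencode(params)


def collation(xml_text):
    """Return the collated suggestion from a spellcheck response, or None."""
    node = ET.fromstring(xml_text).find(".//str[@name='collation']")
    return None if node is None else node.text


# Shortened, hypothetical response containing only the collation element.
SAMPLE = ('<lst name="spellcheck"><lst name="suggestions">'
          '<str name="collation">dell ultrasharp</str></lst></lst>')

print(spellcheck_url("hell ultrashar", build=True))
print(collation(SAMPLE))
```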
Autosuggest:
Example with solr and jquery
15. Advanced sorting, ranking and searching
•sort=score+asc
•sort=Author+desc,score+desc
•boosting single documents
•Term frequency (tf)
•Inverse document frequency (idf)
•Co-ordination factor (coord): the more of the queried terms match,
the greater the score
•Field length (fieldNorm): the shorter the matching field is in number of
indexed terms, the greater the document’s score
•AND, OR, NOT, NEAR, fuzzy search
•Smashing~0.7 yields more results than just Smashing
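How the four ranking factors combine can be shown with a simplified scorer. The individual formulas follow Lucene's classic TF-IDF similarity (square root of term frequency, logarithmic idf, inverse square root of field length). The query normalization, boosts and lossy norm encoding of the real implementation are deliberately left out, so absolute scores will not match Solr's. The sample corpus is hypothetical.

```python
import math


def tf(freq):
    """Term frequency factor: grows with the square root of the raw count."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    """Inverse document frequency: rarer terms weigh more."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms):
    """Field length factor: shorter fields score higher."""
    return 1.0 / math.sqrt(num_terms)


def coord(matched, total):
    """Co-ordination factor: fraction of the query terms that matched."""
    return matched / total


def score(query_terms, doc_terms, corpus):
    """Simplified score of one tokenized document (no queryNorm, no boosts)."""
    matched = [t for t in query_terms if t in doc_terms]
    if not matched:
        return 0.0
    total = 0.0
    for term in matched:
        doc_freq = sum(1 for d in corpus if term in d)
        total += (tf(doc_terms.count(term))
                  * idf(doc_freq, len(corpus)) ** 2
                  * field_norm(len(doc_terms)))
    return coord(len(matched), len(query_terms)) * total


# Hypothetical tokenized documents.
CORPUS = [
    ["smashing", "pumpkins"],
    ["smashing", "pumpkins", "live", "in", "helsinki", "concert", "review", "2011"],
    ["pumpkins", "soup"],
]
QUERY = ["smashing", "pumpkins"]
for doc in CORPUS:
    print(round(score(QUERY, doc, CORPUS), 3), doc)
```

The short document that matches both terms ranks first, the long one second, and the document matching a single term last, which is the behaviour the coord and fieldNorm bullets describe.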
16. Distributed and replicated search
Before doing this:
•Consider vertical scaling (faster and better machine)
•Rethink the data model (what data goes to which solr index)
•Turn off logging on updates (and / or searches)
•Redesign your index: make as many fields as possible non-indexed and non-stored (depending on your use cases)
•Check your Internet connection
17. Extensibility
Plugins:
•Query parser: extend LuceneQParserPlugin
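public class NerdCampQParserPlugin extends LuceneQParserPlugin {
    @Override
    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
        // Custom query handling goes here; this sketch delegates to the stock Lucene parser.
        return super.createParser(qstr, localParams, params, req);
    }
}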
18. SOLR I/O
•Input: Nutch (crawler), CSV, XML, DataImportHandler, DB import, Apache Tika (rich document
import, like PDF), your own format
•Output: XML, JSON, Python, javabin, CSV…, your own format
19. SOLR Processing Pipeline
•At each step, a document gets transformed
•Stop word removal
•Stemming
•(smart) Tokenization
•N-grams (letter level and word level)
•Regular expressions
•Lowercasing
•Reversed wildcard
•Duplicate removal
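The pipeline idea, where each step takes the output of the previous one, can be sketched as a list of small functions. Everything here is a toy stand-in: the stop word list is tiny, the stemmer is crude suffix stripping rather than a real algorithm such as Porter, and the duplicate removal is simpler than what Solr's token filters do. All function names are made up for illustration.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is"}  # tiny illustrative list


def tokenize(text):
    """Split raw text into word tokens."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(tokens):
    """Crude suffix stripping, NOT a real stemmer."""
    stemmed = []
    for token in tokens:
        for suffix in ("ing", "es", "s"):
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        stemmed.append(token)
    return stemmed


def remove_duplicates(tokens):
    """Drop repeated tokens while keeping first-seen order (a simplification)."""
    return list(dict.fromkeys(tokens))


def char_ngrams(token, n):
    """Letter-level n-grams of a single token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


PIPELINE = [tokenize, lowercase, remove_stop_words, naive_stem, remove_duplicates]


def analyze(text):
    """Run the text through every step; each step transforms the previous output."""
    data = text
    for step in PIPELINE:
        data = step(data)
    return data


print(analyze("The Indexing of the documents and the document indexes"))
print(char_ngrams("solr", 2))
```

In Solr the corresponding chain is configured per field type as a tokenizer followed by token filters, as listed on the AnalyzersTokenizersTokenFilters wiki page [13].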
20. Solr on the cloud
Hadoop: MapReduce
ZooKeeper: run at least 3 ZooKeeper nodes so that a quorum stays available to manage your zoo
Batch indexing, no real-time search yet
Hadoop vital components: Core and API
•MapReduce (computation model)
•HDFS
•I/O
•ZooKeeper
•Pig (adds a level of abstraction for processing large datasets)
21. Solr on the cloud
Does it shine? Yes, but not fully
22. References
[1] Tim Perdue: NoSQL: An Overview of NoSQL Databases, About.com Guide
Sarah Pidcock (2011-01-31). http://bit.ly/fFQOYI
[2] "Dynamo: Amazon’s Highly Available Key-value Store".
http://www.cs.uwaterloo.ca/:
WATERLOO. p. 2/22. Retrieved 2011-04-05.
"Dynamo: a highly available and scalable distributed data store"
[3] http://cassandra.apache.org/
[4] http://labs.google.com/papers/bigtable.html
[5] http://aws.amazon.com/ (look for SimpleDB)
[6] http://couchdb.apache.org/
[7] http://neo4j.org/
[8] Information Week: Surprise: 44% Of Business IT Pros Never Heard Of NoSQL
http://bit.ly/go5ios
[9] http://drupal.org/
[10] Mark Miller: Scaling Lucene and Solr // Lucid Imagination
[11] http://wiki.apache.org/solr/SpatialSearch
[12] http://dmitrykan.blogspot.com/2011/01/solr-speed-up-batch-posting.html
[13] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
23. References
[14] Using Nutch with SOLR,
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
[15] http://tika.apache.org/
[16] http://lucene.apache.org/solr/