NoSQL, Apache SOLR and Apache Hadoop
By Dmitry Kan for NerdCamp, April 23 2011
dmitry.kan@gmail.com
Dilbert: expert in NoSQL
•The term NoSQL was coined in 1998 by Carlo Strozzi. Of today's NoSQL
movement he says that it "departs from the relational model altogether; it
should therefore have been called more appropriately 'NoREL', or something
to that effect" (Wikipedia)
•NoSQL = Not Only SQL
•Companies: Facebook, Twitter, Digg, Amazon, LinkedIn and Google


•Data storage: billion gigabytes (GB) of data
•Interconnected data: hyperlinks, blog pingbacks, social networks
•Complex data structures: hierarchical, nested data is easy to represent
(in SQL it takes multiple relational tables)
•Performance: the more data an SQL database holds, the more likely its
performance is to degrade


•NoSQL is not:
    •… SQL, and not relational
    •… a replacement for SQL, but a complement to it
    •… tied to a fixed schema or to joins
    •… designed to ”scale up” (RDBMS, vertical scaling); it ”scales out”
    (spreading the load over many commodity systems), i.e. horizontal
    scaling
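As a rough illustration of scaling out, the sketch below spreads keys over several machines by hashing each key. The node names and the `node_for` and `distribute` helpers are hypothetical, and real stores usually rely on more elaborate schemes (for example consistent hashing) so that adding a node moves fewer keys.

```python
import hashlib

# Hypothetical commodity nodes; the names are made up for illustration.
NODES = ["node-a", "node-b", "node-c"]


def node_for(key, nodes=NODES):
    """Pick the node that stores `key` using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def distribute(keys, nodes=NODES):
    """Group keys by the node they map to."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for(key, nodes)].append(key)
    return placement


if __name__ == "__main__":
    for node, keys in distribute(["user:%d" % i for i in range(10)]).items():
        print(node, keys)
```

Vertical scaling would instead keep a single entry in `NODES` and buy a bigger machine for it.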
NoSQL Categories

•Key-value Stores: a big hash table with caching mechanisms
•Column Family Stores: keys point to multiple columns (Google’s BigTable)
•Document Databases: documents are collections of other key-value
collections
•Graph Databases: nodes, relationships between nodes, and node properties
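To make the four categories more concrete, here is a minimal sketch that writes one invented record in each model using plain Python structures. All keys, field names and values are assumptions for illustration and do not reflect the storage format of any particular product.

```python
# The same hypothetical blogger record expressed in the four NoSQL models.

# Key-value store: an opaque value looked up by a single key.
key_value = {"user:42": '{"name": "Jon Foobar", "city": "Helsinki"}'}

# Column family store: a row key points to named column families,
# each holding its own set of columns.
column_family = {
    "user:42": {
        "profile": {"name": "Jon Foobar", "city": "Helsinki"},
        "activity": {"posts": 17, "last_login": "2011-04-23"},
    }
}

# Document database: a document is a nested collection of key-value collections.
document = {
    "_id": "user:42",
    "name": "Jon Foobar",
    "blogs": [{"title": "Search notes", "tags": ["solr", "nosql"]}],
}

# Graph database: nodes with properties plus typed relationships between nodes.
nodes = {"jon": {"name": "Jon Foobar"}, "ann": {"name": "Ann Example"}}
relationships = [("jon", "FOLLOWS", "ann")]


def followed_by(person):
    """Return the names of everyone `person` follows."""
    return [nodes[dst]["name"] for src, rel, dst in relationships
            if src == person and rel == "FOLLOWS"]
```

The key-value store can only fetch the whole value by key, while the other three models expose progressively more structure to query against.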

Major NoSQL players
•Dynamo: Amazon.com's key-value store, used in Amazon S3 (Simple Storage
Service)
•Cassandra: column-oriented NoSQL DB, open-sourced by Facebook
•BigTable: Google’s proprietary column-oriented DB (App Engine)
•CouchDB: open-source document-oriented NoSQL DB (as is MongoDB)
•Neo4j: open-source graph DB

Querying NoSQL DB:
•Data model specific
•RESTful interfaces or query APIs
•SPARQL (SPARQL Protocol and RDF Query Language): a declarative query
language for RDF graph data
(courtesy of about.com and IBM)
Example of retrieving the URL of a blogger

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?url
FROM <bloggers.rdf>
WHERE {
  ?contributor foaf:name "Jon Foobar" .
  ?contributor foaf:weblog ?url .
}
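To show what the two triple patterns in the WHERE clause do, here is a small pure-Python sketch. It is not a SPARQL engine: the triples are invented stand-ins for the contents of bloggers.rdf, and `weblogs_of` is a hypothetical helper that hard-codes this one query.

```python
FOAF = "http://xmlns.com/foaf/0.1/"

# Invented sample triples standing in for bloggers.rdf:
# (subject, predicate, object).
TRIPLES = [
    ("_:b1", FOAF + "name", "Jon Foobar"),
    ("_:b1", FOAF + "weblog", "http://example.org/jon/blog"),
    ("_:b2", FOAF + "name", "Ann Example"),
    ("_:b2", FOAF + "weblog", "http://example.org/ann/blog"),
]


def weblogs_of(name, triples=TRIPLES):
    """Mimic the WHERE clause: bind ?contributor by name, then read ?url."""
    # First pattern: ?contributor foaf:name "<name>"
    contributors = {s for s, p, o in triples
                    if p == FOAF + "name" and o == name}
    # Second pattern: ?contributor foaf:weblog ?url
    return [o for s, p, o in triples
            if p == FOAF + "weblog" and s in contributors]


if __name__ == "__main__":
    print(weblogs_of("Jon Foobar"))
```

The shared variable ?contributor is what joins the two patterns; in the sketch that role is played by the `contributors` set.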




Stats!
Some stats from Information Week via about.com (2010):
•44% of business IT professionals haven’t heard of NoSQL
•1% see NoSQL as a strategic direction

Some stats from NerdCamp (April 2011):
•10% have heard of NoSQL and used it
•Many more people know about the cloud, which may increasingly become a
driving platform behind NoSQL


Does the world of NoSQL have enough mass to
appeal to IT now?
“Solr is the popular, blazing fast open source enterprise search platform
from the Apache Lucene project.”

Created by Yonik Seeley at CNET

Features:
•Full-text search
•Hit highlighting
•Faceted search (dynamic clustering)
•DB integration
•Rich document handling
•Geospatial search
•Distributed search
•Replication
•REST-like HTTP/XML & JSON APIs

http://lucene.apache.org/solr/
http://lucene.apache.org/solr/tutorial.html
http://lucene.apache.org/java/docs/index.html

Books
drupal



Companies using SOLR
Overview of current state (April 2011)

Current version: Apache Solr 3.1 (March 31, 2011)
License: ASL 2.0
Features:
•Faceted navigation
•Hit highlighting
•GEO search: filter and sort by distance
•Spellcheck and auto suggest
•Advanced ranking and sorting
•Distributed and replicated search
•Structured / unstructured search
•Rich plugin architecture, extensible

Operating system support: all with a Java VM, including:
•Linux (all versions)
•Windows (all versions)
•MacOS (all versions)
•Unix variants
App-server support: Apache Tomcat, Jetty, Resin, WebLogic™, WebSphere™,
GlassFish, dmServer™, JBoss™ and many more
Java version requirement: Java JDK 1.5 or later
Client API support: Java, .NET, PHP, Python, Ruby (on Rails), C++,
XML/HTTP, JSON/HTTP ++

Faceted search
•A technique for refining search results
•Concept composition:
    • Article + in English + about nerdcamp
    • Finnish rap + < 1 minute + released in 2001


•Types:
    • Standard facets (list of facets with values)
    • Hierarchical facet values (taxonomy of facet
      values)
    • Range / query facets: by date, by price, by
      alphabet, by interval
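The idea of refining results by composing facet values can be sketched in a few lines. The documents and the field names `type`, `language` and `topic` are invented; `solr_facet_url` only assembles a request from Solr's `facet`, `facet.field` and `fq` parameters and assumes a default select handler on localhost.

```python
from collections import Counter
from urllib.parse import urlencode

# Invented documents; the field names are assumptions for illustration.
DOCS = [
    {"type": "article", "language": "English", "topic": "nerdcamp"},
    {"type": "article", "language": "Finnish", "topic": "nerdcamp"},
    {"type": "video", "language": "English", "topic": "nerdcamp"},
    {"type": "article", "language": "English", "topic": "solr"},
]


def refine(docs, **filters):
    """Keep only documents that match every chosen facet value."""
    return [d for d in docs if all(d.get(f) == v for f, v in filters.items())]


def facet_counts(docs, field):
    """Count how many documents carry each value of `field`."""
    return Counter(d[field] for d in docs if field in d)


def solr_facet_url(query, facet_field, filters,
                   base="http://localhost:8983/solr/select"):
    """Build a request using Solr's facet, facet.field and fq parameters."""
    params = [("q", query), ("facet", "true"), ("facet.field", facet_field)]
    params += [("fq", "%s:%s" % (f, v)) for f, v in filters]
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    hits = refine(DOCS, type="article", language="English")
    print(facet_counts(hits, "topic"))
    print(solr_facet_url("nerdcamp", "language", [("type", "article")]))
```

Each call to `refine` corresponds to the user clicking one more facet value, and `facet_counts` produces the numbers shown next to the remaining choices.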
Spatial Search

Combines location data with text data
•Represent spatial data in the index
•Filter by some spatial concept such as a bounding box or other shape
•Sort by distance
•Score/boost by distance

<field name="store">45.17614,-93.87341</field> <!-- Buffalo store -->
<field name="store">40.7143,-74.006</field> <!-- NYC store -->
<field name="store">37.7752,-122.4232</field> <!-- San Francisco store -->

•bbox: bounding box filter (bbox is a range of lats and lons that
encompasses the circle of radius d)
•geodist: the distance function
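The filter-then-sort idea behind bbox and geodist can be sketched outside Solr as follows. This is an approximation under stated assumptions: a spherical Earth with a mean radius of 6371 km, a simple latitude/longitude box that ignores the poles and the date line, and hypothetical helper names (`haversine_km`, `in_bbox`, `nearby`) that are not Solr functions.

```python
import math

# Store coordinates taken from the <field name="store"> examples above.
STORES = {
    "Buffalo": (45.17614, -93.87341),
    "NYC": (40.7143, -74.006),
    "San Francisco": (37.7752, -122.4232),
}
EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def in_bbox(point, center, d_km):
    """Cheap pre-filter: is `point` inside the approximate box around the circle?"""
    dlat = math.degrees(d_km / EARTH_RADIUS_KM)
    dlon = dlat / math.cos(math.radians(center[0]))
    return (abs(point[0] - center[0]) <= dlat
            and abs(point[1] - center[1]) <= dlon)


def nearby(center, d_km):
    """Filter by bounding box, then sort the survivors by distance."""
    hits = [(haversine_km(center, p), name)
            for name, p in STORES.items() if in_bbox(p, center, d_km)]
    return [name for _, name in sorted(hits)]
```

The box test needs only comparisons, which is why a bounding box is a cheaper filter than computing the exact distance for every document; the price is that it can admit points in the corners that lie slightly outside the circle.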
Hit highlighting

Example from solr admin
Spellcheck and autosuggest

Spellcheck:
•Query suggestions for a misspelled query term
http://localhost:8983/solr/spell?q=hell ultrashar&spellcheck=true&spellcheck.collate=true&spellcheck.build=true
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="hell">
      <int name="numFound">1</int>
      <int name="startOffset">0</int>
      <int name="endOffset">4</int>
      <arr name="suggestion"> <str>dell</str> </arr>
    </lst>
    <lst name="ultrashar">
      <int name="numFound">1</int>
      <int name="startOffset">5</int>
      <int name="endOffset">14</int>
      <arr name="suggestion"> <str>ultrasharp</str> </arr>
    </lst>
    <str name="collation">dell ultrasharp</str>
  </lst>
</lst>
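A client can pull the suggestions and the collated query out of such a response with a standard XML parser. The sketch below works on an abridged copy of the spellcheck fragment shown above (the offset elements are left out); `parse_spellcheck` is a hypothetical helper, not part of any Solr client library.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the spellcheck fragment from the response above.
RESPONSE = """
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="hell">
      <int name="numFound">1</int>
      <arr name="suggestion"><str>dell</str></arr>
    </lst>
    <lst name="ultrashar">
      <int name="numFound">1</int>
      <arr name="suggestion"><str>ultrasharp</str></arr>
    </lst>
    <str name="collation">dell ultrasharp</str>
  </lst>
</lst>
"""


def parse_spellcheck(xml_text):
    """Return ({misspelled term: [suggestions]}, collated query or None)."""
    root = ET.fromstring(xml_text.strip())
    suggestions_node = root.find("lst[@name='suggestions']")
    suggestions = {}
    for term in suggestions_node.findall("lst"):
        words = [s.text for s in term.findall("arr[@name='suggestion']/str")]
        suggestions[term.get("name")] = words
    collation = suggestions_node.find("str[@name='collation']")
    return suggestions, (collation.text if collation is not None else None)


if __name__ == "__main__":
    print(parse_spellcheck(RESPONSE))
```

The collation is the piece a "Did you mean ...?" link would typically use, since it is the whole query rewritten with the suggested terms.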

Autosuggest:
Example with solr and jquery
Advanced sorting, ranking and searching

•sort=score+asc
•sort=Author+desc,score+desc
•boosting single documents

•Term Frequency (tf): the more often a term occurs in a document, the
greater the score
•Inverse Document Frequency (idf): the rarer a term is across the index,
the greater the score
•Co-ordination Factor (coord): the more of the queried terms match, the
greater the score
•Field Length (fieldNorm): the fewer indexed terms the matching field
contains, the greater the document’s score

•AND, OR, NOT, NEAR, fuzzy search
•The fuzzy query Smashing~0.7 yields more results than the exact term Smashing
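The four scoring factors can be combined in a simplified sketch. The formulas follow the classic Lucene TF-IDF similarity as far as I know it (square-root tf, logarithmic idf, inverse square-root length norm), but the `score` function is an assumption-laden simplification: it leaves out the query norm, boosts and the lossy encoding of norms, so its numbers will not match what Solr reports.

```python
import math


def tf(freq):
    """Term frequency factor: grows with the square root of the count."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    """Inverse document frequency: rarer terms weigh more."""
    return 1.0 + math.log(num_docs / (doc_freq + 1.0))


def field_norm(num_terms):
    """Length normalisation: shorter fields weigh more."""
    return 1.0 / math.sqrt(num_terms)


def coord(overlap, max_overlap):
    """Fraction of the query terms that the document matches."""
    return overlap / max_overlap


def score(query_terms, doc_terms, doc_freqs, num_docs):
    """Simplified score of one document (a token list) for a list of query terms."""
    matched = [t for t in query_terms if t in doc_terms]
    if not matched:
        return 0.0
    norm = field_norm(len(doc_terms))
    total = sum(tf(doc_terms.count(t)) * idf(doc_freqs[t], num_docs) ** 2 * norm
                for t in matched)
    return coord(len(matched), len(query_terms)) * total
```

For example, with the same term statistics a two-term field containing "solr" scores higher for the query "solr" than a four-term field containing it once, which is the field length effect described above.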
Distributed and replicated search




Before doing this:
•Consider vertical scaling (faster and better machine)
•Rethink the data model (what data goes to which solr index)
•Remove logging on updates (and / or searches)
•Redesign your index: make as many fields non-indexed and non-stored as your use cases allow
•Check your Internet connection
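Extensibility
Plugins:
•Query parser: extend LuceneQParserPlugin

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.LuceneQParserPlugin;
import org.apache.solr.search.QParser;
public class NerdCampQParserPlugin extends LuceneQParserPlugin {
    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
        return super.createParser(qstr, localParams, params, req); // placeholder: add custom parsing here
    }
}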
SOLR I/O
•Input: Nutch (crawler), CSV, XML, DataImportHandlers, DB import, Apache
Tika (rich document import, e.g. PDF), your own format

•Output: XML, JSON, Python, javabin, CSV…, your own format
SOLR Processing Pipeline
•On each step, a document gets transformed
•Stop words removal
•Stemming
•(smart) Tokenization
•Ngrams (letter level and word level)
•Regular expressions
•Lowercasing
•Reversed wildcard
•Duplicate removal
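The pipeline idea, where each step transforms the output of the previous one, can be sketched as a chain of small functions. This is a toy stand-in rather than Solr's analyzer chain: the stop word list is tiny, the stemmer is a naive suffix stripper, duplicate removal works on whole tokens, and all function names are assumptions.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "is", "to"}  # tiny illustrative list


def tokenize(text):
    """Split on anything that is not a letter or digit."""
    return re.findall(r"[A-Za-z0-9]+", text)


def lowercase(tokens):
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def stem(tokens):
    """Very naive suffix stripping; real stemmers are far subtler."""
    out = []
    for t in tokens:
        for suffix in ("ing", "ed", "s"):
            if t.endswith(suffix) and len(t) - len(suffix) >= 3:
                t = t[: -len(suffix)]
                break
        out.append(t)
    return out


def remove_duplicates(tokens):
    """Drop repeated tokens while keeping first-seen order."""
    return list(dict.fromkeys(tokens))


def char_ngrams(token, n=3):
    """Letter-level n-grams of a single token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


PIPELINE = [tokenize, lowercase, remove_stop_words, stem, remove_duplicates]


def analyze(text):
    """Run the text through every step of the pipeline in order."""
    result = text
    for step in PIPELINE:
        result = step(result)
    return result
```

The order of the steps matters: stop word removal here assumes lowercased tokens, so swapping those two steps would let "The" slip through.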
Solr on the cloud
Hadoop: MapReduce
ZooKeeper: run at least 3 ZooKeeper nodes, so that 2 can still manage your zoo if one fails
Batch indexing, no realtime search yet




Hadoop vital components: Core and API

•MapReduce: computation model
•HDFS
•I/O
•ZooKeeper
•Pig (adds a level of abstraction for processing large datasets)
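The MapReduce computation model can be illustrated with the usual word count example, run here in a single process. This sketch does not use Hadoop's API; `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical names, and on a real cluster the framework would run the map and reduce calls in parallel on many machines.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit (word, 1) for every word in one input record."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values emitted for one key."""
    return key, sum(values)


def word_count(docs):
    """Run map, shuffle and reduce over a {doc_id: text} mapping."""
    pairs = [pair for doc_id, text in docs.items()
             for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count({"d1": "solr on hadoop", "d2": "Hadoop and Solr"}))
```

Because each map call sees only one record and each reduce call sees only one key, both phases can be split across machines without coordination beyond the shuffle.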
Solr on the cloud
Does it shine? Yes, but not fully
References
[1] Tim Perdue: NoSQL: An Overview of NoSQL Databases, About.com Guide
Sarah Pidcock (2011-01-31). http://bit.ly/fFQOYI
[2] "Dynamo: Amazon’s Highly Available Key-value Store".
http://www.cs.uwaterloo.ca/:
WATERLOO. p. 2/22. Retrieved 2011-04-05.
"Dynamo: a highly available and scalable distributed data store"
[3] http://cassandra.apache.org/
[4] http://labs.google.com/papers/bigtable.html
[5] http://aws.amazon.com/ (look for SimpleDB)
[6] http://couchdb.apache.org/
[7] http://neo4j.org/
[8] Information Week: Surprise: 44% Of Business IT Pros Never Heard Of NoSQL
http://bit.ly/go5ios
[9] http://drupal.org/
[10] Mark Miller: Scaling Lucene and Solr // Lucid Imagination
[11] http://wiki.apache.org/solr/SpatialSearch
[12] http://dmitrykan.blogspot.com/2011/01/solr-speed-up-batch-posting.html
[13] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
References
[14] Using Nutch with SOLR,
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
[15] http://tika.apache.org/
[16] http://lucene.apache.org/solr/

Mais conteúdo relacionado

Mais procurados

SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...
SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...
SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...Lucidworks
 
Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...
Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...
Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...Erik Hatcher
 
Solr: 4 big features
Solr: 4 big featuresSolr: 4 big features
Solr: 4 big featuresDavid Smiley
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesRahul Jain
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes WorkshopErik Hatcher
 
Not Just ORM: Powerful Hibernate ORM Features and Capabilities
Not Just ORM: Powerful Hibernate ORM Features and CapabilitiesNot Just ORM: Powerful Hibernate ORM Features and Capabilities
Not Just ORM: Powerful Hibernate ORM Features and CapabilitiesBrett Meyer
 
Scalability andefficiencypres
Scalability andefficiencypresScalability andefficiencypres
Scalability andefficiencypresNekoGato
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsSolr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsLucidworks
 
Building a High Performance Environment for RDF Publishing
Building a High Performance Environment for RDF PublishingBuilding a High Performance Environment for RDF Publishing
Building a High Performance Environment for RDF Publishingdr0i
 
eZ Find workshop: advanced insights & recipes
eZ Find workshop: advanced insights & recipeseZ Find workshop: advanced insights & recipes
eZ Find workshop: advanced insights & recipesPaul Borgermans
 
DSpace 4.2 Basics & Configuration
DSpace 4.2 Basics & ConfigurationDSpace 4.2 Basics & Configuration
DSpace 4.2 Basics & ConfigurationDuraSpace
 
Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5israelekpo
 
Solr Flair: Search User Interfaces Powered by Apache Solr
Solr Flair: Search User Interfaces Powered by Apache SolrSolr Flair: Search User Interfaces Powered by Apache Solr
Solr Flair: Search User Interfaces Powered by Apache SolrErik Hatcher
 
Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...
Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...
Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...Lucidworks
 

Mais procurados (20)

SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...
SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...
SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge...
 
Solr Flair
Solr FlairSolr Flair
Solr Flair
 
Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...
Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...
Solr Flair: Search User Interfaces Powered by Apache Solr (ApacheCon US 2009,...
 
Discovery Interfaces
Discovery InterfacesDiscovery Interfaces
Discovery Interfaces
 
Solr: 4 big features
Solr: 4 big featuresSolr: 4 big features
Solr: 4 big features
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes Workshop
 
Not Just ORM: Powerful Hibernate ORM Features and Capabilities
Not Just ORM: Powerful Hibernate ORM Features and CapabilitiesNot Just ORM: Powerful Hibernate ORM Features and Capabilities
Not Just ORM: Powerful Hibernate ORM Features and Capabilities
 
Scalability andefficiencypres
Scalability andefficiencypresScalability andefficiencypres
Scalability andefficiencypres
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsSolr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
 
Building a High Performance Environment for RDF Publishing
Building a High Performance Environment for RDF PublishingBuilding a High Performance Environment for RDF Publishing
Building a High Performance Environment for RDF Publishing
 
eZ Find workshop: advanced insights & recipes
eZ Find workshop: advanced insights & recipeseZ Find workshop: advanced insights & recipes
eZ Find workshop: advanced insights & recipes
 
DSpace 4.2 Basics & Configuration
DSpace 4.2 Basics & ConfigurationDSpace 4.2 Basics & Configuration
DSpace 4.2 Basics & Configuration
 
Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5
 
Solr Flair: Search User Interfaces Powered by Apache Solr
Solr Flair: Search User Interfaces Powered by Apache SolrSolr Flair: Search User Interfaces Powered by Apache Solr
Solr Flair: Search User Interfaces Powered by Apache Solr
 
Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...
Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...
Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovit...
 
Solr 4
Solr 4Solr 4
Solr 4
 
How Solr Search Works
How Solr Search WorksHow Solr Search Works
How Solr Search Works
 

Destaque

Presentation solr 10 Aout 2011 (french)
Presentation solr 10 Aout 2011 (french)Presentation solr 10 Aout 2011 (french)
Presentation solr 10 Aout 2011 (french)Thibaud Vibes
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoopgregchanan
 
Semantic feature machine translation system
Semantic feature machine translation systemSemantic feature machine translation system
Semantic feature machine translation systemDmitry Kan
 
Automatic Build Of Semantic Translational Dictionary
Automatic Build Of Semantic Translational DictionaryAutomatic Build Of Semantic Translational Dictionary
Automatic Build Of Semantic Translational DictionaryDmitry Kan
 
Machine translation course program (in English)
Machine translation course program (in English)Machine translation course program (in English)
Machine translation course program (in English)Dmitry Kan
 
Lucene revolution eu 2013 dublin writeup
Lucene revolution eu 2013 dublin writeupLucene revolution eu 2013 dublin writeup
Lucene revolution eu 2013 dublin writeupDmitry Kan
 
Social spam detection by SemanticAnalyzer Group
Social spam detection by SemanticAnalyzer GroupSocial spam detection by SemanticAnalyzer Group
Social spam detection by SemanticAnalyzer GroupDmitry Kan
 
Introduction To Machine Translation 1
Introduction To Machine Translation 1Introduction To Machine Translation 1
Introduction To Machine Translation 1Dmitry Kan
 
Solr onfitnesse learningfromberlinbuzzwords
Solr onfitnesse learningfromberlinbuzzwordsSolr onfitnesse learningfromberlinbuzzwords
Solr onfitnesse learningfromberlinbuzzwordsDmitry Kan
 
Starget sentiment analyzer for English
Starget sentiment analyzer for EnglishStarget sentiment analyzer for English
Starget sentiment analyzer for EnglishDmitry Kan
 
Linguistic component Sentiment Analyzer for the Russian language
Linguistic component Sentiment Analyzer for the Russian languageLinguistic component Sentiment Analyzer for the Russian language
Linguistic component Sentiment Analyzer for the Russian languageDmitry Kan
 
Linguistic component Lemmatizer for the Russian language
Linguistic component Lemmatizer for the Russian languageLinguistic component Lemmatizer for the Russian language
Linguistic component Lemmatizer for the Russian languageDmitry Kan
 
MTEngine: Semantic-level Crowdsourced Machine Translation
MTEngine: Semantic-level Crowdsourced Machine TranslationMTEngine: Semantic-level Crowdsourced Machine Translation
MTEngine: Semantic-level Crowdsourced Machine TranslationDmitry Kan
 
Introduction To Machine Translation
Introduction To Machine TranslationIntroduction To Machine Translation
Introduction To Machine TranslationDmitry Kan
 
Big Data Computing Architecture
Big Data Computing ArchitectureBig Data Computing Architecture
Big Data Computing ArchitectureGang Tao
 
Rule based approach to sentiment analysis at ROMIP 2011
Rule based approach to sentiment analysis at ROMIP 2011Rule based approach to sentiment analysis at ROMIP 2011
Rule based approach to sentiment analysis at ROMIP 2011Dmitry Kan
 
Poster: Method for an automatic generation of a semantic-level contextual tra...
Poster: Method for an automatic generation of a semantic-level contextual tra...Poster: Method for an automatic generation of a semantic-level contextual tra...
Poster: Method for an automatic generation of a semantic-level contextual tra...Dmitry Kan
 
Rule based approach to sentiment analysis at romip’11 slides
Rule based approach to sentiment analysis at romip’11 slidesRule based approach to sentiment analysis at romip’11 slides
Rule based approach to sentiment analysis at romip’11 slidesDmitry Kan
 
Linguistic component Tokenizer for the Russian language
Linguistic component Tokenizer for the Russian languageLinguistic component Tokenizer for the Russian language
Linguistic component Tokenizer for the Russian languageDmitry Kan
 
Semantic Analysis: theory, applications and use cases
Semantic Analysis: theory, applications and use casesSemantic Analysis: theory, applications and use cases
Semantic Analysis: theory, applications and use casesDmitry Kan
 

Destaque (20)

Presentation solr 10 Aout 2011 (french)
Presentation solr 10 Aout 2011 (french)Presentation solr 10 Aout 2011 (french)
Presentation solr 10 Aout 2011 (french)
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoop
 
Semantic feature machine translation system
Semantic feature machine translation systemSemantic feature machine translation system
Semantic feature machine translation system
 
Automatic Build Of Semantic Translational Dictionary
Automatic Build Of Semantic Translational DictionaryAutomatic Build Of Semantic Translational Dictionary
Automatic Build Of Semantic Translational Dictionary
 
Machine translation course program (in English)
Machine translation course program (in English)Machine translation course program (in English)
Machine translation course program (in English)
 
Lucene revolution eu 2013 dublin writeup
Lucene revolution eu 2013 dublin writeupLucene revolution eu 2013 dublin writeup
Lucene revolution eu 2013 dublin writeup
 
Social spam detection by SemanticAnalyzer Group
Social spam detection by SemanticAnalyzer GroupSocial spam detection by SemanticAnalyzer Group
Social spam detection by SemanticAnalyzer Group
 
Introduction To Machine Translation 1
Introduction To Machine Translation 1Introduction To Machine Translation 1
Introduction To Machine Translation 1
 
Solr onfitnesse learningfromberlinbuzzwords
Solr onfitnesse learningfromberlinbuzzwordsSolr onfitnesse learningfromberlinbuzzwords
Solr onfitnesse learningfromberlinbuzzwords
 
Starget sentiment analyzer for English
Starget sentiment analyzer for EnglishStarget sentiment analyzer for English
Starget sentiment analyzer for English
 
Linguistic component Sentiment Analyzer for the Russian language
Linguistic component Sentiment Analyzer for the Russian languageLinguistic component Sentiment Analyzer for the Russian language
Linguistic component Sentiment Analyzer for the Russian language
 
Linguistic component Lemmatizer for the Russian language
Linguistic component Lemmatizer for the Russian languageLinguistic component Lemmatizer for the Russian language
Linguistic component Lemmatizer for the Russian language
 
MTEngine: Semantic-level Crowdsourced Machine Translation
MTEngine: Semantic-level Crowdsourced Machine TranslationMTEngine: Semantic-level Crowdsourced Machine Translation
MTEngine: Semantic-level Crowdsourced Machine Translation
 
Introduction To Machine Translation
Introduction To Machine TranslationIntroduction To Machine Translation
Introduction To Machine Translation
 
Big Data Computing Architecture
Big Data Computing ArchitectureBig Data Computing Architecture
Big Data Computing Architecture
 
Rule based approach to sentiment analysis at ROMIP 2011
Rule based approach to sentiment analysis at ROMIP 2011Rule based approach to sentiment analysis at ROMIP 2011
Rule based approach to sentiment analysis at ROMIP 2011
 
Poster: Method for an automatic generation of a semantic-level contextual tra...
Poster: Method for an automatic generation of a semantic-level contextual tra...Poster: Method for an automatic generation of a semantic-level contextual tra...
Poster: Method for an automatic generation of a semantic-level contextual tra...
 
Rule based approach to sentiment analysis at romip’11 slides
Rule based approach to sentiment analysis at romip’11 slidesRule based approach to sentiment analysis at romip’11 slides
Rule based approach to sentiment analysis at romip’11 slides
 
Linguistic component Tokenizer for the Russian language
Linguistic component Tokenizer for the Russian languageLinguistic component Tokenizer for the Russian language
Linguistic component Tokenizer for the Russian language
 
Semantic Analysis: theory, applications and use cases
Semantic Analysis: theory, applications and use casesSemantic Analysis: theory, applications and use cases
Semantic Analysis: theory, applications and use cases
 

Semelhante a NoSQL, Apache SOLR and Apache Hadoop

An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.Jurriaan Persyn
 
Drupal and Apache Stanbol
Drupal and Apache StanbolDrupal and Apache Stanbol
Drupal and Apache StanbolAlkuvoima
 
New Persistence Features in Spring Roo 1.1
New Persistence Features in Spring Roo 1.1New Persistence Features in Spring Roo 1.1
New Persistence Features in Spring Roo 1.1Stefan Schmidt
 
Introducing Hibernate OGM: porting JPA applications to NoSQL, Sanne Grinovero...
Introducing Hibernate OGM: porting JPA applications to NoSQL, Sanne Grinovero...Introducing Hibernate OGM: porting JPA applications to NoSQL, Sanne Grinovero...
Introducing Hibernate OGM: porting JPA applications to NoSQL, Sanne Grinovero...OpenBlend society
 
Practical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrPractical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrJake Mannix
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkJake Mannix
 
Scaling with swagger
Scaling with swaggerScaling with swagger
Scaling with swaggerTony Tam
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrRahul Jain
 
Search On Hadoop Frontier Meetup
Search On Hadoop Frontier MeetupSearch On Hadoop Frontier Meetup
Search On Hadoop Frontier Meetupgregchanan
 
Building APIs in an easy way using API Platform
Building APIs in an easy way using API PlatformBuilding APIs in an easy way using API Platform
Building APIs in an easy way using API PlatformAntonio Peric-Mazar
 
CosmosDB for DBAs & Developers
CosmosDB for DBAs & DevelopersCosmosDB for DBAs & Developers
CosmosDB for DBAs & DevelopersNiko Neugebauer
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillMapR Technologies
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and SparkLucidworks
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platformTommaso Teofili
 
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16MLconf
 
Rapid API Development ArangoDB Foxx
Rapid API Development ArangoDB FoxxRapid API Development ArangoDB Foxx
Rapid API Development ArangoDB FoxxMichael Hackstein
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h basehdhappy001
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBJustin Smestad
 

Semelhante a NoSQL, Apache SOLR and Apache Hadoop (20)

An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.
 
Solr
SolrSolr
Solr
 
Drupal and Apache Stanbol
Drupal and Apache StanbolDrupal and Apache Stanbol
Drupal and Apache Stanbol
 
New Persistence Features in Spring Roo 1.1
New Persistence Features in Spring Roo 1.1New Persistence Features in Spring Roo 1.1
New Persistence Features in Spring Roo 1.1
 
Elasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetupElasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetup
 
Introducing Hibernate OGM: porting JPA applications to NoSQL, Sanne Grinovero...
Introducing Hibernate OGM: porting JPA applications to NoSQL, Sanne Grinovero...Introducing Hibernate OGM: porting JPA applications to NoSQL, Sanne Grinovero...
Introducing Hibernate OGM: porting JPA applications to NoSQL, Sanne Grinovero...
 
Practical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrPractical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+Solr
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and Spark
 
Scaling with swagger
Scaling with swaggerScaling with swagger
Scaling with swagger
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
 
Search On Hadoop Frontier Meetup
Search On Hadoop Frontier MeetupSearch On Hadoop Frontier Meetup
Search On Hadoop Frontier Meetup
 
Building APIs in an easy way using API Platform
Building APIs in an easy way using API PlatformBuilding APIs in an easy way using API Platform
Building APIs in an easy way using API Platform
 
CosmosDB for DBAs & Developers
CosmosDB for DBAs & DevelopersCosmosDB for DBAs & Developers
CosmosDB for DBAs & Developers
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache Drill
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and Spark
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platform
 
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16
 
Rapid API Development ArangoDB Foxx
Rapid API Development ArangoDB FoxxRapid API Development ArangoDB Foxx
Rapid API Development ArangoDB Foxx
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h base
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 

NoSQL, Apache SOLR and Apache Hadoop

  • 1. NoSQL: Apache SOLR, Apache Hadoop. By Dmitry Kan for NerdCamp, April 23, 2011 (dmitry.kan@gmail.com)
  • 3.
      •The term NoSQL was coined in 1998 by Carlo Strozzi, who notes that the current NoSQL movement "departs from the relational model altogether; it should therefore have been called more appropriately 'NoREL', or something to that effect." (Wikipedia)
      •NoSQL = Not Only SQL
      •Companies: Facebook, Twitter, Digg, Amazon, LinkedIn and Google
      •Data storage: billions of gigabytes (GB) of data
      •Interconnected data: hyperlinks, blog pingbacks, social networks
      •Complex data structures: hierarchical, nested data structures are handled easily (they take multiple relational tables in SQL)
      •Performance: the more data a SQL database holds, the more likely its performance is to degrade
      •NoSQL is not:
          •… SQL, and not relational
          •… a replacement for SQL, but a complement to it
      •NoSQL has no fixed schema and no joins
      •NoSQL does not "scale up" (RDBMS, vertical scaling), but rather "scales out" (spreading the load over many commodity systems): horizontal scaling
  • 4. NoSQL Categories
      •Key-value stores: a big hash table with caching mechanisms
      •Column family stores: keys point to multiple columns (Google's BigTable)
      •Document databases: documents are collections of other key-value collections
      •Graph databases: nodes, relationships between nodes, and node properties
    Major NoSQL players
      •Dynamo: Amazon.com, key-value, used in Amazon S3 (Simple Storage Service)
      •Cassandra: open-sourced by Facebook, column-oriented NoSQL DB
      •BigTable: Google's proprietary column-oriented DB (App Engine)
      •CouchDB: open-source document-oriented NoSQL DB (as is MongoDB)
      •Neo4j: open-source graph DB
    Querying NoSQL DBs:
      •Data model specific
      •RESTful interfaces or query APIs
      •SPARQL: declarative query specification for graph DBs
  • 5. SPARQL Protocol and RDF Query Language (courtesy of about.com and IBM)
    Example of retrieving the URL of a blogger:

      PREFIX foaf: <http://xmlns.com/foaf/0.1/>
      SELECT ?url
      FROM <bloggers.rdf>
      WHERE {
        ?contributor foaf:name "Jon Foobar" .
        ?contributor foaf:weblog ?url .
      }

    stats!
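To make the two triple patterns in the query above concrete, here is a minimal sketch of the same lookup over an in-memory list of (subject, predicate, object) tuples. The data, the example.org URLs and the function name `weblogs_of` are hypothetical; a real SPARQL engine does far more than this.

```python
# Hypothetical in-memory triple store; all data below is invented for illustration.
FOAF = "http://xmlns.com/foaf/0.1/"

triples = [
    ("_:a", FOAF + "name", "Jon Foobar"),
    ("_:a", FOAF + "weblog", "http://example.org/jon/blog"),
    ("_:b", FOAF + "name", "Jane Doe"),
    ("_:b", FOAF + "weblog", "http://example.org/jane/blog"),
]


def weblogs_of(name, store):
    """Return the weblog URLs of every subject whose foaf:name equals `name`."""
    # First pattern: ?contributor foaf:name "<name>"
    contributors = {s for (s, p, o) in store if p == FOAF + "name" and o == name}
    # Second pattern: ?contributor foaf:weblog ?url, joined on ?contributor
    return [o for (s, p, o) in store if p == FOAF + "weblog" and s in contributors]


print(weblogs_of("Jon Foobar", triples))
```

The shared variable `?contributor` is what joins the two patterns; in the sketch that join is the `s in contributors` check.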
  • 6. Some stats from Information Week, via about.com (2010):
      •44% of business IT professionals had not heard of NoSQL
      •1% consider NoSQL a strategic direction
    Some stats from NerdCamp (April 2011):
      •10% had heard of and used NoSQL
      •Many more people know about the cloud, which may increasingly become a driving platform behind NoSQL
    Does the world of NoSQL have enough mass to appeal to IT now?
  • 7. "Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project."
    Created by Yonik Seeley at CNET
    Features:
      •Full-text search
      •Hit highlighting
      •Faceted search (dynamic clustering)
      •DB integration
      •Rich doc handling
      •Geospatial search
      •Distributed search
      •Replication
      •REST-like HTTP/XML & JSON APIs
    Resources (links and books):
      •http://lucene.apache.org/solr/
      •http://lucene.apache.org/solr/tutorial.html
      •http://lucene.apache.org/java/docs/index.html
  • 10. Overview of current state, April 2011
      •Current version: Apache Solr 3.1 (March 31, 2011)
      •License: ASL 2.0
      •Features:
          •Faceted navigation
          •Hit highlighting
          •GEO search: filter and sort by distance
          •Spellcheck and auto suggest
          •Advanced ranking and sorting
          •Distributed and replicated search
          •Structured / unstructured search
          •Rich plugin architecture, extensible
      •Operating system support: all with a Java VM, including Linux (all versions), Windows (all versions), MacOS (all versions), Unix variants
      •App-server support: Apache Tomcat, Jetty, Resin, WebLogic™, WebSphere™, GlassFish, dmServer™, JBoss™ and many more
      •Java version requirement: Java JDK 1.5 or later
      •Client API support: Java, .NET, PHP, Python, Ruby (on Rails), C++, XML/HTTP, JSON/HTTP ++
  • 11. Faceted search
      •A technique for refining search results
      •Concept composition:
          •Article + in English + about nerdcamp
          •Finnish rap + < 1 minute + released in 2001
      •Types:
          •Standard facets (list of facets with values)
          •Hierarchical facet values (taxonomy of facet values)
          •Range / query facets: by date, by price, by alphabet, by interval
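As an illustration of standard facets and concept composition, the sketch below counts facet values over a small in-memory result set and then narrows it by the selected values. The documents, field names and helper functions are hypothetical and only mimic the idea; this is not how Solr computes facets internally.

```python
from collections import Counter

# Hypothetical documents; field names and values are invented for illustration.
docs = [
    {"type": "article", "lang": "en", "topic": "nerdcamp"},
    {"type": "article", "lang": "fi", "topic": "nerdcamp"},
    {"type": "article", "lang": "en", "topic": "solr"},
    {"type": "song", "lang": "fi", "topic": "rap"},
]


def facet_counts(results, field):
    """Count how many results carry each value of `field` (a standard facet)."""
    return Counter(d[field] for d in results if field in d)


def refine(results, **selected):
    """Keep only the results that match every selected facet value."""
    return [d for d in results if all(d.get(f) == v for f, v in selected.items())]


# Facet counts shown next to the result list.
print(facet_counts(docs, "lang"))

# Concept composition: article + in English + about nerdcamp.
print(refine(docs, type="article", lang="en", topic="nerdcamp"))
```

Each refinement step shrinks the result set, and the facet counts are then recomputed over what is left, which is what lets a user drill down one facet at a time.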
  • 12. Spatial Search
    Combines location data with text data
      •Represent spatial data in the index
      •Filter by some spatial concept such as a bounding box or other shape
      •Sort by distance
      •Score/boost by distance
      •Example fields:
          <field name="store">45.17614,-93.87341</field> <!-- Buffalo store -->
          <field name="store">40.7143,-74.006</field> <!-- NYC store -->
          <field name="store">37.7752,-122.4232</field> <!-- San Francisco store -->
      •bbox: bounding box filter (the bbox is a range of lats and lons that encompasses the circle of radius d)
      •geodist: the distance function
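The two ideas on this slide, a bounding box that encloses a circle of radius d and a distance function used for sorting, can be sketched in plain Python. The store coordinates come from the slide; the query point, the 50 km radius, the Earth radius constant and all function names are assumptions of this sketch, which also ignores the poles and the antimeridian.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch

# Store coordinates as given on the slide (lat, lon).
stores = {
    "Buffalo": (45.17614, -93.87341),
    "NYC": (40.7143, -74.006),
    "San Francisco": (37.7752, -122.4232),
}


def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def bounding_box(center, d_km):
    """Lat/lon ranges enclosing the circle of radius d_km around `center`."""
    lat, lon = center
    r = d_km / EARTH_RADIUS_KM  # angular radius in radians
    dlat = math.degrees(r)
    dlon = math.degrees(math.asin(math.sin(r) / math.cos(math.radians(lat))))
    return (lat - dlat, lat + dlat), (lon - dlon, lon + dlon)


def in_box(point, box):
    (lat_lo, lat_hi), (lon_lo, lon_hi) = box
    return lat_lo <= point[0] <= lat_hi and lon_lo <= point[1] <= lon_hi


user = (40.73, -73.99)  # hypothetical query point
box = bounding_box(user, 50)
nearby = [name for name, pos in stores.items() if in_box(pos, box)]
by_distance = sorted(stores, key=lambda name: haversine_km(user, stores[name]))
print(nearby, by_distance)
```

The box test is cheap and can discard most candidates first; the exact distance is then needed only for the survivors, for sorting or for boosting.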
  • 14. Spellcheck and autosuggest
    Spellcheck:
      •Query suggestion for a misspelled query term

      http://localhost:8983/solr/spell?q=hell+ultrashar&spellcheck=true&spellcheck.collate=true&spellcheck.build=true

      <lst name="spellcheck">
        <lst name="suggestions">
          <lst name="hell">
            <int name="numFound">1</int>
            <int name="startOffset">0</int>
            <int name="endOffset">4</int>
            <arr name="suggestion">
              <str>dell</str>
            </arr>
          </lst>
          <lst name="ultrashar">
            <int name="numFound">1</int>
            <int name="startOffset">5</int>
            <int name="endOffset">14</int>
            <arr name="suggestion">
              <str>ultrasharp</str>
            </arr>
          </lst>
          <str name="collation">dell ultrasharp</str>
        </lst>
      </lst>

    Autosuggest: example with Solr and jQuery
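A client usually wants only the per-term suggestions and the collated query from a response like the one above. The sketch below parses that XML fragment with Python's standard library; the function name `parse_spellcheck` is hypothetical, and the fragment is embedded as a string rather than fetched from a running Solr instance.

```python
import xml.etree.ElementTree as ET

# The spellcheck section of the response shown on the slide.
RESPONSE = """<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="hell">
      <int name="numFound">1</int>
      <int name="startOffset">0</int>
      <int name="endOffset">4</int>
      <arr name="suggestion"><str>dell</str></arr>
    </lst>
    <lst name="ultrashar">
      <int name="numFound">1</int>
      <int name="startOffset">5</int>
      <int name="endOffset">14</int>
      <arr name="suggestion"><str>ultrasharp</str></arr>
    </lst>
    <str name="collation">dell ultrasharp</str>
  </lst>
</lst>"""


def parse_spellcheck(xml_text):
    """Return ({misspelled term: [suggestions]}, collated query or None)."""
    root = ET.fromstring(xml_text)
    node = root.find("lst[@name='suggestions']")
    suggestions = {}
    for term in node.findall("lst"):
        words = [s.text for s in term.findall("arr[@name='suggestion']/str")]
        suggestions[term.get("name")] = words
    collation = node.findtext("str[@name='collation']")
    return suggestions, collation


suggestions, collation = parse_spellcheck(RESPONSE)
print(suggestions, collation)
```

The collation is the whole query rewritten with the top suggestion for each term, so it is the natural value to show in a "did you mean" link.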
  • 15. Advanced sorting, ranking and searching
      •sort=score+asc
      •sort=Author+desc,score+desc
      •Boosting single documents
      •Term Frequency (tf)
      •Inverse Document Frequency (idf)
      •Co-ordination Factor (coord): the more of the query terms match, the greater the score
      •Field Length (fieldNorm): the shorter the matching field is in number of indexed terms, the greater the document's score
      •AND, OR, NOT, NEAR, fuzzy search
      •Smashing~0.7 yields more results than just Smashing
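To show how the four factors on this slide interact, here is a simplified scoring sketch modelled on Lucene's classic TF-IDF similarity: tf is the square root of the term frequency, idf is 1 + ln(numDocs / (docFreq + 1)), fieldNorm is 1 / sqrt(number of terms in the field), and coord is the fraction of query terms that matched. Query normalization and boosts are deliberately left out, and the corpus is invented, so treat the numbers as illustrative rather than as what Solr would report.

```python
import math


def score(query_terms, doc_terms, corpus):
    """Simplified score: coord * sum over matched terms of tf * idf^2 * fieldNorm."""
    num_docs = len(corpus)
    field_norm = 1.0 / math.sqrt(len(doc_terms))  # shorter fields score higher
    matched, total = 0, 0.0
    for term in query_terms:
        freq = doc_terms.count(term)
        if freq == 0:
            continue
        matched += 1
        doc_freq = sum(1 for d in corpus if term in d)
        tf = math.sqrt(freq)
        idf = 1.0 + math.log(num_docs / (doc_freq + 1))  # rarer terms weigh more
        total += tf * idf * idf * field_norm
    coord = matched / len(query_terms)  # reward matching more of the query
    return coord * total


# Hypothetical corpus of already-analyzed documents.
corpus = [
    "solr search".split(),
    "solr is a fast search platform".split(),
    "hadoop batch indexing".split(),
    "solr solr faceted navigation".split(),
]
query = ["solr", "search"]
scores = [score(query, doc, corpus) for doc in corpus]
print(scores)
```

With this corpus the two-term document that matches both query terms ranks first, the longer document matching both comes second, and the document that repeats "solr" but lacks "search" is penalized by coord.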
  • 16. Distributed and replicated search
    Before doing this:
      •Consider vertical scaling (a faster and better machine)
      •Rethink the data model (what data goes to which Solr index)
      •Remove logging on updates (and / or searches)
      •Redesign your index: make as many fields as your use cases allow non-indexed and non-stored
      •Check your Internet connection
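  • 17. Extensibility
    Plugins:
      •Query parser: extend LuceneQParserPlugin

      import org.apache.solr.common.params.SolrParams;
      import org.apache.solr.request.SolrQueryRequest;
      import org.apache.solr.search.LuceneQParserPlugin;
      import org.apache.solr.search.QParser;

      public class NerdCampQParserPlugin extends LuceneQParserPlugin {
          @Override
          public QParser createParser(String qstr, SolrParams localParams,
                                      SolrParams params, SolrQueryRequest req) {
              // Placeholder: delegate to the stock Lucene parser; custom parsing goes here.
              return super.createParser(qstr, localParams, params, req);
          }
      }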
  • 18. SOLR I/O
      •Input: Nutch (crawler), CSV, XML, DataImportHandlers, DB import, Apache Tika (rich document import, such as PDF), your format
      •Output: xml, json, python, javabin, csv…, your format
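For the XML input path, Solr accepts update messages of the form `<add><doc><field name="...">...</field></doc></add>`. The sketch below builds such a message from Python dictionaries; the field names and the helper `to_solr_add_xml` are hypothetical, and actually posting the result to an update handler is left out.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(docs):
    """Serialize a list of dicts into a Solr XML <add> update message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # A list value becomes one <field> element per item (multi-valued field).
            values = value if isinstance(value, list) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)
    return ET.tostring(add, encoding="unicode")


# Hypothetical document; the field names must exist in your own schema.
message = to_solr_add_xml([{"id": "1", "title": "NerdCamp"}])
print(message)
```

Building the message with an XML library rather than string concatenation means characters such as `&` and `<` in field values are escaped automatically.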
  • 19. SOLR Processing Pipeline
      •On each step, a document gets transformed
      •Stop word removal
      •Stemming
      •(Smart) tokenization
      •N-grams (letter level and word level)
      •Regular expressions
      •Lowercasing
      •Reversed wildcard
      •Duplicate removal
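The pipeline idea, where each step takes the token stream and hands a transformed one to the next step, can be sketched as a chain of small functions. The stop word list, the naive suffix-stripping stemmer and all function names are simplifications invented for this example; Solr's real analyzers and filters are configured in the schema and behave more carefully.

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "of", "and"}  # tiny illustrative list


def tokenize(text):
    """Split text into word tokens."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def stem(tokens):
    """Naive suffix stripping; a stand-in for a real stemmer such as Porter."""
    out = []
    for t in tokens:
        for suffix in ("ing", "es", "s"):
            if t.endswith(suffix) and len(t) - len(suffix) >= 3:
                t = t[: -len(suffix)]
                break
        out.append(t)
    return out


def char_ngrams(token, n):
    """Letter-level n-grams of a single token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def analyze(text):
    """Run the steps in order; each one transforms the token stream."""
    tokens = tokenize(text)
    for step in (lowercase, remove_stop_words, stem):
        tokens = step(tokens)
    return tokens


print(analyze("The Indexing of Documents is FAST"))
print(char_ngrams("solr", 2))
```

Order matters: lowercasing runs before stop word removal here so that "The" is recognized as a stop word, and the same chain has to be applied to queries so that query terms match the indexed terms.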
  • 20. Solr on the cloud
      •Hadoop: MapReduce
      •ZooKeeper: run at least 3 ZooKeepers to have 1-2 managing your zoo
      •Batch indexing, no realtime search yet
      •Hadoop vital components:
          •Core and API
          •MapReduce: computation model
          •HDFS I/O
          •ZooKeeper
          •Pig (adds a level of abstraction for processing large datasets)
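The MapReduce computation model behind batch indexing can be illustrated on one machine: a map step emits (word, document id) pairs, a shuffle step groups them by word, and a reduce step turns each group into a posting list. This is a single-process sketch with invented documents and function names; it uses no Hadoop API, and on a cluster the framework distributes the same three phases across many nodes.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit one (word, doc_id) pair per word occurrence."""
    for word in text.lower().split():
        yield word, doc_id


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, doc_ids):
    """Reduce: collapse a word's doc ids into a sorted, de-duplicated posting list."""
    return word, sorted(set(doc_ids))


def build_index(docs):
    pairs = (pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text))
    return dict(reduce_phase(word, ids) for word, ids in shuffle(pairs).items())


# Hypothetical documents keyed by id.
docs = {1: "solr search", 2: "hadoop batch search", 3: "solr solr"}
index = build_index(docs)
print(index)
```

Because each map call looks at one document and each reduce call at one word, both phases parallelize naturally, which is why the model suits batch indexing rather than realtime search.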
  • 21. Solr on the cloud: does it shine? Yes, but not fully
  • 22-23. References
      [1] Tim Perdue: NoSQL: An Overview of NoSQL Databases, About.com Guide Sarah Pidcock (2011-01-31). http://bit.ly/fFQOYI
      [2] "Dynamo: Amazon's Highly Available Key-value Store". http://www.cs.uwaterloo.ca/: WATERLOO. p. 2/22. Retrieved 2011-04-05. "Dynamo: a highly available and scalable distributed data store"
      [3] http://cassandra.apache.org/
      [4] http://labs.google.com/papers/bigtable.html
      [5] http://aws.amazon.com/ (look for SimpleDB)
      [6] http://couchdb.apache.org/
      [7] http://neo4j.org/
      [8] Information Week: Surprise: 44% Of Business IT Pros Never Heard Of NoSQL. http://bit.ly/go5ios
      [9] http://drupal.org/
      [10] Mark Miller: Scaling Lucene and Solr // Lucid Imagination
      [11] http://wiki.apache.org/solr/SpatialSearch
      [12] http://dmitrykan.blogspot.com/2011/01/solr-speed-up-batch-posting.html
      [13] http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
      [14] Using Nutch with SOLR, http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
      [15] http://tika.apache.org/
      [16] http://lucene.apache.org/solr/