2. Agenda
- Introductions
- What is Solr?
- Main Solr Features and Attributes
- Content, Query, Facet, API, Scalability
- Interface and useful commands
- Live Demo
3. Introduction
- Search has become mission critical for most enterprises
  - Intranet
  - Web presence
  - E-commerce
- Exponential growth of data
- Cost of not finding information
  - Knowledge (sharing)
  - Time
  - Money
- Information black hole
4. What is Solr?
Official definition:
“Solr is an open source enterprise search platform based on the Lucene Java search library, with an HTTP interface using XML, JSON or other formats. It provides hit highlighting, faceted search, caching, replication, a web administration interface and many more features. It runs in a Java servlet container such as Apache Tomcat.”
Source: http://lucene.apache.org/solr
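To make the "HTTP interface" part of the definition concrete, here is a minimal sketch of how a search request URL might be assembled. The host, port, and the core name `example_core` are assumptions for illustration only; `q` and `wt` are standard Solr query parameters. The function only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode

# Assumed location of a local Solr instance; adjust for a real deployment.
SOLR_BASE_URL = "http://localhost:8983/solr"


def build_select_url(core, query, response_format="json"):
    """Build a URL for Solr's /select handler.

    `core` is a hypothetical core name; `query` uses Solr query syntax,
    for example "title:solr" for a field search.
    """
    params = urlencode({"q": query, "wt": response_format})
    return f"{SOLR_BASE_URL}/{core}/select?{params}"


if __name__ == "__main__":
    # Print the URL a client would send as an HTTP GET request.
    print(build_select_url("example_core", "title:solr"))
```

Sending the resulting URL with any HTTP client (a browser, `curl`, or `urllib.request.urlopen`) would return the matching documents in the requested format, assuming a Solr server with such a core is actually running.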
5. What is Solr?
- In 2004, Solr was created by Yonik Seeley at CNET Networks as an in-house project to add search capability for the company website.
- Open-source search engine with no license fees (released under the Apache License 2.0)
- Built on top of the Apache Lucene library, adding enterprise search server features and capabilities
- Web-based application that processes requests and returns responses via HTTP and APIs
6. Why choose Solr?
- Customizable
- High quality and easily modifiable relevancy
- Very fast query and indexing performance
- Open source software is free
- Highly flexible data processing/transformation
- Easy scalability and great performance
- Modern solution architecture based on XML and Java
- Well integrated with the ecosystem around Big Data, such as Hadoop (also Nutch, Tika)
7. Solr’s Main Features
- Full text search
- Field search
- Number and date searching
- Facets
- Spelling assistance: “Did you mean…?”
- Related hits
- Query completion
- Admin GUI
- Data Import Handler
  - Index databases, mails, RSS, XMLs etc.
- Rich document support
  - PDF, MS Office, images etc.
- Replication for high query volume
- Distributed search for large indexes
  - Production systems with 1B+ documents
- Very extensible and customizable
- Embedded in commercial search products from LucidWorks, DataStax, Cloudera, Hortonworks, Amazon CloudSearch and Riak
9. Content
- Out of the box support for JSON
- Solr handles CSV, XML and rich content out of the box, without having to install plugins
10. Indexing and Ranking
- Solr uses an inverted index
- For ranking, Solr uses TF-IDF and Similarity
- Similarity is a combination of the Boolean Model (BM) and the Vector Space Model (VSM)
- Another feature: users can re-rank the results of a query
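The two ideas on this slide, an inverted index and TF-IDF weighting, can be sketched in a few lines. This is a textbook illustration with made-up sample documents, not Lucene's actual scoring code, which adds normalization and boosting factors on top of the basic formula.

```python
import math
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def tf_idf(term, doc_id, docs, index):
    """Textbook TF-IDF: term frequency times log inverse document frequency."""
    tokens = docs[doc_id].lower().split()
    tf = tokens.count(term) / len(tokens)
    df = len(index.get(term, ()))  # number of documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "solr is a search platform",
    2: "lucene is a search library",
    3: "solr builds on lucene",
}
index = build_inverted_index(docs)

if __name__ == "__main__":
    print(sorted(index["solr"]))
    print(tf_idf("platform", 1, docs, index))
    print(tf_idf("search", 1, docs, index))
```

The lookup `index["solr"]` answers "which documents contain this term?" without scanning every document, which is why an inverted index makes full text search fast. A term that appears in fewer documents ("platform") receives a higher weight than a more common one ("search").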
12. Facet
“Faceted search is the dynamic clustering of items or search results into categories that let users drill into search results (or even skip searching entirely) by any value in any field.”
- Navigation/discovery technique
- Tally of docs for each distinct field value
- Parameters
  - &facet=true
  - &facet.field=category
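As a rough illustration of the two parameters above, the sketch below builds the query string for a faceted request and, separately, shows what "a tally of docs for each distinct field value" means using a plain in-memory count. The sample documents are invented, and the local tally only mimics the idea; Solr computes facet counts from its index.

```python
from collections import Counter
from urllib.parse import urlencode


def facet_query_string(query, field):
    """Build the query string for a faceted Solr request on one field."""
    return urlencode({"q": query, "facet": "true", "facet.field": field})


def tally(docs, field):
    """Count documents per distinct value of `field`, skipping docs without it."""
    return Counter(doc[field] for doc in docs if field in doc)


# Hypothetical documents used only to demonstrate the tally.
sample_docs = [
    {"title": "Solr in Action", "category": "book"},
    {"title": "Lucene in Action", "category": "book"},
    {"title": "Search Podcast", "category": "audio"},
    {"title": "Uncategorized note"},
]

if __name__ == "__main__":
    print(facet_query_string("*:*", "category"))
    print(tally(sample_docs, "category"))
```

A user interface would typically render each tallied value as a clickable filter showing its count next to it.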
13. API
- REST API for adding field types and dynamic fields
- Managing Request Handlers through the API
- Improved APIs for managing collections
- Implicit registration of Replication, Real Time Get and Administration Handlers
- Out of the box support for JSON
- Solr handles CSV, XML and rich content out of the box, without having to install plugins
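The first bullet can be illustrated with a sketch of a Schema API request. The `add-field-type` and `add-dynamic-field` commands are part of Solr's Schema API; the base URL, the collection name `example_collection`, and the field names below are assumptions for illustration. The function only prepares the request and does not send it.

```python
import json
import urllib.request


def schema_request(base_url, collection, command, definition):
    """Prepare (but do not send) a POST to a collection's /schema endpoint."""
    body = json.dumps({command: definition}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{collection}/schema",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical field type and dynamic field definitions.
field_type = {"name": "example_text", "class": "solr.TextField"}
dynamic_field = {"name": "*_example", "type": "example_text", "stored": True}

if __name__ == "__main__":
    req = schema_request(
        "http://localhost:8983/solr",  # assumed local Solr instance
        "example_collection",
        "add-dynamic-field",
        dynamic_field,
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the prepared request to `urllib.request.urlopen` would submit the change, assuming a running Solr server whose collection uses a managed (editable) schema.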
14. Scalability
- Architecture goals:
  - More queries per second (qps)
  - Faster query execution
  - Bigger indexes
  - Faster indexing
- Scaling options
  - Multicore
  - Replication
  - Sharding
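Of the scaling options above, sharding is the one that supports bigger indexes: each document is stored on exactly one shard, so the index can grow beyond a single machine. The sketch below shows the general idea with a simple hash-and-modulo rule. This is an assumption chosen for clarity and is not Solr's own document routing logic.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard for a document id using a stable hash (illustrative only)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def partition(doc_ids, num_shards):
    """Split document ids into `num_shards` lists, one list per shard."""
    shards = [[] for _ in range(num_shards)]
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


if __name__ == "__main__":
    # Hypothetical document ids spread over four shards.
    ids = [f"doc-{i}" for i in range(20)]
    for number, members in enumerate(partition(ids, 4)):
        print(number, members)
```

Because the hash is stable, the same document id always lands on the same shard, so updates and deletes reach the right place. A query, by contrast, has to be sent to every shard and the partial results merged, which is what distributed search does.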