Enhance discovery Solr and Mahout

Thinking Lucene
Think Lucid

Enhancing Discovery with Solr and Mahout

Grant Ingersoll
Chief Scientist
Lucid Imagination

CONFIDENTIAL


Evolution

Documents
• Models
• Feature Selection

Content Relationships
• Page Rank, etc.
• Organization

User Interaction
• Clicks
• Ratings/Reviews
• Learning to Rank
• Social Graph

Queries
• Phrases
• NLP

Copyright Lucid Imagination CONFIDENTIAL

Minding the Intersection

[Diagram: the intersection of Search, Analytics, and Discovery]

Topics

• Background
  – Apache Mahout
  – Apache Solr and Lucene
• Recommendations with Mahout
  – Collaborative Filtering
• Discovery with Solr and Mahout
• Discussion

Apache Lucene in a Nutshell

• http://lucene.apache.org/java
• Java-based Application Programming Interface (API) for adding search and indexing functionality to applications
• Fast and efficient scoring and indexing algorithms
• Lots of contributions to make common tasks easier:
  – Highlighting, spatial, Query Parsers, Benchmarking tools, etc.
• Most widely deployed search library on the planet

Apache Solr in a Nutshell

• http://lucene.apache.org/solr
• Lucene-based Search Server + other features and functionality
• Access Lucene over HTTP:
  – Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
• Most programming tasks in Lucene are taken care of in Solr
• Faceting (guided navigation, filters, etc.)
• Replication and distributed search support
• Lucene Best Practices
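
Because Solr is reached over plain HTTP, a client only has to build a URL. The sketch below shows one way that might look. The base URL, the `/select` path layout, the query `ipod`, and the facet field `cat` are assumptions for illustration (they match a default single-core example install, not necessarily yours), and `build_select_url` is a hypothetical helper, not part of any Solr client library.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, facet_field=None, rows=10):
    """Build a Solr select URL; optionally ask for facet counts on one field."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_field:
        # facet=true switches faceting on; facet.field names the field to count.
        params += [("facet", "true"), ("facet.field", facet_field)]
    return f"{base_url}/select?{urlencode(params)}"


# Assumed local example server and example field name.
url = build_select_url("http://localhost:8983/solr", "ipod", facet_field="cat")
print(url)
```

Fetching that URL with any HTTP client would return the matching documents plus facet counts, which is what drives guided navigation.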

Apache Mahout in a Nutshell

http://dictionary.reference.com/browse/mahout

• An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
  – http://mahout.apache.org
• The Three C's:
  – Collaborative Filtering (recommenders)
  – Clustering
  – Classification
• Others:
  – Frequent Item Mining
  – Primitive collections
  – Math stuff

Recommendations with Mahout

Recommenders

• Collaborative Filtering (CF)
  – Provide recommendations solely based on preferences expressed between users and items
  – "People who watched this also watched that"
• Content-based Recommendations (CBR)
  – Provide recommendations based on the attributes of the items and user profile
  – 'Modern Family' is a sitcom, Bob likes sitcoms
    • => Suggest Modern Family to Bob
• Mahout geared towards CF, can be extended to do CBR
  – Classification can also be used for CBR
• Aside: search engines can also solve these problems

To Rate or Not?

• In many instances, users don't provide actual ratings
  – Clicks, views, etc.
• Non-Boolean ratings can also often introduce unnecessary noise
  – Even a low rating often has a positive correlation with highly rated items in the real world
• Example: Should we recommend Frankenstein to Bob?

        Dracula   Jane Eyre   Frankenstein   Java Programming
Bob     1         4           ???            -
Mary    5         1           4              -
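
To make the noise argument concrete, the sketch below compares Bob and Mary twice using the ratings from the table: once with Pearson correlation on the rating values, and once with the Tanimoto (Jaccard) coefficient on Boolean "has any preference" sets. The choice of these two measures is an assumption for illustration; the slide does not say which similarity was intended.

```python
from math import sqrt

# Ratings copied from the table above; unrated cells are simply absent.
ratings = {
    "Bob": {"Dracula": 1, "Jane Eyre": 4},
    "Mary": {"Dracula": 5, "Jane Eyre": 1, "Frankenstein": 4},
}


def pearson(a, b):
    """Pearson correlation over the items both users rated."""
    common = sorted(set(a) & set(b))
    xs = [a[i] for i in common]
    ys = [b[i] for i in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    norm = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / norm if norm else 0.0


def tanimoto(a, b):
    """Boolean similarity: shared items divided by all items either user touched."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


rating_similarity = pearson(ratings["Bob"], ratings["Mary"])
boolean_similarity = tanimoto(ratings["Bob"], ratings["Mary"])
print(rating_similarity, boolean_similarity)
```

On rating values the two users look like opposites, yet they chose mostly the same books. Under the Boolean view they overlap on two of three titles, which would favor suggesting Frankenstein to Bob.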

Collaborative Filtering with Mahout

• Extensive framework for collaborative filtering
• Recommenders
  – User based
  – Item based
  – Slope One
• Online and Offline support
  – Offline can utilize Hadoop

[Figure: user-item preference matrix, producing Recommendations for User X]

          Item 1   Item 2   …   Item m
User 1    -        0.5          0.9
User 2    0.1      0.3          -
…
User n    0.8      0.7          0.1

User Similarity

What should we recommend for User 1?

[Diagram: User 1, User 2, User 3, User 4 and Item 1, Item 2, Item 3, Item 4]
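
A user-based recommender answers this question by finding the users most similar to User 1 and suggesting what they liked that User 1 has not seen. The sketch below is a minimal illustration of that idea. The preference sets are hypothetical (the links in the original diagram did not survive extraction), and the Tanimoto similarity and neighborhood size `k` are assumed choices.

```python
def tanimoto(a, b):
    """Shared items divided by all items either user has."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_user_based(prefs, user, k=2):
    """Score unseen items by the similarity of the k nearest users who have them."""
    mine = prefs[user]
    neighbors = sorted(
        (u for u in prefs if u != user),
        key=lambda u: tanimoto(mine, prefs[u]),
        reverse=True,
    )[:k]
    scores = {}
    for u in neighbors:
        weight = tanimoto(mine, prefs[u])
        for item in prefs[u] - mine:
            scores[item] = scores.get(item, 0.0) + weight
    # Highest score first; ties broken by item name for a stable order.
    return sorted(scores, key=lambda i: (-scores[i], i))


# Hypothetical Boolean preferences, invented for illustration only.
prefs = {
    "User 1": {"Item 1", "Item 2"},
    "User 2": {"Item 1", "Item 2", "Item 3"},
    "User 3": {"Item 2", "Item 4"},
    "User 4": {"Item 4"},
}
recs = recommend_user_based(prefs, "User 1")
print(recs)
```

With this data User 2 is the closest neighbor, so Item 3 ranks ahead of Item 4.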

Item Similarity

What should we recommend for User 1?

[Diagram: User 1, User 2, User 3, User 4 and Item 1, Item 2, Item 3, Item 4]

Slope One

User   Item 1   Item 2
A      3.5      2
B      ?        3

User A: 3.5 – 2 = 1.5
Item 1 (User B) = 3 + 1.5 = 4.5

• Intuition: There is a linear relationship between rated items
  – Y = mX + b where m = 1
• Solve for b upfront based on existing ratings: b = (Y - X)
  – Find the average difference in preference value for every pair of items
• Online can be very fast, but requires up front computation and memory
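
The arithmetic above can be sketched in a few lines. This is a minimal illustration of the weighted Slope One idea using the two-user table from the slide, not Mahout's implementation; the function names are hypothetical.

```python
from collections import defaultdict


def average_diffs(ratings):
    """Precompute b = average(Y - X) for every ordered pair of items."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for prefs in ratings.values():
        for i, ri in prefs.items():
            for j, rj in prefs.items():
                if i != j:
                    sums[(i, j)] += ri - rj
                    counts[(i, j)] += 1
    diffs = {pair: sums[pair] / counts[pair] for pair in sums}
    return diffs, dict(counts)


def predict(ratings, diffs, counts, user, target):
    """Estimate the user's rating for target as rating(j) + diff(target, j),
    averaged over the user's rated items j and weighted by pair support."""
    num, den = 0.0, 0
    for j, rj in ratings[user].items():
        pair = (target, j)
        if pair in diffs:
            num += (rj + diffs[pair]) * counts[pair]
            den += counts[pair]
    return num / den if den else None


ratings = {
    "A": {"Item 1": 3.5, "Item 2": 2.0},
    "B": {"Item 2": 3.0},
}
diffs, counts = average_diffs(ratings)
prediction = predict(ratings, diffs, counts, "B", "Item 1")
print(prediction)
```

The `average_diffs` step is the up-front computation the slide mentions: it holds one entry per item pair, which is where the memory cost comes from. The online `predict` step is then just a few lookups.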

Online and Offline Recommendations

• Online
  – Predates Hadoop
  – Designed to run on a single node
    • Matrix size of ~100M interactions
  – API for integrating with your application
• Offline
  – Hadoop based
  – Designed to run on large cluster
  – Several approaches:
    • RecommenderJob, ItemSimilarityJob, ParallelALSFactorizationJob

RecommenderJob

• Essentially does matrix multiplication using distributed techniques
• $MAHOUT_HOME/bin/examples/asf-email-examples.sh

        101   102   103   104   105        User A        Recs
101     7     2     0     1     3           3.0           30
102     2     8     3     5     2           0             37
103     0     3     3     6     4      X    4.0      =    38
104     1     5     6     4     7           3.0           53
105     3     2     4     7     9           2.0           64
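
The multiplication in the table can be reproduced on a single machine to check the numbers. The sketch below assumes the 5x5 grid is an item-item co-occurrence (or similarity) matrix and that the 0 in the User A column means "no preference yet" for item 102; the slide does not label either explicitly. It illustrates the arithmetic only, not how RecommenderJob distributes it.

```python
# Values copied from the table above.
item_ids = [101, 102, 103, 104, 105]
cooccurrence = [
    [7, 2, 0, 1, 3],
    [2, 8, 3, 5, 2],
    [0, 3, 3, 6, 4],
    [1, 5, 6, 4, 7],
    [3, 2, 4, 7, 9],
]
user_a = [3.0, 0.0, 4.0, 3.0, 2.0]


def multiply(matrix, vector):
    """Plain matrix-vector product: one dot product per matrix row."""
    return [sum(c * v for c, v in zip(row, vector)) for row in matrix]


scores = multiply(cooccurrence, user_a)

# Only items the user has not expressed a preference for are candidates.
candidates = [(s, i) for s, i, p in zip(scores, item_ids, user_a) if p == 0]
best_score, best_item = max(candidates)
print(scores, best_item)
```

Each score is the sum of the user's preferences weighted by how strongly the row item co-occurs with each rated item, so item 102 is the one to recommend here.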

Discovery with Solr

Discovery with Solr

• Goals:
  – Guide users to results without having to guess at keywords
  – Encourage serendipity
  – Never show empty results
• Out of the Box:
  – Faceting
  – Spell Checking
  – More Like This
  – Clustering (Carrot2)
• Extend
  – Clustering (with Mahout)
  – Frequent Item Mining (with Mahout)

Clustering

• Automatically group similar content together to aid users in discovering related items and/or avoiding repetitive content
• Solr has search result clustering
  – Pluggable
  – Default implementation uses Carrot2
• Mahout has Hadoop based large scale clustering
  – K-Means, Minhash, Dirichlet, Canopy, Spectral, etc.

Discovery In Action

• Pre-reqs:
  – Apache Ant 1.7.x, Subversion (SVN)
• Command Line 1:
  – svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
  – cd solr-trunk/solr/
  – ant example
  – cd example
  – java -Dsolr.clustering.enabled=true -jar start.jar
• Command Line 2
  – cd exampledocs; java -jar post.jar *.xml
• http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true

Solr + Mahout

Basics

• Most Mahout tasks are offline
• Solr provides many touch points for integration:
  – ClusteringEngine
    • Clustering results
  – SearchComponent
    • Suggestions: related searches, clusters, MLT, spellchecking
  – UpdateProcessor
    • Classification of documents
  – FunctionQuery

Example: Frequent Itemset Mining

• Discover frequently co-occurring items
• Use Case: Related Searches from Solr Logs
• Hadoop and sequential versions
  – Parallel FP Growth
• Input:
  – <optional document id>TAB<TOKEN1>SPACE<TOKEN2>SPACE
  – Comma, pipe also allowed as delimiters
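
To show what "frequently co-occurring" means on input of this shape, the sketch below parses a few lines in the TAB/SPACE format and counts item pairs that meet a minimum support. It is a brute-force pair counter for illustration, not FP-Growth, and the sample lines are invented; only the space-delimited form of the input is handled.

```python
from collections import Counter
from itertools import combinations


def parse_transaction(line):
    """Drop the optional '<id>TAB' prefix and split the rest on whitespace."""
    _, _, items = line.rpartition("\t")
    return set(items.split())


def frequent_pairs(lines, min_support):
    """Count every unordered token pair per line; keep pairs seen often enough."""
    counts = Counter()
    for line in lines:
        tokens = sorted(parse_transaction(line))
        counts.update(combinations(tokens, 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical query tokens; the third line has no document id.
lines = [
    "1\tsolr faceting",
    "2\tsolr faceting tutorial",
    "solr mahout",
]
pairs = frequent_pairs(lines, min_support=2)
print(pairs)
```

Pair counting grows quadratically with transaction length and says nothing about larger itemsets, which is the gap FP-Growth and its parallel version are meant to close.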

FIM on Solr Query Logs

• Goal:
  – Extract user queries from Solr logs
  – Feed into FIM to generate Related Keyword Searches
• Context:
  – Solr Query logs
  – bin/mahout regexconverter --input $PATH_TO_LOGS --output /tmp/solr/output --regex "(?<=(\?|&)q=).*?(?=&|$)" --overwrite --transformerClass url --formatterClass fpg
  – bin/mahout fpg --input /tmp/solr/output/ -o /tmp/solr/fim/output -k 25 -s 2 --method mapreduce
  – bin/mahout seqdumper --seqFile /tmp/solr2/results/frequentpatterns/part-r-00000
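
The regular expression in the first command can be tried on its own before running the full job. The sketch below applies the same pattern in Python and then URL-decodes the match, roughly what the `url` transformer is presumably there for. The log line is a made-up request URL, and `extract_query` is a hypothetical helper; real Solr log lines may be laid out differently.

```python
import re
from urllib.parse import unquote_plus

# Same pattern as in the regexconverter command: text after "?q=" or "&q=",
# up to the next "&" or the end of the line.
QUERY_PATTERN = re.compile(r"(?<=(\?|&)q=).*?(?=&|$)")


def extract_query(log_line):
    """Return the decoded q parameter from a log line, or None if absent."""
    match = QUERY_PATTERN.search(log_line)
    if match is None:
        return None
    return unquote_plus(match.group(0))


# Hypothetical log line for illustration.
line = "GET /solr/select?q=chris+hostetter&rows=10"
query = extract_query(line)
print(query)
```

Each decoded query becomes one space-delimited transaction, which is the input shape the fpg step expects.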

Output

• Key: Chris: Value:
  ([Chris, Hostetter],870),
  ([Chris],870),
  ([Search, Faceted, Chris, Hostetter, Webcast, Power, Mastering],18),
  ([Search, Faceted, Chris, Hostetter, Webcast, Power],18),
  ([Search, Faceted, Chris, Hostetter],18),
  ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone, QA, Refcard],12),
  ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone],12),
  ([Solr, new, Chris, Hostetter, webcast, along, sponsors],12),
  ([Solr, new, Chris, Hostetter, webcast, along],12),
  ([Solr, new, Chris, Hostetter, webcast],12),
  ([Solr, new, Chris, Hostetter],12)

Resources

• http://lucene.apache.org
• http://mahout.apache.org
• http://manning.com/owen
• http://manning.com/ingersoll
• http://www.lucidimagination.com
• grant@lucidimagination.com
• @gsingers

Appendix

Mahout Overview

[Diagram: layered architecture, top to bottom]

Applications
Examples
Genetic | Freq. Pattern Mining | Classification | Clustering | Recommenders
Utilities/Integration (Lucene/Vectorizer) | Math (Vectors/Matrices/SVD) | Collections (primitives) | Apache Hadoop

See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
