4. Topics
• Background
  – Apache Mahout
  – Apache Solr and Lucene
• Recommendations with Mahout
  – Collaborative Filtering
• Discovery with Solr and Mahout
• Discussion
Copyright Lucid Imagination CONFIDENTIAL | 4
5. Apache Lucene in a Nutshell
• http://lucene.apache.org/java
• Java-based Application Programming Interface (API) for adding search and indexing functionality to applications
• Fast and efficient scoring and indexing algorithms
• Lots of contributions to make common tasks easier:
  – Highlighting, spatial, Query Parsers, Benchmarking tools, etc.
• Most widely deployed search library on the planet
6. Apache Solr in a Nutshell
• http://lucene.apache.org/solr
• Lucene-based Search Server + other features and functionality
• Access Lucene over HTTP:
  – Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
• Most programming tasks in Lucene are taken care of in Solr
• Faceting (guided navigation, filters, etc.)
• Replication and distributed search support
• Lucene Best Practices
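
Because Solr is reached over HTTP, a faceted search is just a URL with query parameters. A minimal sketch, assuming a local Solr at `http://localhost:8983/solr/select` and a field named `cat` to facet on (both are illustrative assumptions, as is the helper name `solr_facet_url`); it only builds the URL and does not contact a server:

```python
from urllib.parse import urlencode


def solr_facet_url(base_url, query, facet_field):
    """Build a Solr search URL that asks for JSON output and facet counts."""
    params = [
        ("q", query),                 # the user's search terms
        ("wt", "json"),               # response format
        ("facet", "true"),            # turn faceting on
        ("facet.field", facet_field), # field whose values become facet counts
    ]
    return base_url + "?" + urlencode(params)


# Assumed host, handler path, and field name; adjust to the actual deployment.
url = solr_facet_url("http://localhost:8983/solr/select", "ipod", "cat")
print(url)
```

Any HTTP client in any of the languages listed above can issue the same request, which is the point of the bullet.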
7. Apache Mahout in a Nutshell
http://dictionary.reference.com/browse/mahout
• An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
  – http://mahout.apache.org
• The Three C’s:
  – Collaborative Filtering (recommenders)
  – Clustering
  – Classification
• Others:
  – Frequent Item Mining
  – Primitive collections
  – Math stuff
9. Recommenders
• Collaborative Filtering (CF)
  – Provide recommendations solely based on preferences expressed between users and items
  – “People who watched this also watched that”
• Content-based Recommendations (CBR)
  – Provide recommendations based on the attributes of the items and user profile
  – ‘Modern Family’ is a sitcom, Bob likes sitcoms
    • => Suggest Modern Family to Bob
• Mahout geared towards CF, can be extended to do CBR
  – Classification can also be used for CBR
• Aside: search engines can also solve these problems
10. To Rate or Not?
• In many instances, users don’t provide actual ratings
  – Clicks, views, etc.
• Non-Boolean ratings can also often introduce unnecessary noise
  – Even a low rating often has a positive correlation with highly rated items in the real world
• Example: Should we recommend Frankenstein to Bob?

        Dracula   Jane Eyre   Frankenstein   Java Programming
Bob     1         4           ???            -
Mary    5         1           4              -
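
One way to see the point is to compare Bob and Mary twice: once using their ratings and once using only the fact that they rated a title at all. The sketch below uses the table's numbers; the choice of Pearson correlation for ratings and the Tanimoto (Jaccard) coefficient for Boolean preferences is an assumption made for illustration, since the slide names no measure.

```python
from math import sqrt

# Ratings from the table; "-" and "???" entries are simply absent.
ratings = {
    "Bob": {"Dracula": 1, "Jane Eyre": 4},
    "Mary": {"Dracula": 5, "Jane Eyre": 1, "Frankenstein": 4},
}


def pearson(a, b):
    """Pearson correlation over the items both users rated."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    xs = [a[i] for i in common]
    ys = [b[i] for i in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def tanimoto(a, b):
    """Overlap of the two users' item sets, ignoring rating values."""
    items_a, items_b = set(a), set(b)
    union = items_a | items_b
    return len(items_a & items_b) / len(union) if union else 0.0


rating_similarity = pearson(ratings["Bob"], ratings["Mary"])
boolean_similarity = tanimoto(ratings["Bob"], ratings["Mary"])
print(rating_similarity, boolean_similarity)
```

With ratings, Bob and Mary look like opposites (correlation of -1). With Boolean preferences they overlap on two of three titles, so Frankenstein becomes a natural suggestion for Bob: both chose to engage with Gothic classics, and neither touched Java Programming.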
11. Collaborative Filtering with Mahout
• Extensive framework for collaborative filtering
• Recommenders
  – User based
  – Item based
  – Slope One
• Online and Offline support
  – Offline can utilize Hadoop

[Figure: user-item preference matrix feeding "Recommendations for User X"]
          Item 1   Item 2   …   Item m
User 1    -        0.5          0.9
User 2    0.1      0.3          -
…
User n    0.8      0.7          0.1
12. User Similarity
What should we recommend for User 1?
[Figure: Users 1-4 linked to Items 1-4]
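
A user-based recommender answers this question by finding the users most like User 1 and suggesting what they have that User 1 lacks. A minimal sketch in plain Python, not Mahout's API: the user-item links are hypothetical (the figure's actual links are not recoverable), and Tanimoto similarity with a neighbourhood of two is an assumed choice.

```python
# Hypothetical Boolean preferences: which items each user has interacted with.
prefs = {
    "User 1": {"Item 1", "Item 2"},
    "User 2": {"Item 1", "Item 2", "Item 3"},
    "User 3": {"Item 2", "Item 4"},
    "User 4": {"Item 4"},
}


def tanimoto(a, b):
    """Shared items divided by all items either user has."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_user_based(target, prefs, k=2):
    """Rank unseen items by the summed similarity of the k nearest users."""
    neighbours = sorted(
        ((tanimoto(prefs[target], items), user)
         for user, items in prefs.items() if user != target),
        reverse=True,
    )[:k]
    scores = {}
    for similarity, user in neighbours:
        for item in prefs[user] - prefs[target]:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; item name breaks ties deterministically.
    return sorted(scores, key=lambda item: (-scores[item], item))


print(recommend_user_based("User 1", prefs))
```

In this made-up data User 2 is the closest neighbour, so Item 3 ranks ahead of Item 4.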
13. Item Similarity
What should we recommend for User 1?
[Figure: Users 1-4 linked to Items 1-4]
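
An item-based recommender turns the comparison around: two items are similar when the same users have them, and User 1 is offered the items most similar to what User 1 already has. A minimal sketch in plain Python, not Mahout's API, with hypothetical user-item links (the figure's actual links are not recoverable) and Tanimoto similarity as an assumed choice.

```python
# Hypothetical Boolean preferences: which items each user has interacted with.
prefs = {
    "User 1": {"Item 1", "Item 2"},
    "User 2": {"Item 1", "Item 2", "Item 3"},
    "User 3": {"Item 2", "Item 4"},
    "User 4": {"Item 4"},
}


def tanimoto(a, b):
    """Shared members divided by all members of either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def users_by_item(prefs):
    """Invert the data: for each item, the set of users who have it."""
    inverted = {}
    for user, items in prefs.items():
        for item in items:
            inverted.setdefault(item, set()).add(user)
    return inverted


def recommend_item_based(target, prefs):
    """Score each unseen item by its similarity to the target's own items."""
    by_item = users_by_item(prefs)
    seen = prefs[target]
    scores = {}
    for candidate, candidate_users in by_item.items():
        if candidate in seen:
            continue
        score = sum(tanimoto(candidate_users, by_item[s]) for s in seen)
        if score > 0:
            scores[candidate] = score
    return sorted(scores, key=lambda item: (-scores[item], item))


print(recommend_item_based("User 1", prefs))
```

Item-item similarities depend only on the data, not on who is asking, so they can be computed ahead of time and reused for every user.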
14. Slope One
User   Item 1   Item 2
A      3.5      2
B      ?        3

User A: 3.5 - 2 = 1.5
Item 1 (User B) = 3 + 1.5 = 4.5
• Intuition: There is a linear relationship between rated items
  – Y = mX + b where m = 1
• Solve for b upfront based on existing ratings: b = (Y - X)
  – Find the average difference in preference value for every pair of items
• Online can be very fast, but requires up front computation and memory
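
The worked example can be reproduced in a few lines. This is a sketch of the weighted Slope One scheme in plain Python, not Mahout's implementation; the function names are illustrative, and weighting each difference by the number of users who rated both items is an assumption that makes no difference with a single co-rating user.

```python
from collections import defaultdict

# The two users from the table; User B has not rated Item 1.
ratings = {
    "A": {"Item 1": 3.5, "Item 2": 2.0},
    "B": {"Item 2": 3.0},
}


def average_diffs(ratings):
    """Up-front step: average rating difference for every ordered item pair."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user_ratings in ratings.values():
        for i, rating_i in user_ratings.items():
            for j, rating_j in user_ratings.items():
                if i != j:
                    totals[(i, j)] += rating_i - rating_j
                    counts[(i, j)] += 1
    diffs = {pair: totals[pair] / counts[pair] for pair in totals}
    return diffs, dict(counts)


def predict(user, item, ratings, diffs, counts):
    """Online step: shift each of the user's ratings by the stored difference."""
    numerator, denominator = 0.0, 0
    for other, rating in ratings[user].items():
        pair = (item, other)
        if pair in diffs:
            numerator += (rating + diffs[pair]) * counts[pair]
            denominator += counts[pair]
    return numerator / denominator if denominator else None


diffs, counts = average_diffs(ratings)
prediction = predict("B", "Item 1", ratings, diffs, counts)
print(prediction)
```

The split mirrors the last bullet: `average_diffs` is the memory-hungry precomputation over every item pair, while `predict` is a cheap lookup and sum at request time.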
15. Online and Offline Recommendations
• Online
  – Predates Hadoop
  – Designed to run on a single node
    • Matrix size of ~100M interactions
  – API for integrating with your application
• Offline
  – Hadoop based
  – Designed to run on large cluster
  – Several approaches:
    • RecommenderJob, ItemSimilarityJob, ParallelALSFactorizationJob
18. Discovery with Solr
• Goals:
  – Guide users to results without having to guess at keywords
  – Encourage serendipity
  – Never show empty results
• Out of the Box:
  – Faceting
  – Spell Checking
  – More Like This
  – Clustering (Carrot2)
• Extend
  – Clustering (with Mahout)
  – Frequent Item Mining (with Mahout)
19. Clustering
• Automatically group similar content together to aid users in discovering related items and/or avoiding repetitive content
• Solr has search result clustering
  – Pluggable
  – Default implementation uses Carrot2
• Mahout has Hadoop based large scale clustering
  – K-Means, Minhash, Dirichlet, Canopy, Spectral, etc.
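
To make "group similar content together" concrete, here is K-Means, the first algorithm on the list, in miniature. It is a single-machine sketch in plain Python, not Mahout's Hadoop job; the 2-D points and starting centroids are made up, and real content would first be turned into feature vectors.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:
            break  # assignments have stabilised
        centroids = updated
    return centroids, clusters


# Made-up data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, [(1.0, 1.0), (8.0, 8.0)])
print(clusters)
```

The assignment step is independent per point, which is what lets the same idea spread across a Hadoop cluster for large collections.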
20. Discovery In Action
• Pre-reqs:
  – Apache Ant 1.7.x, Subversion (SVN)
• Command Line 1:
  – svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
  – cd solr-trunk/solr/
  – ant example
  – cd example
  – java -Dsolr.clustering.enabled=true -jar start.jar
• Command Line 2:
  – cd exampledocs; java -jar post.jar *.xml
• http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true
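
The browse URL is easier to read once split into its parts. A small sketch using only Python's standard library; it parses the URL from the slide and does not contact the server:

```python
from urllib.parse import parse_qs, urlsplit

browse_url = "http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true"

parts = urlsplit(browse_url)
# keep_blank_values=True preserves the empty q parameter instead of dropping it.
params = parse_qs(parts.query, keep_blank_values=True)

print(parts.netloc)  # host and port the example server listens on
print(parts.path)    # request handler path
print(params)        # query parameters as a dict of lists
```

The empty `q` means no keywords are supplied, which matches the goal of letting users browse without guessing at search terms.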