This document summarizes a presentation about using Apache Solr to build recommender systems and discover latent relationships in data. It discusses how Solr can index user preferences and transactions to find co-occurrences and make recommendations. Streaming expressions are presented as a way to calculate significance scores to identify meaningful patterns beyond simple counts. Emergent properties like "flarglewharbliness" are used as an example of relationships that exist beyond predefined categories, and the potential for Solr to autonomously discover such latent vocabularies is briefly discussed.
2. Anyone can build a Recsys w/ Solr!
Doug Turnbull
Relevance Consultant, OpenSource Connections
3. I’m now available in book form!
https://www.manning.com/books/relevant-search
Discount code: relsearch (38% off)
http://opensourceconnections.com/about-us/doug-turnbull/
Me The company...
4. How do search engines work?
[Diagram: inverted index for the field Body. term laser -> doc 2, doc 4; term light -> doc 2; term lightsaber -> doc 0. Each posting carries <metadata>.]
The answer can be found in your textbook…
Book Index:
• Topics -> page no
• Very efficient tool – compare to
scanning the whole book!
Lucene uses an index:
• Tokens => document ids:
laser => [2, 4]
light => [2, 5]
lightsaber => [0, 1, 5, 7]
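The token-to-document-id mapping above can be sketched in a few lines of Python. This is a toy in-memory inverted index, not Lucene's actual data structure; the sample documents and the `build_index` helper are made up for illustration and chosen so the postings match the slide.

```python
from collections import defaultdict

def build_index(docs):
    """Map each whitespace-separated token to the sorted ids of docs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            postings[token].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}

# Hypothetical documents (assumption), chosen to reproduce the slide's postings.
docs = {
    0: "lightsaber",
    1: "lightsaber",
    2: "laser light",
    4: "laser",
    5: "light lightsaber",
    7: "lightsaber",
}

index = build_index(docs)
print(index["laser"])       # [2, 4]
print(index["light"])       # [2, 5]
print(index["lightsaber"])  # [0, 1, 5, 7]
```

Looking up a token is a single dictionary access, which is the "book index" shortcut compared with scanning every document.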
5. What's the point?
Solr:
- A general-purpose system for looking up content based on the features that describe it
Tokens aren't really words!
doc0: "I like the bananas"
Analysis: [I] [lik] [banan]
term I:
doc 0
term lik:
doc 0
term banan:
doc 0
Search: "liked banana?"
Analysis: [lik] [banan]
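A rough sketch of the analysis step, assuming a toy analyzer: it splits on word characters, drops the stopword "the", and strips one suffix from a hand-picked list. The `STOPWORDS` and `SUFFIXES` values are invented to reproduce the slide's tokens; a real Solr analysis chain is configured in the schema and behaves differently.

```python
import re

STOPWORDS = {"the"}                     # assumption: just enough for the example
SUFFIXES = ("ed", "as", "e", "a", "s")  # toy suffix list, checked in this order

def analyze(text):
    """Split into word tokens, drop stopwords, strip at most one toy suffix."""
    tokens = []
    for word in re.findall(r"\w+", text):
        if word.lower() in STOPWORDS:
            continue
        for suffix in SUFFIXES:
            # Keep at least two characters of the word after stripping.
            if len(word) > len(suffix) + 1 and word.endswith(suffix):
                word = word[: -len(suffix)]
                break
        tokens.append(word)
    return tokens

print(analyze("I like the bananas"))  # ['I', 'lik', 'banan']
print(analyze("liked banana?"))       # ['lik', 'banan']
```

Because the same analysis runs at index time and at query time, "liked banana?" matches doc0 even though neither word appears in it verbatim.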
6. TF*IDF -- measuring feature weight
term I:
doc 0:
freq: 5
doc 1:
freq: 7
doc 3:
freq: 4
term banan:
doc 0:
freq: 2
"Banana-ness" is pretty special
"I-ness" is not special
doc0, term I:
tf==5
df==3
(raw) TF*IDF = 5/3 = 1.6667
doc0, term banan:
tf==2
df==1
(raw) TF*IDF = 2/1 = 2.0
Search is really distributed feature matching and similarity (text-oriented)
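The raw arithmetic on this slide can be checked with a short sketch. It uses the slide's simplified score (term frequency divided by document frequency); production scoring functions dampen and normalize these values, so treat this only as an illustration. The `raw_tf_idf` helper is a made-up name.

```python
# Postings from the slide: term -> {doc id: term frequency}
postings = {
    "I": {0: 5, 1: 7, 3: 4},
    "banan": {0: 2},
}

def raw_tf_idf(term, doc_id):
    """Raw score as on the slide: term frequency divided by document frequency."""
    docs = postings.get(term, {})
    tf = docs.get(doc_id, 0)
    df = len(docs)  # number of documents that contain the term
    return tf / df if df else 0.0

print(round(raw_tf_idf("I", 0), 4))      # 1.6667
print(round(raw_tf_idf("banan", 0), 4))  # 2.0
```

Even though "I" occurs more often in doc0, the rarer "banan" scores higher, which is the "banana-ness is special" point.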
7. Search often stands in for human interactions
I have a craving for a nice
juicy cut of meat. What
might you recommend?
I have JUST the thing!
9. Modeling arbitrary feature strength
term juicy:
steak:
juiciness: 5
grapefruit:
juiciness: 7
orange:
juiciness: 4
term meaty:
burger:
meatiness: 2
What you want:
{
item: "steak",
juiciness: ["juicy", "juicy", "juicy"],
meatiness: ["meaty"]
}
Use term frequency as feature strength:
{
item: "grapefruit",
juiciness: ["juicy", "juicy", "juicy", "juicy", "juicy"],
meatiness: [""]
}
(remember, Solr doesn't need to store this)
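One way to produce documents like the ones above is to repeat a feature token once per unit of strength, so that term frequency carries the strength. The sketch below assumes a hypothetical `to_solr_doc` helper and only builds the dictionary; it does not talk to Solr.

```python
def to_solr_doc(item, strengths):
    """Repeat each feature token `strength` times so tf encodes the strength.

    `strengths` maps a field name to a (token, strength) pair.
    """
    doc = {"item": item}
    for field, (token, strength) in strengths.items():
        doc[field] = [token] * strength
    return doc

doc = to_solr_doc("steak", {"juiciness": ("juicy", 3), "meatiness": ("meaty", 1)})
print(doc)
# {'item': 'steak', 'juiciness': ['juicy', 'juicy', 'juicy'], 'meatiness': ['meaty']}
```

A strength of zero yields an empty list, so the item simply has no token for that feature.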
10. TF*IDF -- measuring feature weight
term juicy:
doc 0:
freq: 5
doc 1:
freq: 7
doc 3:
freq: 4
term meaty:
doc 0:
freq: 2
"meaty-ness" is pretty special
"juicy-ness" is pretty non-special
doc0, term juicy:
tf==5
df==3
(raw) TF*IDF = 5/3 = 1.6667
doc0, term meaty:
tf==2
df==1
(raw) TF*IDF = 2/1 = 2.0
Search is really distributed feature matching and similarity
11. Requesting something from my grocer
More juicy Less juicy
More meaty Less meaty
q=meaty juicy
Results: 1.
2.
3.
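As a sketch of what `q=meaty juicy` does, the snippet below sums the raw tf/df score over the query terms and sorts the items. The feature strengths are the ones listed on the "Modeling arbitrary feature strength" slide; the `search` helper and the simple additive scoring are assumptions, not Solr's actual ranking function.

```python
# Feature strengths from the "Modeling arbitrary feature strength" slide.
postings = {
    "juicy": {"steak": 5, "grapefruit": 7, "orange": 4},
    "meaty": {"burger": 2},
}

def search(query):
    """Sum raw tf/df over the query terms and rank items by the total."""
    scores = {}
    for term in query.split():
        docs = postings.get(term, {})
        for item, tf in docs.items():
            scores[item] = scores.get(item, 0.0) + tf / len(docs)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

for rank, (item, score) in enumerate(search("meaty juicy"), start=1):
    print(rank, item, round(score, 4))
```

With this data no item has both features; an item indexed with both "meaty" and "juicy" tokens would accumulate both contributions and move up the list.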
12. Recsys also stands in for human interactions
Hi Jane,
Recommend me
something?
Hmm…
<Tom likes limes, what is
similar to limes?>
13. recommendations
Use existing properties
of thing to recommend
similar things
juicy
citrus
More like this for
unstructured data
What features/tokens are
most representative of this
thing?
http://solr.quepid.com/solr/tmdb/select?q={!mlt%20qf=overview}97&fl=title,id,overview (movies like
juicy
citrus
(search)
Here's some ideas...
{
item: "lime",
juiciness: ["juicy", "juicy", "juicy"],
citrusness: ["citrus", "citrus", "citrus"],
meatiness: [""],
partyness: ["party"]
}
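The more-like-this request on this slide can be assembled programmatically. The sketch below only builds the URL with the `{!mlt}` query parser shown above and does not send it; `mlt_url` is a hypothetical helper, and the document id 97 and field names are the ones from the slide's URL.

```python
from urllib.parse import urlencode, quote, urlsplit, parse_qs

def mlt_url(select_url, doc_id, qf, fl):
    """Build a select URL whose q asks the {!mlt} parser for docs like doc_id."""
    params = {"q": "{!mlt qf=%s}%s" % (qf, doc_id), "fl": ",".join(fl)}
    # quote_via=quote encodes spaces as %20, as in the slide's URL.
    return select_url + "?" + urlencode(params, quote_via=quote)

url = mlt_url("http://solr.quepid.com/solr/tmdb/select", "97",
              qf="overview", fl=["title", "id", "overview"])
print(url)

# Decoding the query string recovers the original parameters.
print(parse_qs(urlsplit(url).query))
```

The `qf` parameter names the field whose most representative tokens drive the similarity, here the free-text `overview`.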
14. "Content Based" more-like-these
Use existing properties
of thing to recommend
similar things
juicy
meaty
citrus
Here's some ideas...
Jane knows a few more things that Tom likes...
15. Personalization metadata
Index extra data alongside your
products
{
item: "hamburger",
preferred_by_genders: ["m", …],
preferred_by_ages: ["30_40"]
}
age:30_40
gender:m
Here's some ideas...
Jane knows a few things about Tom
(30 yr old male)
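A toy version of matching Tom's profile against the personalization metadata might look like the following. Only the hamburger document echoes the slide (with its elided values left out); the other items, the `recommend` helper, and the "count the matching tokens" scoring are assumptions for illustration.

```python
# The hamburger fields follow the slide; smoothie and candy are hypothetical.
items = [
    {"item": "hamburger", "preferred_by_genders": ["m"], "preferred_by_ages": ["30_40"]},
    {"item": "smoothie", "preferred_by_genders": ["f"], "preferred_by_ages": ["20_30", "30_40"]},
    {"item": "candy", "preferred_by_genders": ["f"], "preferred_by_ages": ["10_20"]},
]

def recommend(user):
    """Rank items by how many of the user's profile tokens they match."""
    scored = []
    for doc in items:
        score = int(user["gender"] in doc["preferred_by_genders"]) \
              + int(user["age"] in doc["preferred_by_ages"])
        if score:
            scored.append((score, doc["item"]))
    # Highest score first; ties keep their original order.
    return [item for score, item in sorted(scored, key=lambda s: -s[0])]

tom = {"age": "30_40", "gender": "m"}
print(recommend(tom))  # ['hamburger', 'smoothie']
```

In Solr the same idea would be a query over the metadata fields, with the profile tokens treated like any other search terms.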
16. But, Jane's intuition transcends words!
age:30_40
gender:m
Currently we're stuck with predefined labels:
citrus juicy
meaty
We're curating using
known vocabularies
(can we describe everything?)
17. What we like often transcends words
There are emergent properties of our world that don't have names
Relative flarglewharbliness
More flarglewharbily Less flarglewharbily
Diet Coke
18. What's a flarglewharble?
More flarglewharbily Less flarglewharbily
fruit orange lemon banana mentos diet coke
tom X
sue X X X
charlie X X
clare X X
hal x x
Goes together
Diet Coke
19. Can search find the flargles?
q=(flarglewharbliness:very)
term flarglewharble:
diet-coke:
flargleness: 4
mentos:
flargleness: 3
banana:
flargleness: 1
Can we somehow build this?
Diet Coke
20.
person/food orange lemon banana mentos diet coke
tom X X
sue X X X X
charlie X X
clare X X
hal x x X
Goes together
flarglewharble!
Babies often use made-up words based
on emergent patterns in their universe
They are less committed to our
language
21. What's the point?
Collaborative filtering
Latent vocabulary
(the flarglewharbles)
Pure Search
Content-based Recs
Predefined vocabulary
Can Solr discover the latent/emergent vocabularies?
22. Can Solr discover the latent/emergent vocabularies?
Well first let's tell Solr about our users
{
user: "Sue",
foods_bought: ["lemon", "banana", "mentos", "diet coke"]
}
{
user: "Charlie",
foods_bought: ["banana", "mentos", "diet coke"]
}
23. Faceting?
We need a way to look across users and look for patterns
(analyze all the baskets that contain mentos)
q=foods_bought:mentos&facet=true&facet.field=foods_bought
facets:
mentos: 3
diet-coke: 3
banana: 2
Hmm:
- Bananas are globally popular
- Diet-coke is probably what matters
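The facet request above amounts to counting foods across only those baskets that contain mentos. A minimal in-memory sketch follows. Sue's and Charlie's baskets come from the earlier slide; the baskets for Hal, Tom and Clare are assumptions, picked so the counts agree with the facet output shown here.

```python
from collections import Counter

# Sue and Charlie are from the slides; Hal, Tom and Clare are assumed baskets.
baskets = {
    "Sue": ["lemon", "banana", "mentos", "diet coke"],
    "Charlie": ["banana", "mentos", "diet coke"],
    "Hal": ["orange", "mentos", "diet coke"],
    "Tom": ["orange", "banana"],
    "Clare": ["lemon", "banana"],
}

def facet_counts(baskets, containing):
    """Count foods across only the baskets that contain `containing`."""
    counts = Counter()
    for foods in baskets.values():
        if containing in foods:
            counts.update(foods)
    return counts

counts = facet_counts(baskets, "mentos")
print(counts["mentos"], counts["diet coke"], counts["banana"])  # 3 3 2
```

Raw counts alone cannot tell that banana is popular everywhere while diet coke is specific to the mentos baskets.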
24. Counts don't work: importance of significance
q=foods_bought:mentos&facet=true&facet.field=foods_bought
facets:
mentos: 3
diet-coke: 3
banana: 2
Diet Coke:
Global popularity: diet coke (3)
Local popularity: 3
Score: 3/3 = 1
Banana:
Global popularity: banana (4)
Local popularity: 2
Score: 2/4 = 0.5
by-significance:
diet-coke: 1
banana: 0.5
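The by-significance ordering can be reproduced by dividing each local count by its global count. This sketch takes the counts from the slide as plain dictionaries; the `significance` helper is a hypothetical name, and the ratio is the slide's simplified score rather than any built-in Solr statistic.

```python
def significance(local_counts, global_counts):
    """Score each item as local count / global count, highest first."""
    scores = {
        item: local / global_counts[item]
        for item, local in local_counts.items()
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

local_counts = {"diet-coke": 3, "banana": 2}   # within the mentos baskets
global_counts = {"diet-coke": 3, "banana": 4}  # across all baskets

print(significance(local_counts, global_counts))
# [('diet-coke', 1.0), ('banana', 0.5)]
```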
25. Streaming Expressions
One option: we could gather global doc freqs and compare them with the local counts.
/select?q=*:*&facet=true&facet.field=liked_movies
The Terms component is another option… plugins...
But there's a new sheriff in town!
Streaming expressions -- a distributed stream computation system on top of SolrCloud
You must ALWAYS cross the streams!
27. Significance with streaming expr
/stream?expr=scoreNodes(facet(...)...)
scoreNodes(
  select(
    facet(movielens,
      q="liked_movies:2571 OR liked_movies:4993",
      buckets="liked_movies",
      bucketSorts="count(*) desc",
      bucketSizeLimit="100",
      count(*)),
    liked_movies as node,
    "count(*)",
    replace(collection, null, withValue=movielens),
    replace(field, null, withValue=liked_movies))
)
1. facet (just like above, just with streaming expr)
2. select to format data for scoreNodes
3. scoreNodes to score using TF*IDF
Banana occurs in 2 documents here, 4 globally -- 2/4 = 0.5
Diet coke occurs in 2 documents here, 2 globally -- 2/2 = 1.0
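The expression above can be generated for any set of liked items. The sketch below builds the same `scoreNodes(select(facet(...)))` string and a `/stream` URL carrying it in the `expr` parameter. The `score_nodes_expr` helper and the `localhost:8983` address are assumptions; nothing is sent over the network.

```python
from urllib.parse import urlencode

def score_nodes_expr(collection, field, values, limit=100):
    """Build the scoreNodes(select(facet(...))) expression for the given values."""
    query = " OR ".join("%s:%s" % (field, v) for v in values)
    return (
        "scoreNodes(select("
        'facet(%(c)s, q="%(q)s", buckets="%(f)s", '
        'bucketSorts="count(*) desc", bucketSizeLimit="%(n)d", count(*)), '
        '%(f)s as node, "count(*)", '
        "replace(collection, null, withValue=%(c)s), "
        "replace(field, null, withValue=%(f)s)))"
    ) % {"c": collection, "q": query, "f": field, "n": limit}

expr = score_nodes_expr("movielens", "liked_movies", ["2571", "4993"])
# Hypothetical local SolrCloud address (assumption).
url = "http://localhost:8983/solr/movielens/stream?" + urlencode({"expr": expr})
print(expr)
```

Changing only the `values` list gives recommendations seeded by a different user's likes, while the facet, select and scoreNodes stages stay the same.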
Thinking back on my shoppers' behaviors, here are some other items you might like:
(thanks Joel Bernstein!)
28. Lots of power here
/stream?expr=scoreNodes(facet(...)...)
scoreNodes(
  select(
    facet(movielens,
      q="juiciness_pref:juicy",
      buckets="liked_movies",
      bucketSorts="count(*) desc",
      bucketSizeLimit="100",
      count(*)),
    liked_movies as node,
    "count(*)",
    replace(collection, null, withValue=movielens),
    replace(field, null, withValue=liked_movies))
)
Find users that like juicy things, what do they like?
Perhaps bucket over the aisle they like?
Construct our query to focus on a date range?
Many insights
(thanks Joel Bernstein!)
29. Only glimpsing the underlying pattern...
We're not enumerating the flarglewharbles, and the schlumblefumbles
More flarglewharbily Less flarglewharbily
Diet Coke
More schlumblewumbly Less schlumblewumbly
Diet Coke
30. Coming soon (Solr 6.3)
http://yonik.com/solr-6-3/
https://issues.apache.org/jira/browse/SOLR-9258
- Models for training classifiers
- Then in turn updating documents
Progress is being made!
- Clustering?