OCTOBER 11-14, 2016 • BOSTON, MA
Anyone can build a Recsys w/ Solr!
Doug Turnbull
Relevance Consultant, OpenSource Connections
I’m now available in
book form!
https://www.manning.com/books/relevant-search
Discount code: relsearch (38% off)
http://opensourceconnections.com/about-us/doug-turnbull/
Me The company...
field Body
  term laser
    doc 2
      <metadata>
    doc 4
      <metadata>
  term light
    doc 2
      <metadata>
  term lightsaber
    doc 0
How do search engines work?
The answer can be found in your textbook…
OpenSource Connections
Book Index:
•  Topics -> page no
•  Very efficient tool – compare to
scanning the whole book!
Lucene uses an index:
•  Tokens => document ids:
laser => [2, 4]
light => [2, 5]
lightsaber => [0, 1, 5, 7]
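The token-to-document-ids mapping above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Lucene stores its index; the document texts are hypothetical, chosen so that the posting lists match the ones on the slide.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the sorted ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


# Hypothetical documents, chosen to reproduce the posting lists on the slide.
docs = {
    0: "lightsaber",
    1: "lightsaber",
    2: "laser light",
    4: "laser",
    5: "light lightsaber",
    7: "lightsaber",
}

index = build_inverted_index(docs)
print(index["laser"])
print(index["light"])
print(index["lightsaber"])
```

Looking up a token is then a single dictionary access instead of a scan over every document, which is the same advantage a book index has over reading the whole book.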
What's the point?
Solr:
-  A general-purpose system for looking up content based on the features that
describe it
Tokens aren't really words!
doc0: "I like the bananas"
Analysis: [I] [lik] [banan]
term I:
  doc 0
term lik:
  doc 0
term banan:
  doc 0
Search: "liked banana?"
Analysis: [lik] [banan]
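A toy version of the analysis step might look like the sketch below. The stopword list and the suffix-stripping rules are assumptions made up for this example (a real Solr analysis chain would use a configured tokenizer and stemmer), and unlike the slide it lowercases every token.

```python
import re

# Assumed stopword list and suffix rules, for illustration only.
STOPWORDS = {"the"}
SUFFIXES = ("ed", "as", "a", "e", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase, split into words, drop stopwords, and stem what is left."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOPWORDS]


print(analyze("I like the bananas"))
print(analyze("liked banana?"))
```

Because the same analysis runs at index time and at query time, "liked banana?" ends up as the tokens that doc0 was indexed under, even though the surface words differ.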
TF*IDF -- measuring feature weight
term I:
  doc 0:
    freq: 5
  doc 1:
    freq: 7
  doc 3:
    freq: 4
term banan:
  doc 0:
    freq: 2
"I-ness" is not special
doc0:
tf==5
df==3
(raw) TF*IDF = 5/3 = 1.6667
"Banana-ness" is pretty special
doc0:
tf==2
df==1
(raw) TF*IDF = 2/1 = 2.0
Search is really distributed feature matching and similarity (text-oriented)
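The raw arithmetic on this slide can be reproduced directly from the posting lists. This sketch uses the slide's simplified ratio (term frequency divided by document frequency); it is not the exact formula Lucene applies, which also involves logarithms and normalization.

```python
# Posting lists from the slide: term -> {doc id: term frequency}.
postings = {
    "i": {0: 5, 1: 7, 3: 4},
    "banan": {0: 2},
}


def raw_tf_idf(term, doc_id):
    """Slide-style raw score: term frequency divided by document frequency."""
    docs = postings[term]
    tf = docs[doc_id]
    df = len(docs)
    return tf / df


print(round(raw_tf_idf("i", 0), 4))
print(raw_tf_idf("banan", 0))
```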
Search often stands in for human interactions
I have a craving for a nice
juicy cut of meat. What
might you recommend?
I have JUST the thing!
Searching the market
q=(juiciness:juicy meatiness:meaty)
Modeling arbitrary feature strength
term juicy:
  steak:
    juiciness: 5
  grapefruit:
    juiciness: 7
  orange:
    juiciness: 4
term meaty:
  burger:
    meatiness: 2
What you want:
{
item: "steak",
juiciness: ["juicy", "juicy", "juicy"],
meatiness: ["meaty"]
}
Use term frequency as feature
strength:
{
item: "grapefruit",
juiciness: ["juicy", "juicy", "juicy", "juicy", "juicy"],
meatiness: [""]
}
(remember, Solr doesn't need to store this)
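One way to produce documents like the two above is to repeat a token once per unit of feature strength, so that the term frequency Solr computes carries the strength. The helper names below are hypothetical, and a feature with zero strength becomes an empty list here rather than the `[""]` shown on the slide.

```python
def strength_to_tokens(token, strength):
    """Repeat a token so its term frequency encodes the feature strength."""
    return [token] * strength


def make_item(name, juiciness=0, meatiness=0):
    """Build a document whose multi-valued fields carry repeated tokens."""
    return {
        "item": name,
        "juiciness": strength_to_tokens("juicy", juiciness),
        "meatiness": strength_to_tokens("meaty", meatiness),
    }


steak = make_item("steak", juiciness=3, meatiness=1)
grapefruit = make_item("grapefruit", juiciness=5)
print(steak)
print(grapefruit)
```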
TF*IDF -- measuring feature weight
term juicy:
  doc 0:
    freq: 5
  doc 1:
    freq: 7
  doc 3:
    freq: 4
term meaty:
  doc 0:
    freq: 2
"juicy-ness" is pretty non-special
doc0:
tf==5
df==3
(raw) TF*IDF = 5/3 = 1.6667
"meaty-ness" is pretty special
doc0:
tf==2
df==1
(raw) TF*IDF = 2/1 = 2.0
Search is really distributed feature matching and similarity
Requesting something from my grocer
More juicy Less juicy
More meaty Less meaty
q=meaty juicy
Results: 1.
2.
3.
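To make the `q=meaty juicy` request concrete, here is a minimal sketch that ranks items by summing the raw tf/df score of each matching query term. It reuses the strengths from the "Modeling arbitrary feature strength" slide; the ranking it prints follows only from those toy numbers and the simplified formula, and is not claimed to match the results pictured on the slide or real Solr scoring.

```python
# Strengths from the "Modeling arbitrary feature strength" slide.
postings = {
    "juicy": {"steak": 5, "grapefruit": 7, "orange": 4},
    "meaty": {"burger": 2},
}


def search(query_terms, postings):
    """OR-style query: sum raw tf/df over every query term an item matches."""
    scores = {}
    for term in query_terms:
        docs = postings.get(term, {})
        df = len(docs)
        for item, tf in docs.items():
            scores[item] = scores.get(item, 0.0) + tf / df
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


results = search(["meaty", "juicy"], postings)
for rank, (item, score) in enumerate(results, start=1):
    print(rank, item, round(score, 3))
```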
Recsys also stands in for human interactions
Hi Jane,
Recommend me
something?
Hmm…
<Tom likes limes, what is
similar to limes?>
recommendations
Use existing properties
of thing to recommend
similar things
juicy
citrus
More like this for
unstructured data
What features/tokens are
most representative of this
thing?
http://solr.quepid.com/solr/tmdb/select?q={!mlt%20qf=overview}97&fl=title,id,overview (movies like
juicy
citrus
(search)
Here's some ideas...
{
item: "lime",
juiciness: ["juicy", "juicy", "juicy"],
citrusness: ["citrus", "citrus", "citrus"],
meatiness: [""],
partyness: ["party"]
}
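The More Like This request shown on this slide can be assembled as follows. The sketch only builds the URL and does not send it; the host, collection, field, and document id are taken from the slide's URL, and the helper name is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Host and collection as they appear in the slide's URL.
SELECT_URL = "http://solr.quepid.com/solr/tmdb/select"


def mlt_url(doc_id, field, return_fields):
    """Build a select URL that uses Solr's MLT query parser for one document."""
    params = {
        "q": "{!mlt qf=%s}%s" % (field, doc_id),
        "fl": ",".join(return_fields),
    }
    return SELECT_URL + "?" + urlencode(params)


url = mlt_url(97, "overview", ["title", "id", "overview"])
print(url)
print(parse_qs(urlsplit(url).query)["q"])
```

The `qf` local parameter names the field whose most representative tokens are used to find similar documents, which is the "what features are most representative of this thing?" question from the slide.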
"Content Based" more-like-these
Use existing properties
of thing to recommend
similar things
juicy
meaty
citrus
http://solr.quepid.com/solr/tmdb/select?q={!mlt%20qf=overview}97&fl=title,id,overview (movies like
Here's some ideas...
Jane knows a few more things that Tom likes...
Personalization metadata
Index extra data alongside your
products
{
item: "hamburger",
preferred_by_genders: ["m", …],
preferred_by_ages: ["30_40"]
}
age:30_40
gender:m
http://solr.quepid.com/solr/tmdb/select?q={!mlt%20qf=overview}97&fl=title,id,overview (movies like
Here's some ideas...
Jane knows a few things about Tom
(30 yr old male)
But, Jane's intuition transcends
words!
age:30_40
gender:m
Currently we're stuck with predefined labels:
citrus juicy
meaty
We're curating using
known vocabularies
(can we describe everything?)
What we like often transcends words
There are emergent properties of our world that don't have names
Relative flarglewharbliness
More flarglewharbily Less flarglewharbily
Diet Coke
What's a flarglewharble?
More flarglewharbily Less flarglewharbily
fruit orange lemon banana mentos diet coke
tom X
sue X X X
charlie X X
clare X X
hal x x
Goes together
Diet Coke
Can search find the flargles?
q=(flargliwharbliness:very)
term flarglewharble:
  diet-coke:
    flargleness: 4
  mentos:
    flargleness: 3
  banana:
    flargleness: 1
Can we somehow build?
Diet Coke
person/food orange lemon banana mentos diet coke
tom X X
sue X X X X
charlie X X
clare X X
hal x x X
Goes together
flarglewharble!
Babies often use made-up words based
on emergent patterns in their universe
They are less committed to our
language
What's the point?
Collaborative filtering
Latent vocabulary
(the flarglewharbles)
Pure Search
Content-based Recs
Predefined vocabulary
Can Solr discover the latent/emergent vocabularies?
Well first let's tell Solr about our users
{
user: "Sue",
foods_bought: ["lemon", "banana", "mentos", "diet coke"]
}
{
user: "Charlie",
foods_bought: ["banana", "mentos", "diet coke"]
}
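As a rough sketch, the two user documents above could be serialized into a JSON array, which is a shape Solr's update handler accepts. Nothing is sent here, and the collection that would receive the payload is left unspecified because the slides do not name one for this data.

```python
import json

# The two user documents from the slide.
users = [
    {"user": "Sue", "foods_bought": ["lemon", "banana", "mentos", "diet coke"]},
    {"user": "Charlie", "foods_bought": ["banana", "mentos", "diet coke"]},
]

# A JSON array of documents, ready to POST to an update handler.
payload = json.dumps(users, indent=2)
print(payload)
```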
Faceting?
We need a way to look across users and look for patterns
(analyze all the baskets that contain mentos)
q=foods_bought:mentos&facet=true&facet.field=foods_bought
facets:
mentos: 3
diet-coke: 3
banana: 2
Hmm:
-  Bananas are globally popular
-  Diet-coke is probably what matters
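The facet query above amounts to counting what else appears in the baskets that contain mentos. Here is an in-memory sketch of that counting. Sue's and Charlie's baskets come from the slides; the baskets for Hal, Tom, and Clare are hypothetical, chosen so the counts match the facet output shown above.

```python
from collections import Counter

baskets = {
    "sue": ["lemon", "banana", "mentos", "diet coke"],  # from the slides
    "charlie": ["banana", "mentos", "diet coke"],       # from the slides
    "hal": ["mentos", "diet coke"],                     # hypothetical
    "tom": ["orange", "banana"],                        # hypothetical
    "clare": ["lemon", "banana"],                       # hypothetical
}


def facet_counts(baskets, must_contain):
    """Count every food across the baskets that contain `must_contain`."""
    counts = Counter()
    for foods in baskets.values():
        if must_contain in foods:
            counts.update(foods)
    return counts


counts = facet_counts(baskets, "mentos")
print(counts.most_common())
```

With these assumed baskets lemon also shows up once; the raw counts alone cannot tell which co-occurrences are meaningful, which is the problem the next slide addresses.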
Counts don't work: importance of significance
q=foods_bought:mentos&facet=true&facet.field=foods_bought
facets:
mentos: 3
diet-coke: 3
banana: 2
Diet Coke:
Global popularity: diet coke (3)
Local popularity: 3
Score: 3/3 = 1
Banana:
Global popularity: banana (4)
Local popularity: 2
Score: 2/4 = 0.5
by-significance:
diet-coke: 1
banana: 0.5
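The local/global ratio above can be computed directly from the baskets. This is a sketch of the slide's simplified score only. As before, Sue's and Charlie's baskets come from the slides, and the other three are hypothetical, picked to reproduce the global counts of 3 for diet coke and 4 for banana.

```python
from collections import Counter

baskets = {
    "sue": ["lemon", "banana", "mentos", "diet coke"],  # from the slides
    "charlie": ["banana", "mentos", "diet coke"],       # from the slides
    "hal": ["mentos", "diet coke"],                     # hypothetical
    "tom": ["orange", "banana"],                        # hypothetical
    "clare": ["lemon", "banana"],                       # hypothetical
}


def significance(baskets, seed):
    """Score each co-occurring food as local count / global count."""
    global_counts = Counter(
        food for foods in baskets.values() for food in foods
    )
    local_counts = Counter(
        food for foods in baskets.values() if seed in foods for food in foods
    )
    return {
        food: local_counts[food] / global_counts[food]
        for food in local_counts
        if food != seed
    }


scores = significance(baskets, "mentos")
print(sorted(scores.items(), key=lambda pair: pair[1], reverse=True))
```

Dividing by global popularity pushes the everywhere-popular banana below diet coke, even though both co-occur with mentos.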
Streaming Expressions
/select?q=*:*&facet=true&facet.field=liked_movies
But there's a new sheriff in town!
One option: we could gather the global doc freqs ourselves and compare them with the local counts.
The Terms component is another option… plugins...
Streaming expressions -- distributed stream computation system on top of SolrCloud
You must ALWAYS cross the streams!
Streaming Expressions
/stream?expr=scoreNodes(facet(...)...)
facet(movielens,
q="*:*",
buckets="liked_movies",
bucketSorts="count(*) desc",
bucketSizeLimit="100",
count(*))
Faceting with Streaming Expressions:
Output:
{
"result-set":
{"docs":[
{
"count(*)":55807,
"liked_movies":"318"},
{
"count(*)":52352,
"liked_movies":"296"},
{
"count(*)":50114,
"liked_movies":"593"}
Nodes to be transformed
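A streaming expression such as the facet above is sent as the `expr` parameter of a collection's `/stream` handler. The sketch below only builds the request URL. The host and port are assumptions (a local default install), while `movielens` is the collection named on the slide.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed host and port; adjust for a real cluster.
SOLR_BASE = "http://localhost:8983/solr"

# The facet expression from the slide, as a single string.
FACET_EXPR = (
    "facet(movielens, "
    'q="*:*", '
    'buckets="liked_movies", '
    'bucketSorts="count(*) desc", '
    'bucketSizeLimit="100", '
    "count(*))"
)


def stream_url(collection, expr):
    """URL-encode a streaming expression for the collection's /stream handler."""
    query = urlencode({"expr": expr})
    return f"{SOLR_BASE}/{collection}/stream?{query}"


url = stream_url("movielens", FACET_EXPR)
print(url)
print(parse_qs(urlsplit(url).query)["expr"][0])
```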
Significance with streaming expr
/stream?expr=scoreNodes(facet(...)...)
scoreNodes(
select(
facet(movielens,
q="liked_movies:2571 OR liked_movies:4993",
buckets="liked_movies",
bucketSorts="count(*) desc",
bucketSizeLimit="100",
count(*)),
liked_movies as node,
"count(*)",
replace(collection, null, withValue=movielens),
replace(field, null, withValue=liked_movies))
)
1.  facet (just like above, just with streaming expr)
2.  select to format data for scoreNodes
3.  scoreNodes to score using TF*IDF
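The three steps above nest mechanically, so the expression can be generated from a query string. The helper names below are hypothetical; the expression text they emit mirrors the one on the slide, and only the query changes between uses.

```python
def facet_expr(collection, query, bucket_field, limit=100):
    """Step 1: facet over bucket_field for the documents matching query."""
    return (
        f"facet({collection}, "
        f'q="{query}", '
        f'buckets="{bucket_field}", '
        'bucketSorts="count(*) desc", '
        f'bucketSizeLimit="{limit}", '
        "count(*))"
    )


def score_nodes_expr(collection, query, bucket_field):
    """Steps 2 and 3: reshape the facet tuples with select, then scoreNodes."""
    inner = facet_expr(collection, query, bucket_field)
    return (
        "scoreNodes(select("
        f"{inner}, "
        f"{bucket_field} as node, "
        '"count(*)", '
        f"replace(collection, null, withValue={collection}), "
        f"replace(field, null, withValue={bucket_field})))"
    )


expr = score_nodes_expr(
    "movielens",
    "liked_movies:2571 OR liked_movies:4993",
    "liked_movies",
)
print(expr)
```

Swapping in a different query, for example one on a preference field or a date range, reuses the same scoring pipeline over a different slice of users.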
Banana occurs in 2 documents here, 4 globally -- 2/4 = 0.5
Diet coke occurs in 2 documents here, 2 globally -- 2/2 = 1.0
Thinking back on my shoppers' behaviors, here are some other items you might like:
(thanks Joel Bernstein!)
Diet Coke
Lots of power here
/stream?expr=scoreNodes(facet(...)...)
scoreNodes(
select(
facet(movielens,
q="juiciness_pref:juicy",
buckets="liked_movies",
bucketSorts="count(*) desc",
bucketSizeLimit="100",
count(*)),
liked_movies as node,
"count(*)",
replace(collection, null, withValue=movielens),
replace(field, null, withValue=liked_movies))
)
Find users that like juicy things, what do they like?
Perhaps bucket over the aisle they like?
Construct our query to focus on a date range?
Many insights
(thanks Joel Bernstein!)
Only glimpsing the underlying
pattern...
We're not enumerating the flarglewharbles, and the schlumblefumbles
More flarglewharbily Less flarglewharbily
Diet Coke
More schlumblewumbly Less schlumblewumbly
Diet Coke
Coming soon (Solr 6.3)
http://yonik.com/solr-6-3/
https://issues.apache.org/jira/browse/SOLR-9258
-  Models for training classifiers
-  Then in turn updating documents
Progress is being made!
-  Clustering?
Questions?
The Flarglewharbles

Anyone Can Build A Recommendation Engine With Solr: Presented by Doug Turnbull, OpenSource Connections

  • 1. OCTOBER  11-­‐14,  2016    •    BOSTON,  MA  
  • 2. Anyone  can  build  a  Recsys  w/  Solr!   Doug  Turnbull   Relevance  Consultant,  OpenSource  ConnecIons  
  • 3. I’m now available in book form! https://www.manning.com/books/relevant-search Discount code: relsearch (38% off) http://opensourceconnections.com/about-us/doug-turnbull/ Me The company...
  • 4. field  Body    term  laser          doc  2    <metadata>            doc  4    <metadata>        term  light          doc  2        <metadata>    term  lightsaber          doc  0   How do search engines work? The answer can be found in your textbook… OpenSource Connections Book Index: •  Topics -> page no •  Very efficient tool – compare to scanning the whole book! Lucene uses an index: •  Tokens => document ids: laser => [2, 4] light => [2, 5] lightsaber => [0, 1, 5, 7]
  • 5. What's the point? OpenSource Connections Solr: -  A general purpose system for looking up content based on features that describe them Tokens aren't really words! doc0: "I like the bananas" Analysis Analysis term I: doc 0 term lik doc 0 term banan: doc 0 [lik] [banan]Search: "liked banana?" [I] [lik] [banan]
  • 6. TF*IDF -- measuring feature weight OpenSource Connections term I: doc 0: freq: 5 doc 1: freq: 7 doc 3: freq: 4 term banan: doc 0: freq: 2 "Banana-ness" is pretty special "I-ness" is not special doc0: tf==5 df==3 (raw) TF*IDF = 5/3 = 1.6667 doc0: tf==2 df==1 (raw) TF*IDF = 2/1 = 2.0 Search is really distributed feature matching and similarity (text-oriented)
  • 7. Search often stands in for human interactions I have a craving for a nice juicy cut of meat. What might you recommend? I have JUST the thing!
  • 9. Modeling arbitrary feature strength OpenSource Connections term juicy: steak: juiciness: 5 grapefruit: juiciness: 7 orange: juiciness: 4 term meaty: burger: meatiness: 2 What you want: { item: "steak", juiciness: ["juicy", "juicy", "juicy"], meatiness: ["meaty"] } Use term frequency as feature strength: { item: "grapefruit", juiciness: ["juicy", "juicy", "juicy", "juicy", "juicy"], meatiness: [""] } (remember, Solr doesn't need to store this)
  • 10. TF*IDF -- measuring feature weight OpenSource Connections term juicy: doc 0: freq: 5 doc 1: freq: 7 doc 3: freq: 4 term meaty: doc 0: freq: 2 "meaty-ness" is pretty special "juicy-ness" is pretty non-special doc0: tf==5 df==3 (raw) TF*IDF = 5/3 = 1.6667 doc0: tf==2 df==1 (raw) TF*IDF = 2/1 = 2.0 Search is really distributed feature matching and similarity
  • 11. Requesting something from my grocer More juicy Less juicy More meaty Less meaty q=meaty juicy Results: 1. 2. 3.
  • 12. Recsys also stands in for human interactions Hi Jane, Recommend me something? Hmm… <Tom likes limes, what is similar to limes?>
  • 13. recommendations Use existing properties of thing to recommend similar things juicy citrus More like this for unstructured data What features/tokens are most representative of this thing? http://solr.quepid.com/solr/tmdb/select?q={!mlt%20qf=overview}97&fl=title,id,overview (movies like juicy citrus (search) Here's some ideas... { item: "lime", juiciness: ["juicy", "juicy", "juicy"], citrusness: ["citrus", "citrus", "citrus"], meatiness: [""], partyness: ["party"] }
  • 14. "Content Based" more-like-these Use existing properties of thing to recommend similar things juicy meaty citrus http://solr.quepid.com/solr/tmdb/select?q={!mlt%20qf=overview}97&fl=title,id,overview (movies like Here's some ideas... Jane knows a few more things that Tom likes...
  • 15. Personalization metadata Index extra data alongside your products { item: "hamburger", preferred_by_genders: ["m", …], preferred_by_ages: ["30_40"] } age:30_40 gender:m http://solr.quepid.com/solr/tmdb/select?q={!mlt%20qf=overview}97&fl=title,id,overview (movies like Here's some ideas... Jane knows a few things about Tom (30 yr old male)
  • 16. But Jane's intuition transcends words! age:30_40 gender:m. Currently we're stuck with predefined labels: citrus, juicy, meaty. We're curating using known vocabularies (can we describe everything?)
  • 17. What we like often transcends words. There are emergent properties of our world that don't have names. Relative flarglewharbliness: More flarglewharbily / Less flarglewharbily. Diet Coke
  • 18. What's a flarglewharble? More flarglewharbily / Less flarglewharbily. fruit: orange, lemon, banana, mentos, diet coke. tom X; sue X X X; charlie X X; clare X X; hal x x. Goes together. Diet Coke
  • 19. Can search find the flargles? q=(flarglewharbliness:very) term flarglewharble: diet-coke: flargleness: 4; mentos: flargleness: 3; banana: flargleness: 1. Can we somehow build? Diet Coke
  • 20. person / food: orange, lemon, banana, mentos, diet coke. tom X X; sue X X X X; charlie X X; clare X X; hal x x X. Goes together: flarglewharble! Babies often use made-up words based on emergent patterns in their universe. They are less committed to our language
  • 21. What's the point? Pure search, content-based recs: predefined vocabulary. Collaborative filtering: latent vocabulary (the flarglewharbles). Can Solr discover the latent/emergent vocabularies?
  • 22. Can Solr discover the latent/emergent vocabularies? Well, first let's tell Solr about our users: { user: "Sue", foods_bought: ["lemon", "banana", "mentos", "diet coke"] } { user: "Charlie", foods_bought: ["banana", "mentos", "diet coke"] }
  • 23. Faceting? We need a way to look across users and look for patterns (analyze all the baskets that contain mentos). q=foods_bought:mentos&facet=true&facet.field=foods_bought facets: mentos: 3, diet-coke: 3, banana: 2. Hmm: bananas are globally popular; diet-coke is probably what matters
  • 24. Counts don't work: importance of significance. q=foods_bought:mentos&facet=true&facet.field=foods_bought facets: mentos: 3, diet-coke: 3, banana: 2. Diet Coke: global popularity: 3, local popularity: 3, score: 3/3 = 1. Banana: global popularity: 4, local popularity: 2, score: 2/4 = 0.5. by-significance: diet-coke: 1, banana: 0.5
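The count-versus-significance comparison on the two slides above can be reproduced in a few lines of plain Python. This is a sketch under stated assumptions: Sue's and Charlie's baskets come from the slides, while the baskets for Hal, Tom and Clare are invented here so that the totals match the slide's numbers. The function names are hypothetical, and Solr computes the local counts with a facet query rather than in client code.

```python
from collections import Counter

# Sue and Charlie are from the slides; Hal, Tom and Clare are made-up baskets
# chosen so the counts agree with the slide (mentos 3, diet coke 3, banana 2 of 4).
baskets = {
    "sue": ["lemon", "banana", "mentos", "diet coke"],
    "charlie": ["banana", "mentos", "diet coke"],
    "hal": ["mentos", "diet coke"],
    "tom": ["orange", "banana"],
    "clare": ["lemon", "banana"],
}


def facet_counts(baskets, contains=None):
    """Count baskets per food, optionally only baskets that contain `contains`."""
    counts = Counter()
    for foods in baskets.values():
        if contains is None or contains in foods:
            counts.update(set(foods))  # count each food once per basket
    return counts


def significance(baskets, seed):
    """Score each co-purchased food as local count / global count."""
    local = facet_counts(baskets, contains=seed)
    overall = facet_counts(baskets)
    return {food: local[food] / overall[food] for food in local if food != seed}


local_counts = facet_counts(baskets, contains="mentos")
scores = significance(baskets, "mentos")
print(local_counts)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

Raw counts rank bananas close behind diet coke, while dividing by global popularity separates them, as on the slide.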
  • 25. Streaming Expressions. /select?q=*:*&facet=true&facet.field=liked_movies One option: gather the global doc freqs and compare them against the local counts. The Terms component is another option… plugins... But there's a new sheriff in town! Streaming expressions -- a distributed stream computation system on top of SolrCloud. You must ALWAYS cross the streams!
  • 26. Streaming Expressions. /stream?expr=scoreNodes(facet(...)...) Faceting with Streaming Expressions: facet(movielens, q="*:*", buckets="liked_movies", bucketSorts="count(*) desc", bucketSizeLimit="100", count(*)) Output (nodes to be transformed): { "result-set": {"docs":[ { "count(*)":55807, "liked_movies":"318"}, { "count(*)":52352, "liked_movies":"296"}, { "count(*)":50114, "liked_movies":"593"}
  • 27. Significance with streaming expr. /stream?expr=scoreNodes(facet(...)...) scoreNodes( select( facet(movielens, q="liked_movies:2571 OR liked_movies:4993", buckets="liked_movies", bucketSorts="count(*) desc", bucketSizeLimit="100", count(*)), liked_movies as node, "count(*)", replace(collection, null, withValue=movielens), replace(field, null, withValue=liked_movies)) ) 1. facet (just like above, just with streaming expr) 2. select to format data for scoreNodes 3. scoreNodes to score using TF*IDF. Banana occurs in 2 documents here, 4 globally -- 2/4 = 0.5. Diet coke occurs in 2 documents here, 2 globally -- 2/2 = 1.0. "Thinking back on my shoppers' behaviors, here's some other items you might like:" (thanks Joel Bernstein!) Diet Coke
  • 28. Lots of power here. /stream?expr=scoreNodes(facet(...)...) scoreNodes( select( facet(movielens, q="juiciness_pref:juicy", buckets="liked_movies", bucketSorts="count(*) desc", bucketSizeLimit="100", count(*)), liked_movies as node, "count(*)", replace(collection, null, withValue=movielens), replace(field, null, withValue=liked_movies)) ) Find users that like juicy things: what do they like? Perhaps bucket over the aisle they like? Construct our query to focus on a date range? Many insights (thanks Joel Bernstein!)
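One way to assemble the scoreNodes expression from the slide above and turn it into a request URL is sketched below. The expression text is copied from the slide; the helper names, the `http://localhost:8983/solr` base URL, and the choice to send the expression as a URL-encoded `expr` parameter to the collection's `/stream` handler are assumptions for illustration. The sketch only builds the URL and does not contact a Solr server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def score_nodes_expr(collection, field, query, limit=100):
    """Assemble the slide's facet -> select -> scoreNodes pipeline as a string."""
    return (
        "scoreNodes(select("
        f'facet({collection}, q="{query}", buckets="{field}", '
        f'bucketSorts="count(*) desc", bucketSizeLimit="{limit}", count(*)), '
        f'{field} as node, "count(*)", '
        f"replace(collection, null, withValue={collection}), "
        f"replace(field, null, withValue={field})))"
    )


def stream_url(base_url, collection, expr):
    """URL-encode the expression into the `expr` parameter of /stream."""
    return f"{base_url}/{collection}/stream?{urlencode({'expr': expr})}"


expr = score_nodes_expr(
    "movielens", "liked_movies", "liked_movies:2571 OR liked_movies:4993"
)
# Hypothetical local Solr instance; adjust the base URL for a real deployment.
url = stream_url("http://localhost:8983/solr", "movielens", expr)
print(url)
```

Changing only the `query` argument, for example to `juiciness_pref:juicy`, gives the variation shown on the following slide.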
  • 29. Only glimpsing the underlying pattern... We're not enumerating the flarglewharbles and the schlumblewumbles. More flarglewharbily / Less flarglewharbily: Diet Coke. More schlumblewumbly / Less schlumblewumbly: Diet Coke
  • 30. Coming soon (Solr 6.3): http://yonik.com/solr-6-3/ https://issues.apache.org/jira/browse/SOLR-9258 Progress is being made! Models for training classifiers, then in turn updating documents. Clustering?