Relevancy and synonyms - ApacheCon NA 2013 - Portland, Oregon, USA
About us
• Founded in 2007
• B2B Startup
• Semantic Services started in 2011
• Big Data projects in 2012
• Training, Consultancy and Support on Solr and Hadoop
About me
• Leo Oliveira
• 15 years of experience with websites and search engines
• Specialized in Relevancy and Semantics
• Graduated in Business Management & IT Innovation
The Importance of Relevancy
• The game changer in search
  ✓ Google became what it is due to its relevancy algorithms and PageRank
• Relevancy can be achieved through many different types of data
  ✓ Cross-referencing, log analysis, social media, market research, new ideas, etc.
• It’s about the user
  ✓ And not about what we think of the user
• If you don’t understand your data, find out what is relevant through research
  ✓ Research user behavior and needs, and you will either find the relevant information yourself or find a way to discover what is relevant automatically.
• If users can’t find what they want using your search…
  ✓ … they’ll leave.
The Importance of Having a Result Set
• Sometimes you have little data and users get a lot of zero-result queries.
  ✓ Then it’s time to find the right words for the job
  ✓ Synonym discovery, stemming
  ✓ Analysis: understand what the user is searching for
• Suggest other options
  ✓ Sometimes a zero-result query is just a typo. Use suggestions to lead the user to the right results.
  ✓ Create an interface that makes it easy for the user to try different keywords to get results
  ✓ Analyze words that the same users search together to create special suggestions or second searches that bring back some results
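One way to act on the last point is sketched below: group a query log by user and, for every zero-result query, count the successful queries the same users also ran. The log format, the sample rows and the function name are assumptions made for illustration; a real log would need session boundaries and far more data before the counts mean anything.

```python
from collections import Counter, defaultdict

# Hypothetical log rows: (user_id, query, number_of_results)
LOG = [
    ("u1", "notbook", 0),
    ("u1", "notebook", 42),
    ("u2", "notbook", 0),
    ("u2", "laptop", 17),
    ("u3", "notbook", 0),
    ("u3", "notebook", 42),
    ("u3", "mouse", 8),
]


def suggest_for_zero_results(log):
    """Map each zero-result query to the successful queries of the same users."""
    by_user = defaultdict(list)
    for user, query, hits in log:
        by_user[user].append((query, hits))

    companions = defaultdict(Counter)
    for rows in by_user.values():
        failed = {q for q, hits in rows if hits == 0}
        succeeded = {q for q, hits in rows if hits > 0}
        for bad_query in failed:
            # Each user votes once for every successful query they ran.
            companions[bad_query].update(succeeded)

    # Most frequently co-searched successful queries come first.
    return {
        bad_query: [q for q, _ in counts.most_common()]
        for bad_query, counts in companions.items()
    }


if __name__ == "__main__":
    print(suggest_for_zero_results(LOG))
```

The top candidate for each failed query can feed a "did you mean" suggestion or a 1-way synonym rule after a human has reviewed it.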
Relevancy and Semantics
• THE RELEVANCY RULES (the most forgotten ones)
  ✓ Phrases are more relevant than single words
  ✓ “Pure” words, that is, words exactly as they were typed, are more relevant than stemmed words or even synonyms
  ✓ In general, newer docs are more relevant than older docs, but this rule can be reversed.
  ✓ Closer venues and locations are more relevant than distant ones.
  ✓ Give the right weights to the right fields (it seems simple, but it’s not)
• When you have a mix of relevant things…
  ✓ For instance, in e-commerce you may have to consider the freshest docs, the lowest prices and the best sellers together. Sometimes you need to create a math formula that combines all these items for relevancy, and it’s tricky.
  ✓ You can boost each parameter separately, but then you must test your formulas very well so that no parameter is unintentionally prioritized over the others.
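As a minimal sketch of such a formula, the snippet below normalizes freshness, price and sales to the 0..1 range before weighting them, so that no signal dominates just because its raw numbers are larger. The `Product` fields, the weights and the normalization caps are all assumptions for illustration, not values from the talk.

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    age_days: int      # days since the document was published
    price: float
    units_sold: int


def combined_boost(p, max_age_days, max_price, max_units,
                   w_fresh=0.3, w_price=0.3, w_sales=0.4):
    """Weighted sum of three signals, each scaled to 0..1.

    The caps (max_*) must be positive; values above a cap are clamped.
    """
    freshness = 1.0 - min(p.age_days, max_age_days) / max_age_days
    cheapness = 1.0 - min(p.price, max_price) / max_price
    popularity = min(p.units_sold, max_units) / max_units
    return w_fresh * freshness + w_price * cheapness + w_sales * popularity


if __name__ == "__main__":
    catalog = [
        Product("last season, discounted", age_days=300, price=20.0, units_sold=90),
        Product("new arrival, full price", age_days=5, price=80.0, units_sold=10),
    ]
    for item in catalog:
        print(item.name, round(combined_boost(item, 365, 100.0, 100), 3))
```

Because the weights add up to 1, the result also stays between 0 and 1, which makes it easier to compare formulas while testing.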
Relevancy and Weighting
• USING DISMAX (an example of relevancy configuration)
  ✓ With Dismax, you have two main options: qf and pf.
  ✓ Phrase Fields (pf) should carry bigger weights than the corresponding Query Fields (qf)
  ✓ Field weights must be decided carefully. You can use a scale, such as the Fibonacci numbers, so that you don’t lose track of what is more relevant and what is less relevant in your “search formula”
• EXAMPLE OF pf for a movies website:
  ✓ MovieName^34 MovieActors^21 MovieDirector^13 MovieYear^8
• EXAMPLE OF qf:
  ✓ MovieName^21 MovieActors^13 MovieDirector^8 MovieYear^5
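The weights above can be generated from the Fibonacci scale instead of being typed by hand. The sketch below does that and assembles the request parameters; the helper names and the sample query text are assumptions, while `defType`, `q`, `qf` and `pf` are standard Solr DisMax parameters.

```python
from urllib.parse import urlencode

# Fields ordered from most to least relevant, as in the movie example.
FIELDS = ["MovieName", "MovieActors", "MovieDirector", "MovieYear"]


def fibonacci(n):
    """First n terms of the scale 1, 2, 3, 5, 8, 13, 21, 34, ..."""
    seq = [1, 2]
    while len(seq) < n:
        seq.append(seq[-1] + seq[-2])
    return seq[:n]


def boosts(fields, lowest_index):
    """Give the least relevant field scale[lowest_index] and climb from there."""
    scale = fibonacci(lowest_index + len(fields))
    weights = reversed(scale[lowest_index:])
    return " ".join(f"{field}^{weight}" for field, weight in zip(fields, weights))


params = {
    "defType": "dismax",
    "q": "tom hanks",            # hypothetical user query
    "qf": boosts(FIELDS, 3),     # lowest weight 5
    "pf": boosts(FIELDS, 4),     # one step higher on the scale, lowest weight 8
}
query_string = urlencode(params)

if __name__ == "__main__":
    print(params["qf"])
    print(params["pf"])
    print(query_string)
```

Shifting pf one step up the scale keeps each phrase weight above the weight of the same field in qf without any manual bookkeeping.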
Synonym Theory vs. Synonyms in Search
• In search, synonyms can get complicated
  ✓ You don’t need to use only “real” synonyms
  ✓ You must focus on user needs rather than on a thesaurus
  ✓ You can also use synonyms for other useful things, such as correcting the most common typos, or even mapping unrelated words that do the job for you
  ✓ Sometimes you just need to get creative.
1-way transformations
• SynonymFilter
  ✓ E.g. compter => computer
  ✓ Very useful for common typos or misconceptions
  ✓ Sometimes requires expansion.
  ✓ In our setup, we’ll define a fieldType that does the job.
2-way transformations
• SynonymFilter
  ✓ E.g. computer, pc, mac, notebook, server
  ✓ Expansion of meanings
  ✓ Sometimes you don’t use exact synonyms. Brand names and other “real world” terms can be used.
  ✓ It’s hard to configure and maintain 1-way and 2-way rules in a single synonyms.txt file.
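To make the difference between the two rule types concrete, here is a rough simulation of how 1-way (`=>`) and 2-way (comma-separated) rules rewrite query tokens. It handles single-word rules only and is not Solr's SynonymFilter implementation; the function names are assumptions. Keeping the corrections and the expansions in separate rule sets mirrors the idea of using separate files.

```python
def parse_rules(lines, expand=True):
    """Parse synonyms.txt-style lines into {token: [alternatives]}."""
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if "=>" in line:
            # 1-way rule: the left side is replaced by the right side.
            left, right = line.split("=>", 1)
            sources = [t.strip().lower() for t in left.split(",")]
            targets = [t.strip().lower() for t in right.split(",")]
        else:
            # 2-way rule: every term maps to the whole group when expanding,
            # or to the first term only when expand is False.
            group = [t.strip().lower() for t in line.split(",")]
            sources = group
            targets = group if expand else group[:1]
        for source in sources:
            bucket = mapping.setdefault(source, [])
            for target in targets:
                if target not in bucket:
                    bucket.append(target)
    return mapping


def apply_rules(tokens, mapping):
    """Return, for each token position, the list of alternatives."""
    return [mapping.get(tok.lower(), [tok.lower()]) for tok in tokens]


CORRECTIONS = parse_rules(["compter => computer"], expand=False)
EXPANSIONS = parse_rules(["computer, pc, mac, notebook, server"], expand=True)


def rewrite(query):
    """Correct typos first, then expand meanings."""
    tokens = query.split()
    corrected = [alts[0] for alts in apply_rules(tokens, CORRECTIONS)]
    return apply_rules(corrected, EXPANSIONS)


if __name__ == "__main__":
    print(rewrite("cheap Compter"))
```

Running the corrections before the expansions means a typo such as "compter" still benefits from the expansion group of the corrected word.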
Index Time vs. Query Time
• Most synonyms should be applied at query time only
  ✓ But there are exceptions. E.g.: photoalbum, photo album
  ✓ A phrase synonym may require an index-time synonym to work correctly
  ✓ We’ll discuss how to achieve this in our fieldType.
  ✓ The index-time synonym file is also useful when some terms contain symbols that could be dropped by the tokenizer. E.g. C++ => cpp, .NET => dotnet, etc.
  ✓ Keep in mind that the index-time file holds the exceptions that don’t work well at query time. Some are easy to identify, such as phrase synonyms; others you will only find through testing.
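The symbol case can be illustrated with a crude simulation: if analysis strips non-alphanumeric characters, "C++" and "C#" both collapse to "c" unless they are mapped to safe tokens first. The regex below is only a stand-in for whatever step drops the symbols in a real analysis chain, and the mapping table and function name are assumptions.

```python
import re

# Hypothetical index-time mappings for terms that contain symbols.
INDEX_TIME_MAP = {"c++": "cpp", "c#": "csharp", ".net": "dotnet"}


def analyze(text, symbol_map=None):
    """Whitespace-tokenize, optionally protect symbol terms, then strip symbols."""
    tokens = []
    for raw in text.split():
        token = raw.lower()
        if symbol_map and token in symbol_map:
            # Replace the term before the symbols are lost.
            token = symbol_map[token]
        cleaned = re.sub(r"[^0-9a-z]+", "", token)
        if cleaned:
            tokens.append(cleaned)
    return tokens


if __name__ == "__main__":
    text = "C++ and C# developer"
    print(analyze(text))                   # both languages become the same token
    print(analyze(text, INDEX_TIME_MAP))   # the mapping keeps them apart
```

The same mapping has to be applied to queries as well, otherwise a search for "C++" would not reach the "cpp" token stored in the index.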
Some examples
• Query time
  ✓ Enginer => Engineer (1-way)
  ✓ Engineer, engineering (2-way with expand=true)
  ✓ Computer, pc, mac
• Index time
  ✓ C# => csharp
  ✓ .NET => dotnet
  ✓ Human resources, HR
  ✓ Business intelligence, BI
  ✓ CRM, Customer Relationship Management
Creating a fieldtype with semantic relevancy
<fieldType class="solr.TextField" name="text_pt" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_indextime.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="1" catenateWords="1" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.BrazilianStemFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_corrections.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_expansions.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="0" catenateWords="0" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.BrazilianStemFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
Creating a fieldtype with semantic relevancy
<fieldType class="solr.TextField" name="text_pt" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- 1. Remove the stopwords listed in stopwords_pt.txt -->
    <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/>
    <!-- 2. Apply the index-time synonyms (phrases, terms with symbols) -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_indextime.txt" ignoreCase="true" expand="true"/>
    <!-- 3. Split on delimiters and case changes, catenating word and number parts -->
    <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="1" catenateWords="1" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/>
    <!-- 4. Lowercase -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- 5. Stem Brazilian Portuguese words -->
    <filter class="solr.BrazilianStemFilterFactory"/>
    <!-- 6. Fold accented characters to ASCII -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- 1. Remove the stopwords listed in stopwords_pt.txt -->
    <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/>
    <!-- 2. Apply the 1-way corrections (expand="false") -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_corrections.txt" ignoreCase="true" expand="false"/>
    <!-- 3. Apply the 2-way expansions (expand="true") -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_expansions.txt" ignoreCase="true" expand="true"/>
    <!-- 4. Split on delimiters and case changes, without catenation -->
    <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="0" catenateWords="0" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/>
    <!-- 5. Lowercase -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- 6. Stem Brazilian Portuguese words -->
    <filter class="solr.BrazilianStemFilterFactory"/>
    <!-- 7. Fold accented characters to ASCII -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
Creating a fieldtype with semantic relevancy
<fieldType class="solr.TextField" name="text_pt_pure" positionIncrementGap="100">
  <!-- Same chain as the text_pt index analyzer, minus the synonym and stemming filters -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="1" catenateWords="1" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
Creating a relevant formula for search
• Imagine you have a field named “Product_name”
  ✓ To keep search results semantically relevant, you’ll need a “Product_name” field and also a “Product_name_pure” field
  ✓ Keep in mind that phrases are also more relevant than individual words
• In that case, here’s how solrconfig.xml would look:
  ✓ QF (query fields, single words)
  ✓ Product_name_pure^5 Product_name^3
  ✓ PF (phrase fields)
  ✓ Product_name_pure^8 Product_name^5
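The effect of these weights can be sketched with a toy scorer: each query term earns the best qf weight among the fields it matches, and a phrase match earns the best pf weight. This is a deliberately simplified model and not Solr's actual DisMax scoring (no TF-IDF, no tie-breaker); `crude_stem` is a hypothetical stand-in for a real stemmer.

```python
QF = {"Product_name_pure": 5, "Product_name": 3}
PF = {"Product_name_pure": 8, "Product_name": 5}


def crude_stem(token):
    """Very rough stand-in for a stemmer: drops a trailing 's'."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def analyze(text, stem):
    tokens = text.lower().split()
    return [crude_stem(t) for t in tokens] if stem else tokens


def contains_phrase(doc_tokens, query_tokens):
    """True when query_tokens appear contiguously inside doc_tokens."""
    n = len(query_tokens)
    if n == 0:
        return False
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))


def score(query, product_name):
    # The "pure" field keeps words as typed; the other field is stemmed.
    docs = {
        "Product_name_pure": analyze(product_name, stem=False),
        "Product_name": analyze(product_name, stem=True),
    }
    queries = {
        "Product_name_pure": analyze(query, stem=False),
        "Product_name": analyze(query, stem=True),
    }
    total = 0
    for i in range(len(query.split())):
        # Best single-word weight among the fields where this term matches.
        total += max((QF[f] for f in docs if queries[f][i] in docs[f]), default=0)
    # Best phrase weight among the fields where the whole query matches.
    total += max((PF[f] for f in docs if contains_phrase(docs[f], queries[f])),
                 default=0)
    return total


if __name__ == "__main__":
    for name in ("red shoes", "red shoe", "shoes in red"):
        print(name, score("red shoes", name))
```

With the query "red shoes", the exact title ranks first, the stemmed-only match second, and the title that contains both words without the phrase last, which is the ordering the relevancy rules call for.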
What will you get?
• Semantically relevant results
  ✓ Users will understand the results
  ✓ Synonyms or stemmed tokens won’t disturb the ranking or create noise
  ✓ Similar to what the main search websites do for many different things, such as document search, e-commerce, intranet search, library search, etc.
  ✓ You will be able to find relevant results even in a scenario with millions or billions of documents.
INFINITE POSSIBILITIES
THANK YOU!
Get in touch!
Email: loliveira@semantix.com.br
Twitter: @SemantixBR

Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 

Relevancy and synonyms - ApacheCon NA 2013 - Portland, Oregon, USA

  • 1.
  • 2. About us
      • Founded in 2007
      • B2B Startup
      • Semantic Services started in 2011
      • Big Data projects in 2012
      • Training, Consultancy and Support on Solr and Hadoop
  • 3. About me
      • Leo Oliveira
      • 15 years of experience with websites and search engines
      • Specialized in Relevancy and Semantics
      • Graduated in Business Management & IT Innovation
  • 4. The Importance of Relevancy
      • The game changer in Search
      • Google became what it is due to relevancy algorithms and PageRank
      • Can be achieved through many different types of data
      • Cross-reference, log analysis, social media, market research, new ideas etc.
      • It’s about the user
      • And not about what we think of the user
      • If you don’t understand your data, find what is relevant through research
      • Research user behavior and needs, and you’ll find a way of surfacing relevant information, or a way of discovering what is relevant automatically.
      • If the user can’t find what they want using your search…
      • … they’ll leave.
  • 5. The Importance of having a resultset
      • Sometimes you have little data and users get a lot of zero-result queries.
      • Then it’s time to find the right words for the job
      • Synonym discovery, stemming
      • Analysis: understand what the user is searching for
      • Suggest other options
      • Sometimes a zero-result query is just a typo. Use suggestions to find the right results.
      • Create an interface that makes it easy for the user to try different keywords to get results
      • Analyze words that are searched together by the same users to create special suggestions or second searches that bring some results
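The log analysis suggested on this slide can be sketched in a few lines. This is only an illustration: the log layout (`user_id`, `query`, `num_results` tuples in chronological order) and the function name `suggestion_candidates` are assumptions, not something the talk specifies. The idea is to pair each zero-result query with the next query from the same user that did return results.

```python
from collections import Counter, defaultdict


def suggestion_candidates(log):
    """Map each zero-result query to the follow-up query that most often worked.

    `log` is a hypothetical iterable of (user_id, query, num_results) tuples,
    assumed to be in chronological order.
    """
    pending = {}                    # user_id -> that user's latest zero-result query
    counts = defaultdict(Counter)   # failed query -> Counter of successful follow-ups
    for user, query, hits in log:
        query = query.strip().lower()
        if hits == 0:
            # Only the most recent failed query per user is remembered.
            pending[user] = query
        elif user in pending:
            counts[pending.pop(user)][query] += 1
    return {failed: followups.most_common(1)[0][0]
            for failed, followups in counts.items()}


log = [
    ("u1", "notebok", 0),
    ("u1", "notebook", 12),
    ("u2", "notebok", 0),
    ("u2", "notebook", 9),
    ("u3", "laptop", 4),
]
candidates = suggestion_candidates(log)
```

Pairs that show up often are candidates for a 1-way correction or a "did you mean" suggestion; they still need a human review before going into a synonyms file.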
  • 6. Relevancy and Semantics
      • THE RELEVANCY RULES (the most forgotten ones)
      • Phrases are more relevant than single words
      • “Pure” words, that is, words the way they were typed, are more relevant than stemmed words or even synonyms
      • In general, newer docs are more relevant than older docs, but this rule could be reversed.
      • Closer venues and locations are more relevant than distant ones.
      • Give the right weights to the right fields (seems simple, but it’s not)
      • When you have a mix of relevant things…
      • For instance, on an e-commerce site you must consider the freshest docs, plus the lowest prices and the best sellers. Sometimes you need to create a math formula that copes with all these items for relevancy. And it’s tricky.
      • You can boost each parameter separately, but then you must test your formulas thoroughly so that no parameter is unintentionally prioritized over the others.
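One possible shape for such a formula, as a hedged sketch: normalize each signal to the 0..1 range first, then take a weighted average, so that no signal dominates merely because of its units. The function name, the normalizations and the default weights below are all assumptions made for illustration; the talk does not give a concrete formula.

```python
from datetime import date


def combined_score(published, price, units_sold, *, today,
                   max_price, max_units_sold,
                   w_fresh=1.0, w_price=1.0, w_sales=1.0):
    """Blend freshness, low price and sales into one 0..1 score (illustrative only).

    Assumes max_price and max_units_sold are positive catalogue-wide maxima.
    """
    age_days = max((today - published).days, 0)
    freshness = 1.0 / (1.0 + age_days / 365.0)        # 1.0 today, 0.5 after 365 days
    cheapness = 1.0 - min(price / max_price, 1.0)      # cheaper -> closer to 1.0
    popularity = min(units_sold / max_units_sold, 1.0)
    total_weight = w_fresh + w_price + w_sales
    return (w_fresh * freshness
            + w_price * cheapness
            + w_sales * popularity) / total_weight


new_doc = combined_score(date(2014, 1, 1), 50.0, 50, today=date(2014, 1, 1),
                         max_price=100.0, max_units_sold=100)
old_doc = combined_score(date(2013, 1, 1), 50.0, 50, today=date(2014, 1, 1),
                         max_price=100.0, max_units_sold=100)
```

Because every component sits on the same scale, changing one weight has a predictable effect, which makes the testing the slide asks for easier.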
  • 7. Relevancy and Weighting
      • USING DISMAX (an example of relevancy configuration)
      • By using Dismax, you have two main options: qf and pf.
      • Phrase Fields (pf) should have bigger weights than all of your Query Fields (qf)
      • Field weights must be decided carefully. You can use a scale, such as Fibonacci numbers, so as not to lose control of what is more relevant and what is not that relevant in your “search formula”
      • EXAMPLE OF pf for a movies website:
          • MovieName^34 MovieActors^21 MovieDirector^13 MovieYear^8
      • EXAMPLE OF qf:
          • MovieName^21 MovieActors^13 MovieDirector^8 MovieYear^5
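The Fibonacci scale above can be generated rather than typed by hand. The sketch below is an assumption about how one might do it (the helper names `fibonacci_from` and `boost_string` are made up for this example): fields are listed most relevant first, and pf simply starts one Fibonacci step higher than qf.

```python
def fibonacci_from(smallest, count):
    """Return `count` consecutive Fibonacci numbers starting at `smallest`."""
    a, b = 1, 2
    while a < smallest:
        a, b = b, a + b
    if a != smallest:
        raise ValueError(f"{smallest} is not a Fibonacci number")
    values = []
    for _ in range(count):
        values.append(a)
        a, b = b, a + b
    return values


def boost_string(fields, smallest):
    """Build a 'Field^weight ...' string; `fields` is ordered most relevant first."""
    weights = reversed(fibonacci_from(smallest, len(fields)))
    return " ".join(f"{field}^{weight}" for field, weight in zip(fields, weights))


fields = ["MovieName", "MovieActors", "MovieDirector", "MovieYear"]
qf = boost_string(fields, smallest=5)
pf = boost_string(fields, smallest=8)
```

With these inputs, `qf` and `pf` reproduce the two example strings from the slide, and every pf weight stays one step above the matching qf weight.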
  • 8. Synonyms Theory Vs Synonyms in Search
  • 9. Synonym Theory Vs Synonym in Search
      • In search, synonyms can get complicated
      • You don’t need to use only “real” synonyms
      • You must focus on user needs rather than on a thesaurus
      • You can use them for other useful things, such as correcting the most common typos, or even mapping unrelated words that could do the job for you
      • Sometimes you just need to get creative.
  • 10. 1-way transformations
      • SynonymFilter
      • E.g. Compter => computer
      • Very useful for common typos or misconceptions
      • Sometimes requires expansion.
      • In our setup, we’ll set a fieldtype that will do the job.
  • 11. 2-way transformations
      • SynonymFilter
      • E.g. computer, pc, mac, notebook, server
      • Expansion of meanings
      • Sometimes you don’t use exact synonyms. Brand names and other “real world” terms can be used.
      • It’s hard to configure and maintain 1-way and 2-way in a single synonyms.txt file.
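To make the difference between the two rule types concrete, here is a small sketch that mimics how the Solr synonym file format is interpreted: `a => b` is an explicit 1-way mapping, while a comma-separated list is an equivalence group that either expands to every member (`expand=true`) or collapses to the first one (`expand=false`). It is a simplified model, not Solr's implementation: it handles single-word terms only, and the function names are invented for this example.

```python
def parse_rules(lines, expand):
    """Turn synonym-file lines into a {term: [replacement terms]} mapping."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # 1-way: every term on the left is replaced by the terms on the right.
            left, right = line.split("=>", 1)
            targets = [t.strip().lower() for t in right.split(",")]
            for source in left.split(","):
                mapping[source.strip().lower()] = targets
        else:
            # 2-way group: expand to all members, or collapse to the first one.
            terms = [t.strip().lower() for t in line.split(",")]
            targets = terms if expand else terms[:1]
            for term in terms:
                mapping[term] = targets
    return mapping


def rewrite_tokens(tokens, mapping):
    """Replace each token by its mapped terms; unmapped tokens pass through."""
    out = []
    for token in tokens:
        out.extend(mapping.get(token.lower(), [token]))
    return out


corrections = parse_rules(["Compter => computer"], expand=False)
expansions = parse_rules(["computer, pc, mac, notebook, server"], expand=True)
```

Keeping the two kinds of rules in separate mappings mirrors the point on the slide: corrections and expansions need different `expand` settings, which is awkward to express in one file.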
  • 12. Index Time Vs Query Time
      • Most synonyms should work at query time only
      • But there are exceptions, e.g. photoalbum, photo album
      • A phrase synonym might require an index-time synonym to work correctly
      • In our fieldtype, we’ll discuss how to achieve this.
      • The index-time synonym file is also useful when terms contain symbols that could be dropped by the tokenizer or a later filter, e.g. C++ => cpp or .NET => dotnet
      • Keep in mind that the index-time file holds the exceptions that don’t work well at query time. Some are easy to identify, such as phrase synonyms; others only show up through testing.
  • 13. Some examples
      • Query time
          • Enginer => Engineer (1-way)
          • Engineer, engineering (2-way with expand=true)
          • Computer, pc, mac
      • Index time
          • C# => csharp
          • .NET => dotnet
          • Human resources, HR
          • Business intelligence, BI
          • CRM, Customer Relationship Management
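Why the symbol-bearing terms belong in the index-time file can be shown with a toy analysis chain. The `strip_symbols` function below is a deliberately crude stand-in for a delimiter-style filter (it just drops everything that is not a letter or digit); the real Solr filters behave in more detail than this, so treat it as an assumption-laden sketch.

```python
import re

# Index-time mappings taken from the examples on the slides.
INDEX_TIME_SYNONYMS = {"c++": "cpp", "c#": "csharp", ".net": "dotnet"}


def strip_symbols(token):
    """Crude stand-in for a delimiter filter: lowercase, keep letters and digits."""
    return re.sub(r"[^0-9a-z]", "", token.lower())


def analyze(text, synonyms=None):
    """Whitespace-tokenize, optionally map synonyms, then strip symbols."""
    tokens = text.split()
    if synonyms:
        # The mapping must run before the symbols are stripped.
        tokens = [synonyms.get(token.lower(), token) for token in tokens]
    return [strip_symbols(token) for token in tokens]


without_mapping = analyze("C++ and C# developer")
with_mapping = analyze("C++ and C# developer", INDEX_TIME_SYNONYMS)
```

Without the mapping, both languages collapse to the same token `c` and can no longer be told apart; with it, they are stored as `cpp` and `csharp`.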
  • 14. Creating a fieldtype with semantic relevancy
      <fieldType class="solr.TextField" name="text_pt" positionIncrementGap="100">
        <analyzer type="index">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/>
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms_indextime.txt" ignoreCase="true" expand="true"/>
          <filter catenateAll="0" catenateNumbers="1" catenateWords="1" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.BrazilianStemFilterFactory"/>
          <filter class="solr.ASCIIFoldingFilterFactory"/>
        </analyzer>
        <analyzer type="query">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/>
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms_corrections.txt" ignoreCase="true" expand="false"/>
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms_expansions.txt" ignoreCase="true" expand="true"/>
          <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="0" catenateWords="0" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.BrazilianStemFilterFactory"/>
          <filter class="solr.ASCIIFoldingFilterFactory"/>
        </analyzer>
      </fieldType>
  • 15. Creating a fieldtype with semantic relevancy
      <fieldType class="solr.TextField" name="text_pt" positionIncrementGap="100">
        <analyzer type="index">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <!-- 1 --> <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/>
          <!-- 2 --> <filter class="solr.SynonymFilterFactory" synonyms="synonyms_indextime.txt" ignoreCase="true" expand="true"/>
          <!-- 3 --> <filter catenateAll="0" catenateNumbers="1" catenateWords="1" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/>
          <!-- 4 --> <filter class="solr.LowerCaseFilterFactory"/>
          <!-- 5 --> <filter class="solr.BrazilianStemFilterFactory"/>
          <!-- 6 --> <filter class="solr.ASCIIFoldingFilterFactory"/>
        </analyzer>
  • 16. Creating a fieldtype with semantic relevancy
        <analyzer type="query">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <!-- 1 --> <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/>
          <!-- 2 --> <filter class="solr.SynonymFilterFactory" synonyms="synonyms_corrections.txt" ignoreCase="true" expand="false"/>
          <!-- 3 --> <filter class="solr.SynonymFilterFactory" synonyms="synonyms_expansions.txt" ignoreCase="true" expand="true"/>
          <!-- 4 --> <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="0" catenateWords="0" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/>
          <!-- 5 --> <filter class="solr.LowerCaseFilterFactory"/>
          <!-- 6 --> <filter class="solr.BrazilianStemFilterFactory"/>
          <!-- 7 --> <filter class="solr.ASCIIFoldingFilterFactory"/>
        </analyzer>
      </fieldType>
  • 17. Creating a fieldtype with semantic relevancy
        <analyzer type="query">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <!-- 1 --> <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/>
          <!-- 2 --> <filter class="solr.SynonymFilterFactory" synonyms="synonyms_corrections.txt" ignoreCase="true" expand="false"/>
          <!-- 3 --> <filter class="solr.SynonymFilterFactory" synonyms="synonyms_expansions.txt" ignoreCase="true" expand="true"/>
          <!-- 4 --> <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="0" catenateWords="0" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/>
          <!-- 5 --> <filter class="solr.LowerCaseFilterFactory"/>
          <!-- 6 --> <filter class="solr.BrazilianStemFilterFactory"/>
          <!-- 7 --> <filter class="solr.ASCIIFoldingFilterFactory"/>
        </analyzer>
      </fieldType>
  • 18. Creating a fieldtype with semantic relevancy
      <!-- "Pure" variant: same chain, but without the synonym and stemming filters -->
      <fieldType class="solr.TextField" name="text_pt_pure" positionIncrementGap="100">
        <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/>
          <filter catenateAll="0" catenateNumbers="1" catenateWords="1" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.ASCIIFoldingFilterFactory"/>
        </analyzer>
      </fieldType>
  • 19. Creating a relevant formula for search
      • Imagine you have a field named “Product_name”
      • To keep search results semantically relevant, you’ll need a “Product_name” field and also a “Product_name_pure” field
      • Keep in mind that phrases are also more relevant than individual words
      • In that case, here’s how the configuration in solrconfig.xml would look:
      • QF (query fields, single words)
          • Product_name_pure^5 Product_name^3
      • PF (phrase fields)
          • Product_name_pure^8 Product_name^5
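The same weights can also be sent per request instead of being fixed in the handler defaults. The sketch below only builds the query string; `defType`, `q`, `qf` and `pf` are standard DisMax request parameters, while the helper name `dismax_params` and the sample query are assumptions for illustration. No Solr host or core is assumed, so nothing is actually sent.

```python
from urllib.parse import urlencode


def dismax_params(user_query, qf, pf):
    """Build DisMax request parameters from {field: boost} dicts."""
    def boosts(fields):
        return " ".join(f"{name}^{boost}" for name, boost in fields.items())

    return {
        "defType": "dismax",
        "q": user_query,
        "qf": boosts(qf),   # single-word matches
        "pf": boosts(pf),   # whole-phrase matches, weighted higher
    }


params = dismax_params(
    "notebook gamer",
    qf={"Product_name_pure": 5, "Product_name": 3},
    pf={"Product_name_pure": 8, "Product_name": 5},
)
query_string = urlencode(params)  # ready to append to a select URL
```

In both dicts the pure field outranks the stemmed, synonym-expanded field, and each pf weight is higher than the matching qf weight, which is how the two relevancy rules from the earlier slides end up in one request.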
  • 20. What will you get?
      • Semantically relevant results
      • Users will understand the results
      • Synonyms or stemmed tokens won’t disturb or create noise
      • Similar to what the major search websites do for many different things, such as document search, e-commerce, intranet search, library search etc.
      • The ability to find relevant results even in a scenario with millions or billions of documents.
  • 21. INFINITE POSSIBILITIES
  • 22. THANK YOU! Get in touch!
      • Email: loliveira@semantix.com.br
      • Twitter: @SemantixBR