Search Intelligence @elo7.com

Fernando Meyer, Felipe Besson
March 9, 2013

Outline
Some data about our data
Some history
Apache Solr
How Lucene Works
Examples
Terms
Inverted index
How a result is scored against a query in Lucene
Lucene conceptual Scoring formula
How have we optimized our index
How to declare a Solr index
Infrastructure Upgrade
version 2 - single node
version 3 - current infrastructure
Frenzy API
Example of product operation
Content recommendation
Architecture
http://elo7.com 2013 3/29
Current Scenario
Future Work
Content Tracker
BigData Analytics

About
Fernando Meyer - Undergraduate degree in Applied Mathematics from the University
of São Paulo. He has more than 12 years of experience in R&D, deploying cool
systems for companies like Red Hat (JBoss), Globo and Locaweb. He currently focuses
his research and interests on machine learning, information retrieval and statistics.
Felipe Besson - B.S. in Information Systems and Master's in Computer Science
from the University of São Paulo, Brazil. His research focused on automated
testing of web service compositions. Now he is expanding his horizons by working
with search, data mining, machine learning and other geek stuff.

Some data about our data
• 3000 (avg.) queries per second
• from 3500 to 4200 users on site per minute
• 15000 requests per minute on AppServer
• 160000 (avg.) bot/requests per day
• 1200000 indexed products
• 20000 active sellers

Some history
• Search v0.0 - select * from product where text like '%query%'
• Search v0.1 - Sphinx
– No delta index
– Poor index/query performance for large scale dataset
• Search v1.0 - Apache Solr

Apache Solr
Solr is written in Java and runs as a standalone full-text search server within a
servlet container such as Jetty. Solr uses the Lucene Java search library at its
core for full-text indexing and search, and has REST-like HTTP/XML and JSON
APIs that make it easy to use from virtually any programming language.

How Lucene Works
Lucene is an inverted full-text index. This means that it takes all the documents,
splits them into words, and then builds an index entry for each word that lists
where the word occurs. Since a lookup is an exact, unordered string match, it can
be extremely fast.

Examples
Terms
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"
Inverted index
"a": {(2, 2)}
"banana": {(2, 3)}
"is": {(0, 1), (0, 4), (1, 1), (2, 1)}
"it": {(0, 0), (0, 3), (1, 2), (2, 0)}
"what": {(0, 2), (1, 0)}
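The index above can be reproduced with a few lines of Python. This is only a minimal sketch of the idea, assuming plain whitespace splitting and in-memory sets; Lucene's real index structures are far more elaborate. The function names `build_inverted_index` and `search_all` are made up for this illustration.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the set of (document number, word position) pairs."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for position, word in enumerate(text.split()):
            index[word].add((doc_id, position))
    return index


def search_all(index, words):
    """Return the numbers of the documents that contain every given word."""
    doc_sets = [{doc_id for doc_id, _ in index.get(word, set())} for word in words]
    return set.intersection(*doc_sets) if doc_sets else set()


documents = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(documents)

print(sorted(index["is"]))                              # [(0, 1), (0, 4), (1, 1), (2, 1)]
print(sorted(search_all(index, ["what", "is", "it"])))  # [0, 1]
```

A query only has to look up each word once and intersect the resulting document sets, which is why exact-match lookups stay fast as the collection grows.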

How a result is scored against a query in Lucene
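A.K.A. the answer to the million-dollar question: why isn't this product appearing
when searching for "bleh"?
Lucene conceptual scoring formula
$$\mathrm{score}(q,d) = \text{coord-factor}(q,d) \cdot \text{query-boost}(q) \cdot \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} \cdot \text{doc-len-norm}(d) \cdot \mathrm{score}(d)$$
where A and B are the query and document term vectors.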
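The vector factor in the formula above (the dot product A·B divided by the lengths of A and B) is a cosine similarity between the query and the document. The sketch below computes only that factor, assuming raw term counts as vector weights; Lucene itself uses tf-idf style weights and multiplies in the other factors, none of which are modelled here. The function name `cosine_similarity` is chosen for this example.

```python
import math
from collections import Counter


def cosine_similarity(query, document):
    """Cosine of the angle between the term-count vectors of two texts."""
    a = Counter(query.lower().split())
    b = Counter(document.lower().split())
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text matches nothing
    return dot / (norm_a * norm_b)


print(cosine_similarity("banana", "it is a banana"))  # 0.5
print(cosine_similarity("banana", "what is it"))      # 0.0
```

A product that shares no terms with the query scores zero on this factor, which is often the first thing to check when a product does not show up for a search.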

How have we optimized our index
<fieldType name="text_pt_br" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="com.elo7.solr.analysis.OrengoStemmerFilterFa
            exceptionList="stemmerignore.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonym
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="com.elo7.solr.analysis.OrengoStemmerFilterFa
            exceptionList="stemmerignore.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
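To get a feel for what such an analyzer chain does to a product title, here is a rough Python approximation of four of its steps: whitespace tokenizing, case-insensitive stop word removal, lowercasing and accent folding. It is a sketch under stated assumptions, not Solr's behaviour: the stop word set is a made-up stand-in for stopwords.txt, accent folding via Unicode decomposition covers fewer characters than Solr's ASCII folding filter, and the word delimiter, synonym, stemmer and duplicate removal steps are left out entirely.

```python
import unicodedata

# Hypothetical stand-in for the contents of stopwords.txt.
STOPWORDS = {"de", "da", "do", "para", "com"}


def fold_ascii(token):
    """Strip accents by decomposing characters and dropping non-ASCII marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def analyze(text, stopwords=STOPWORDS):
    """Approximate part of the index-time chain on one piece of text."""
    tokens = text.split()                                       # whitespace tokenizer
    tokens = [t for t in tokens if t.lower() not in stopwords]  # stop filter, ignoring case
    tokens = [t.lower() for t in tokens]                        # lowercase filter
    return [fold_ascii(t) for t in tokens]                      # accent folding


print(analyze("Convite de Casamento Rústico"))  # ['convite', 'casamento', 'rustico']
```

Because the same normalization runs at index time and at query time, a search for "rustico" and a search for "Rústico" end up looking for the same term.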

How to declare a Solr index
<field name="id" type="int" indexed="true"
       stored="true" required="true" />
<field name="title" type="text_pt_br"
       indexed="true" stored="true"/>
<field name="description" type="text_pt_br"
       indexed="true" stored="false" />
<field name="tags" type="text_pt_br"
       indexed="true" stored="true" multiValued="true"/>
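As an illustration of what a document matching these field declarations could look like, the sketch below builds one product as a Python dict, checks the two constraints the declarations make explicit (a required integer `id` and a multi-valued `tags` field) and serializes it as JSON. The product values and the helper name `check_product` are invented for this example, and how the payload is sent to Solr is outside the sketch.

```python
import json


def check_product(doc):
    """Check a document against the constraints in the field declarations."""
    if not isinstance(doc.get("id"), int):
        raise ValueError("id is required and must be an int")
    if not isinstance(doc.get("tags", []), list):
        raise ValueError("tags is multiValued, so it must be a list")
    return doc


# Hypothetical product; none of these values come from real data.
product = check_product({
    "id": 1,
    "title": "Abajur de madeira",
    "description": "Abajur artesanal feito de madeira.",
    "tags": ["abajur", "madeira"],
})

payload = json.dumps([product], ensure_ascii=False)
print(payload)
```

Note that `description` is indexed but not stored, so it can be searched but would not come back in search results.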

Infrastructure Upgrade
version 2 - single node
• Scaling issues
• m1.xlarge => m2.2xlarge => c1.xlarge (90% CPU)
• Solr 3.6
• Full indexing with Ruby scripts (a full index takes 3.5 hours)

version 3 - current infrastructure
• 3 m1.xlarge (20% CPU usage) behind an Amazon ELB
• 1 m1.xlarge Search API (staging, serving 50% of logged-in users)
• Solr Data Importer (a full index takes 15 minutes)

Frenzy API
Solr environment evolution
• Operations: Searching, indexing and deleting
• Resources: Products, stores, auto-complete suggestions and categories
• Recommendations
Advantages
• Removing search and indexing logic from marketplace
• Providing a search service to other applications (e.g., mobile)

Example of product operation
Searching
• input (GET): query term
– filters: city, min. price and max. price
– sort: featured, organic, oldest, newest, ...
• output (json)
– metadata (query status, response time and hits)
– list of products
– references (previous and next page urls)
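The request and response described above could look roughly like the sketch below. Everything concrete in it is an assumption made for illustration: the `/frenzy/products` path, the parameter names, the key names and the values in the response. Only the overall shape (query term, filters and sort on input; metadata, products and page references on output) comes from the slide.

```python
from urllib.parse import urlencode


def build_search_url(term, city=None, min_price=None, max_price=None,
                     sort="organic", page=1):
    """Build a GET URL for a product search; path and parameter names are hypothetical."""
    params = {"q": term, "sort": sort, "page": page}
    optional = {"city": city, "min_price": min_price, "max_price": max_price}
    params.update({name: value for name, value in optional.items() if value is not None})
    return "/frenzy/products?" + urlencode(params)


url = build_search_url("abajur", city="curitiba", min_price=10.0, max_price=15.0)
print(url)

# Illustrative response shape only; key names and values are invented.
example_response = {
    "metadata": {"status": "ok", "response_time_ms": 12, "hits": 1},
    "products": [{"id": 1, "title": "Abajur de madeira"}],
    "references": {"previous": None, "next": None},
}
```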

Content recommendation
• Collaborative filtering (user similarity)
• Based on users' favorited products
Input (GET)
• frenzy/users/:id/recommendations
Output: (similar to search output)
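User-based collaborative filtering over favorited products can be sketched as follows. The slide does not say which similarity measure is used, so this example assumes Jaccard similarity between users' favorite sets; the user names, product ids and function names are all invented for illustration.

```python
def jaccard(a, b):
    """Share of favorited products two users have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user_id, favorites, top_n=5):
    """Rank products favorited by similar users that this user has not favorited."""
    mine = favorites[user_id]
    scores = {}
    for other, theirs in favorites.items():
        if other == user_id:
            continue
        similarity = jaccard(mine, theirs)
        if similarity == 0:
            continue  # users with nothing in common contribute nothing
        for product in theirs - mine:
            scores[product] = scores.get(product, 0.0) + similarity
    # Highest score first; ties broken by product id for a stable order.
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_n]


# Hypothetical favorites: user -> set of product ids.
favorites = {
    "ana": {1, 2, 3},
    "bia": {2, 3, 4},
    "caio": {3, 5},
    "dani": {9},
}

print(recommend("ana", favorites))  # [4, 5]
```

Product 4 ranks above product 5 because it comes from the user whose favorites overlap most with the target user's.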

Architecture

Current Scenario
• Experimental stage
• Search operations are being integrated
• 50% of searches by logged-in users go through the API
• The recommendation API is still evolving

Future Work
Content Tracker
We need to understand, track, analyse and take advantage of our users' navigation
patterns.
• Every user receives a unique ID
• This ID follows all of the user's interactions with the website
• Whenever a user interacts with a product (views it, adds it to favorites, shares
it socially, adds it to the cart or buys it), we trigger a conversion action.

SearchID  UserID  Term      pgN  Filters
A376AC    e00c59  "abajur"  1    Nil
A376AD    e00c59  "abajur"  1    "pr:[10.0,15.0]"
A376AE    e00c59  "abajur"  1    "pr:[10.0,15.0] city:curitiba"
Table 1: Search Action logger

ViewID  SearchID  PRDID   PPP
000001  A376AE    201209  1
000002  A37FED    204439  5
000003  EDA342    202234  1
000004  EFDBC1    231324  5
000005  EDA563    214512  2
000006  EFA564    264553  13
Table 2: Product View logger

ActionID  ViewID  type
000001    000001  cart
000002    000002  fav
000003    000005  cart
000004    000004  social
000005    000003  ship
000006    000006  contact
Table 3: Product Action logger

ActionID  convert
000001    true
000002    true
000003    false
000004    false
000005    false
000006    true
Table 4: Action to convert
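The four loggers are linked by their IDs, so a converting action can be traced back to the product view and the search that led to it. The sketch below does that join in plain Python using the rows from Tables 1 to 4. The dict layout and the function name `trace_conversions` are assumptions for this example, and the PPP column is left out because it plays no part in the join.

```python
# Rows copied from Tables 1 to 4; the dict layout is an assumption.
searches = {
    "A376AC": {"user": "e00c59", "term": "abajur", "filters": None},
    "A376AD": {"user": "e00c59", "term": "abajur", "filters": "pr:[10.0,15.0]"},
    "A376AE": {"user": "e00c59", "term": "abajur",
               "filters": "pr:[10.0,15.0] city:curitiba"},
}
views = {
    "000001": {"search": "A376AE", "product": "201209"},
    "000002": {"search": "A37FED", "product": "204439"},
    "000003": {"search": "EDA342", "product": "202234"},
    "000004": {"search": "EFDBC1", "product": "231324"},
    "000005": {"search": "EDA563", "product": "214512"},
    "000006": {"search": "EFA564", "product": "264553"},
}
actions = {
    "000001": {"view": "000001", "type": "cart"},
    "000002": {"view": "000002", "type": "fav"},
    "000003": {"view": "000005", "type": "cart"},
    "000004": {"view": "000004", "type": "social"},
    "000005": {"view": "000003", "type": "ship"},
    "000006": {"view": "000006", "type": "contact"},
}
converted = {"000001": True, "000002": True, "000003": False,
             "000004": False, "000005": False, "000006": True}


def trace_conversions(searches, views, actions, converted):
    """Follow each converting action back to its product view and search term."""
    traced = []
    for action_id, did_convert in converted.items():
        if not did_convert:
            continue
        action = actions[action_id]
        view = views[action["view"]]
        search = searches.get(view["search"])  # Table 1 lists only some searches
        term = search["term"] if search else None
        traced.append((term, view["product"], action["type"]))
    return traced


for row in trace_conversions(searches, views, actions, converted):
    print(row)
# ('abajur', '201209', 'cart')
# (None, '204439', 'fav')
# (None, '264553', 'contact')
```

The first row shows the full chain: the filtered "abajur" search A376AE led to a view of product 201209, which ended in a converting cart action.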

BigData Analytics
• Product conversion per channel
• Consumer behaviour
• Trends
• Better recommendation (including new users)
• Better email marketing (attractiveness)
• Per product stats (Clicks/Impressions/CTR)
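The per-product stats in the last bullet boil down to counting events and dividing: CTR is clicks divided by impressions. A minimal sketch, assuming events arrive as (product id, event kind) pairs; the event data and the function name `product_stats` are made up for illustration.

```python
from collections import Counter


def product_stats(events):
    """Compute clicks, impressions and CTR per product from (product, kind) events."""
    impressions = Counter(product for product, kind in events if kind == "impression")
    clicks = Counter(product for product, kind in events if kind == "click")
    return {
        product: {
            "clicks": clicks[product],
            "impressions": shown,
            "ctr": clicks[product] / shown,
        }
        for product, shown in impressions.items()
    }


# Hypothetical event stream.
events = [
    ("201209", "impression"), ("201209", "impression"),
    ("201209", "impression"), ("201209", "impression"),
    ("201209", "click"),
    ("204439", "impression"), ("204439", "impression"),
]

stats = product_stats(events)
print(stats["201209"]["ctr"])  # 0.25
print(stats["204439"]["ctr"])  # 0.0
```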

Questions?
fmeyer@elo7.com
felipe.besson@elo7.com
