4. 4
SearchHub Details
• Basics:
• 37 Apache Projects registered so far plus LW properties, opensource.com, Stack Overflow
• 130 datasources* including email, GitHub, JIRA*, website and wiki
• Fusion 2.4.2
• Signals everywhere
• UI based on Lucidworks View (work in progress)
• ASF Mail archives mirrored at: http://asfmail.lucidworks.io
5. 5
Goals
• Company:
• “LucidFind” aka SearchHub on Fusion
• Provide backend for LW.com search, including docs and support
• Real, production, living, breathing instance of Fusion that we control
• Fusion best practices demo of major use cases
• CTO Office
• Real data, including clicks
• Platform for machine learning and experimentation
• Demos and talks
6. 6
Agenda
• Quick Intro to Fusion and SearchHub
• Fusion Configuration, UI, Middle Tier
• Data Acquisition
• Deployment
• Signals and Machine Learning
• Next Steps
7. 7
Fusion in a Nutshell
• Drive next generation relevance via Content, Collaboration and Context
• Built on best in class Open Source: Apache Solr + Spark
• Simplify application development and reduce ongoing maintenance
• Access data from anywhere to build intelligent, data-driven applications
9. 9
Fusion Configuration, UI and Middle Tier
• UI
• Derivative of Lucidworks View (https://lucidworks.com/products/view/)
• Deep integration of Snowplow JavaScript Tracker (https://github.com/snowplow/snowplow/wiki/javascript-tracker)
• Python Flask middle tier ($SEARCHHUB_HOME/python)
• Data sources (project_config)
• Pipelines (fusion_config)
• Schedules (fusion_config)
10. 10
Data Acquisition
• Sources:
• ASF Mail archives mirrored at: http://asfmail.lucidworks.io
• Stack Overflow (SO)
• GitHub
• Processing
• Pipelines, including custom stage for parsing mail
• Main Challenges:
• “fail2ban” by the ASF
• Focused crawling of SO — JSoup FTW! (try.jsoup.org)
• Mail Threads
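One way to approach the mail thread challenge is to group messages by the oldest ancestor named in their headers. The sketch below is not the custom pipeline stage mentioned above; it is a simplified illustration that assumes the standard `Message-ID`, `In-Reply-To` and `References` headers are present, and the function names `thread_id` and `group_threads` are hypothetical.

```python
from collections import defaultdict
from email import message_from_string


def thread_id(msg):
    """Return the Message-ID of the presumed thread root for one message."""
    references = (msg.get("References") or "").split()
    if references:
        return references[0]  # by convention the oldest ancestor is listed first
    in_reply_to = (msg.get("In-Reply-To") or "").split()
    if in_reply_to:
        return in_reply_to[0]
    # No ancestors: the message starts its own thread.
    return (msg.get("Message-ID") or "").strip()


def group_threads(raw_messages):
    """Map each thread root id to the subjects of its messages, in input order."""
    threads = defaultdict(list)
    for raw in raw_messages:
        msg = message_from_string(raw)
        threads[thread_id(msg)].append((msg.get("Subject") or "").strip())
    return dict(threads)


# Made-up sample messages, for illustration only.
RAW = [
    "Message-ID: <1@example.org>\nSubject: Faceting question\n\nHow do I facet?",
    "Message-ID: <2@example.org>\nIn-Reply-To: <1@example.org>\n"
    "References: <1@example.org>\nSubject: Re: Faceting question\n\nTry this.",
    "Message-ID: <3@example.org>\nSubject: Release vote\n\n+1",
]
THREADS = group_threads(RAW)
```

Real archives are messier: headers go missing and clients rewrite subjects, so a production stage would likely need fallbacks such as subject matching.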
11. 11
Deployment
• Client and Middle Tier run in a Docker container using Apache HTTPd and mod_wsgi
• Hosted on AWS (m4.2xlarge instances)
• Fusion backend is OOTB 2.4.2 with extra memory for Connectors and Solr
• README has the gory details: https://github.com/lucidworks/searchhub/blob/master/README.md
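For readers unfamiliar with mod_wsgi: by default it loads a Python file and serves the WSGI callable named `application` found in it. A Flask app object is itself such a callable, so the real middle tier would expose that. The sketch below uses only the standard library to stay self-contained; the JSON body and the `call_app` helper are hypothetical and are not taken from the SearchHub repository.

```python
import json
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """Minimal WSGI callable; mod_wsgi looks for this name by default."""
    payload = {"status": "ok", "path": environ.get("PATH_INFO", "/")}
    body = json.dumps(payload).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_app(path):
    """Invoke the callable in-process, the way a WSGI server would."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the required WSGI keys
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    chunks = application(environ, start_response)
    return captured["status"], b"".join(chunks)
```

Calling `call_app("/health")` exercises the callable without Apache, which is handy for checking the middle tier before building the container.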
12. 12
Signals
• UI is fully instrumented, using Snowplow JavaScript Tracker, for most user interactions. See SnowplowService.js
• Captures, amongst other things:
• User Id, Session Id, Unique Query Id, IP address, Location, Timing data
• Actions tracked:
• Page View
• Page Ping (heartbeat) every 30 seconds
• Search with query, displayed doc list and displayed facet list
• Clicks with query, doc id, position, score and query UUID
• Typeahead Clicks with characters typed and suggestions offered
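To make the click action above concrete, here is a rough sketch of what one click signal might look like once it reaches the backend. The field names (`type`, `timestamp`, `params`, `docId` and so on) are assumptions for illustration; the actual payload is defined by SnowplowService.js and the Fusion signals API, so check those before relying on any of these names.

```python
import json
import uuid
from datetime import datetime, timezone


def click_signal(user_id, session_id, query, query_id, doc_id, position, score):
    """Build a click event carrying the fields listed on the slide (names assumed)."""
    return {
        "type": "click",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "params": {
            "userId": user_id,
            "sessionId": session_id,
            "query": query,
            "queryId": query_id,   # ties the click back to the search that produced it
            "docId": doc_id,
            "position": position,  # rank of the clicked doc in the result list
            "score": score,
        },
    }


# Example: the user clicked the third result for "solr faceting".
QUERY_ID = str(uuid.uuid4())
SIGNAL = click_signal("u-1", "s-1", "solr faceting", QUERY_ID, "doc-42", 3, 7.25)
SIGNAL_JSON = json.dumps([SIGNAL])  # serialized as a JSON array of events
```

Keeping the query UUID on both the search and the click events is what makes it possible to join them later when computing click-through statistics.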
13. 13
Machine Learning
• Fusion makes it easy to “round-trip” ML data/models between Spark and Solr
• Examples of:
• Recommenders
• Spark Lucene tokenization
• k-Means
• Word2Vec
• Topic Detection (LDA)
• Random Forests Classifier
• Many examples in SparkShellHelpers.scala
14. 14
Experiment Management and Bandits
• Goal: Experimentation, not hard-coded rules*
• Goal: Drive down the cost of experimentation
• “A/B testing on steroids”
• Exploration vs. Exploitation
• Fusion 3.0 (beta):
• Record and calculate relevance metrics from within Fusion (gold standard, TREC, other)
• Easily calculate MRR, NDCG, Precision, Recall and report over time
• Support for Bandits: Epsilon-Greedy, SoftMax, UCB1
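The metrics and the bandit idea above are small enough to sketch directly. The code below is a generic textbook illustration, not Fusion's implementation: it computes MRR and NDCG from ranked result lists and shows an epsilon-greedy bandit choosing between experiment variants. All names (`EpsilonGreedy`, the arm labels, the sample judgments) are hypothetical.

```python
import math
import random


def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant document, or 0.0 if none was returned."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg(ranked_ids, gains, k=10):
    """DCG of the ranking divided by the DCG of the ideal ordering, at cutoff k."""
    dcg = sum(gains.get(d, 0.0) / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


class EpsilonGreedy:
    """Explore a random arm with probability epsilon, else exploit the best so far."""

    def __init__(self, arms, epsilon=0.1, rng=None):
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        self.counts = {arm: 0 for arm in arms}
        self.values = {arm: 0.0 for arm in arms}  # running mean reward per arm

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.counts))    # exploration
        return max(self.values, key=self.values.get)     # exploitation

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean: avoids storing every observed reward.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


# Two query pipeline variants compete; reward is 1.0 for a click, 0.0 otherwise.
bandit = EpsilonGreedy(["pipeline_a", "pipeline_b"], epsilon=0.1)
arm = bandit.select()
bandit.update(arm, reward=1.0)
```

Compared with a fixed A/B split, the bandit shifts traffic toward the better variant while the experiment is still running, which is the exploration versus exploitation trade-off named above.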