How to Spend Your Summer Keeping it Real with SearchHub

October 11-14, 2016 • Boston, MA

SearchHub: How to Spend Your Summer Keeping it Real
Grant Ingersoll
CTO, Lucidworks

SearchHub Demo
github.com/lucidworks/searchhub
http://searchhub.lucidworks.com

SearchHub Details
• Basics:
• 37 Apache Projects registered so far plus LW properties, opensource.com, Stack Overflow
• 130 datasources* including email, GitHub, JIRA*, Website and Wiki
• Fusion 2.4.2
• Signals everywhere
• UI based on View (work not complete)
• ASF Mail archives mirrored at: http://asfmail.lucidworks.io

Goals
• Company:
• “LucidFind” aka SearchHub on Fusion
• Provide backend for LW.com search, including docs and support
• Real, production, living, breathing instance of Fusion that we control
• Fusion best practices demo of major use cases
• CTO Office:
• Real data, including clicks
• Platform for machine learning and experimentation
• Demos and talks

Agenda
• Quick Intro to Fusion and SearchHub
• Fusion Configuration, UI, Middle Tier
• Data Acquisition
• Deployment
• Signals and Machine Learning
• Next Steps

Fusion in a Nutshell
• Drive next generation relevance via Content, Collaboration and Context
• Built on best in class Open Source: Apache Solr + Spark
• Simplify application development and reduce ongoing maintenance
• Access data from anywhere to build intelligent, data-driven applications.

Fusion
[Architecture diagram] Components shown:
• Security built in
• REST API, Admin UI, Lucidworks View
• Core services: Connectors, Pipelines, Scheduling, Alerting / Messaging, Blob Storage, Recommenders / Signals, NLP
• Apache Solr (shards)
• Apache Spark (cluster manager, workers)
• Apache ZooKeeper (ZK 1 ... ZK N): leader election, load balancing, shared config management
• HDFS (optional)
• Data sources: logs, file, web, database, cloud, Hadoop

Fusion Configuration, UI and Middle Tier
• UI
• Derivative of Lucidworks View (https://lucidworks.com/products/view/)
• Deep integration of Snowplow Javascript Tracker (https://github.com/snowplow/snowplow/wiki/javascript-tracker)
• Python Flask middle tier ($SEARCHHUB_HOME/python)
• Data sources (project_config)
• Pipelines (fusion_config)
• Schedules (fusion_config)

Data Acquisition
• Sources:
• ASF Mail archives mirrored at: http://asfmail.lucidworks.io
• Stack Overflow (SO)
• GitHub
• Processing
• Pipelines, including custom stage for parsing mail
• Main Challenges:
• “fail2ban” by the ASF
• Focused crawling of SO — JSoup FTW! (try.jsoup.org)
• Mail Threads
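
Mail threading is listed above as a main challenge. As a rough illustration only (not the custom pipeline stage SearchHub actually uses), the sketch below groups raw messages into threads with a simple heuristic: the first entry of the `References` header, else `In-Reply-To`, else the message's own `Message-ID`. The function names and sample messages are hypothetical.

```python
from collections import defaultdict
from email import message_from_string


def thread_key(raw_message):
    """Return (message_id, thread_id) for one raw RFC 5322 message.

    Heuristic (an assumption, not SearchHub's logic): the thread id is the
    oldest ancestor we can see in the headers.
    """
    msg = message_from_string(raw_message)
    message_id = (msg.get("Message-ID") or "").strip()
    references = (msg.get("References") or "").split()
    in_reply_to = (msg.get("In-Reply-To") or "").strip()
    if references:
        thread_id = references[0]   # first reference is the thread root
    elif in_reply_to:
        thread_id = in_reply_to     # fall back to the direct parent
    else:
        thread_id = message_id      # no ancestry: the message starts a thread
    return message_id, thread_id


def group_threads(raw_messages):
    """Map thread_id -> list of message ids, in input order."""
    threads = defaultdict(list)
    for raw in raw_messages:
        message_id, thread_id = thread_key(raw)
        threads[thread_id].append(message_id)
    return dict(threads)


# Hypothetical sample messages for illustration.
ROOT = "Message-ID: <a@example.org>\nSubject: Faceting\n\nHow do I facet?\n"
REPLY = (
    "Message-ID: <b@example.org>\nIn-Reply-To: <a@example.org>\n"
    "References: <a@example.org>\nSubject: Re: Faceting\n\nTry facet.field.\n"
)
THREADS = group_threads([ROOT, REPLY])
```

Real archives often have missing or broken headers, so a production approach would likely also need subject-based fallbacks.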

Deployment
• Client and Middle Tier run in a Docker container using Apache HTTPd and mod_wsgi
• Hosted on AWS (m4.2xlarge instances)
• Fusion backend is OOTB 2.4.2 with extra memory for Connectors and Solr
• README has the gory details: https://github.com/lucidworks/searchhub/blob/master/README.md
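
For readers unfamiliar with mod_wsgi: by default it loads a script file and looks for a callable named `application`. The sketch below shows that contract with a bare standard-library WSGI callable. It is not the SearchHub middle tier (which is a Flask app), and the `/health` route is a hypothetical example.

```python
import json


def application(environ, start_response):
    """Minimal WSGI entry point; `application` is the name mod_wsgi expects by default."""
    path = environ.get("PATH_INFO", "/")
    if path == "/health":  # hypothetical route, for illustration only
        status, payload = "200 OK", {"status": "ok"}
    else:
        status, payload = "404 Not Found", {"error": "not found", "path": path}
    body = json.dumps(payload).encode("utf-8")
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]
```

A Flask application object is itself a WSGI callable, so a Flask-based script would typically just expose the app under the name `application`.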

Signals
• UI is fully instrumented, using Snowplow Javascript Tracker, for most user interactions. See SnowplowService.js
• Captures, amongst other things:
• User Id, Session Id, Unique Query Id, IP address, Location, Timing data
• Actions tracked:
• Page View
• Page Ping (heartbeat) every 30 seconds
• Search with query, displayed doc list and displayed facet list
• Clicks with query, doc id, position, score and query UUID
• Typeahead Clicks with characters typed and suggestions offered
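
To make the click signal concrete, here is a minimal sketch of what one such event might carry and how raw clicks could be rolled up per query and document. The class, field names and aggregation are assumptions for illustration, not the Snowplow or Fusion signal schema.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ClickSignal:
    """One click event; field names are illustrative, not the real schema."""
    query: str
    doc_id: str
    position: int   # 1-based rank of the clicked result
    score: float    # score of the clicked result
    query_id: str   # UUID tying the click back to the search that produced it


def new_query_id():
    """Generate a unique id to attach to a search and its later clicks."""
    return str(uuid.uuid4())


def aggregate_clicks(signals):
    """Roll raw clicks up to (query, doc_id) -> (click count, mean position)."""
    counts = defaultdict(int)
    position_sums = defaultdict(int)
    for s in signals:
        key = (s.query.strip().lower(), s.doc_id)  # light query normalization
        counts[key] += 1
        position_sums[key] += s.position
    return {key: (counts[key], position_sums[key] / counts[key]) for key in counts}
```

Aggregates of this shape are a common input to click-based boosting and recommenders, which is why the query UUID and position are worth capturing at click time.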

Machine Learning
• Fusion makes it easy to “round-trip” ML data/models between Spark and Solr
• Examples of:
• Recommenders
• Spark Lucene tokenization
• k-Means
• Word2Vec
• Topic Detection (LDA)
• Random Forests Classifier
• Many examples in SparkShellHelpers.scala

Experiment Management and Bandits
Get Started
• Goal: Experimentation, not hard coded rules*
• Goal: Drive down the cost of experimentation
• “A/B testing on steroids”
• Exploration vs. Exploitation
• Fusion 3.0 (beta):
• Record and calculate relevance metrics from within Fusion (gold standard, TREC, other)
• Easily calculate MRR, NDCG, Precision, Recall and report over time
• Support for Bandits: Epsilon-Greedy, SoftMax, UCB1
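
The exploration vs. exploitation idea can be sketched in a few lines. The example below is a generic illustration, not Fusion's implementation: each "arm" stands for a hypothetical query pipeline variant, the reward is the reciprocal rank of the clicked result (so an arm's mean reward is its MRR), and selection uses either epsilon-greedy or UCB1.

```python
import math
import random


def reciprocal_rank(click_position):
    """Reward for one query: 1/rank of the clicked result, 0 if there was no click."""
    return 1.0 / click_position if click_position else 0.0


class Bandit:
    """Tracks pull counts and mean reward per arm (e.g. per pipeline variant)."""

    def __init__(self, arms):
        self.counts = {arm: 0 for arm in arms}
        self.means = {arm: 0.0 for arm in arms}

    def update(self, arm, reward):
        """Incrementally update the running mean reward for an arm."""
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

    def select_epsilon_greedy(self, epsilon, rng=random):
        """Explore a random arm with probability epsilon, else exploit the best mean."""
        if rng.random() < epsilon:
            return rng.choice(list(self.counts))
        return max(self.means, key=self.means.get)

    def select_ucb1(self):
        """Pick the arm with the highest mean plus UCB1 confidence bonus."""
        for arm, n in self.counts.items():
            if n == 0:
                return arm  # try every arm once before using the bound
        total = sum(self.counts.values())

        def upper_bound(arm):
            bonus = math.sqrt(2.0 * math.log(total) / self.counts[arm])
            return self.means[arm] + bonus

        return max(self.counts, key=upper_bound)
```

In use, each incoming query would be routed to the selected arm, and `update` would be called with `reciprocal_rank(position)` once the click (or lack of one) is observed.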

Still Hungry?
• “Combining Content and Collaboration in Recommenders” by Jake Mannix: Friday at 1:10 pm, http://sched.co/7amt
• https://github.com/lucidworks/searchhub
• http://searchhub.lucidworks.com
• Email: grant@lucidworks.com
• Twitter: @gsingers
• Web: http://lucidworks.com
