Hacking Lucene and Solr for Fun and Profit
Grant Ingersoll
CTO, LucidWorks,
grant@lucidworks.com, @gsingers
3. Keyword Search is so yesterday
• Search is a system building block
– text is only a part of the story
• If the algorithms fit, use them!
• Embrace fuzziness!
• Scoring features are everywhere
4. Lucene and Solr can do…
• Classic: Fast, fuzzy text matching across a large document collection
• Data Quality and Analysis
– Faceting, slicing and dicing of numerical/enumerated data
– Spatial
– Spell checking, record linkage, highlighting
– Stats, Missing fields, etc.
• Top N problems
5. Topics
• Search Hacks
• “Trust me, I’m a mathematician”
• “I wish I had thought of that” Hack
10. Analysis
• Split into sentences
– Buffer tokens; see com.tamingtext.texttamer.solr.SentenceTokenizer
• Identify Names using OpenNLP
• Add Entity marker tokens at the same position as original token
– Could also be done with Payloads
• Index
• https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/texttamer/solr
• https://github.com/tamingtext/book/blob/master/apache-solr/solrqa/conf/schema.xml
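The "marker token at the same position" idea can be sketched in plain Python. This is only an illustration, not the code from the linked repository: the `NE_` prefix, the `Token` class, and the helper names are hypothetical, and the entity labels are supplied by hand instead of coming from OpenNLP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    term: str
    position: int  # token position within the sentence


def add_entity_markers(words, entities):
    """Emit one token per word, plus a marker token at the SAME position
    for every word that was tagged as part of a named entity.

    words:    list of strings (one sentence, already tokenized)
    entities: dict mapping word index -> entity type (hand-labeled here)
    """
    tokens = []
    for pos, word in enumerate(words):
        tokens.append(Token(word.lower(), pos))
        if pos in entities:
            # Hypothetical marker format; stacked on the original token.
            tokens.append(Token("NE_" + entities[pos], pos))
    return tokens


def build_index(tokens):
    """Tiny positional inverted index: term -> list of positions."""
    index = {}
    for tok in tokens:
        index.setdefault(tok.term, []).append(tok.position)
    return index


words = ["Louis", "XIV", "ruled", "France"]
entities = {0: "person", 1: "person", 3: "location"}
index = build_index(add_entity_markers(words, entities))
```

Because the marker shares a position with the word it annotates, a positional query can ask for "a person near these keywords" without knowing which person.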
11. Search Side
• Custom Query Parser takes in user’s natural language query, classifies it to find the Answer Type and generates Solr query
• Retrieve candidate passages that match keywords and expected answer type
• Unlike keyword search, we need to know exactly where matches occur
• https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/qa
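Why match positions matter can be shown with a small stand-in for candidate passage retrieval. This sketch assumes a positional index shaped like `term -> [positions]` and a hypothetical `NE_person` marker term; the function name and the `slop` parameter are illustrative and are not taken from the linked query parser.

```python
def candidate_windows(index, keywords, answer_marker, slop=5):
    """Find positions of the expected answer type that have at least one
    query keyword within `slop` positions.

    Returns a list of (marker_position, sorted_matched_keywords).
    """
    results = []
    for marker_pos in index.get(answer_marker, []):
        matched = sorted(
            kw
            for kw in keywords
            if any(abs(p - marker_pos) <= slop for p in index.get(kw, []))
        )
        if matched:
            results.append((marker_pos, matched))
    return results


# Toy positional index for one passage (hand-built for illustration).
index = {
    "louis": [0],
    "xiv": [1],
    "NE_person": [0, 1, 20],
    "sun": [3],
    "king": [4],
}
hits = candidate_windows(index, ["sun", "king"], "NE_person", slop=5)
```

The person marker at position 20 is dropped because no keyword occurs near it, which a plain keyword match over the whole passage could not tell apart.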
12. Answer Type Classification
• Answer Type examples:
– Person (P), Location (L), Organization (O), Time Point (T), Duration (R), Money (M)
– See page 248 for more
• Train an OpenNLP classifier off of a set of previously annotated questions, e.g.:
– P Which French monarch reinstated the divine right of the monarchy to France and was known as `The Sun King' because of the splendour of his reign?
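Before any training, annotated lines like the one above have to be split into a label and a question. A minimal sketch, assuming the format is "one-letter code, space, question text" as the example suggests; the function name is hypothetical and only the six codes listed on the slide are handled.

```python
# Codes as listed on the slide; the full set is larger.
ANSWER_TYPES = {
    "P": "Person",
    "L": "Location",
    "O": "Organization",
    "T": "Time Point",
    "R": "Duration",
    "M": "Money",
}


def parse_annotated_question(line):
    """Split 'CODE question text' into (answer type name, question)."""
    code, _, question = line.strip().partition(" ")
    if code not in ANSWER_TYPES:
        raise ValueError("unknown answer type code: %r" % code)
    return ANSWER_TYPES[code], question.strip()


sample = (
    "P Which French monarch reinstated the divine right of the monarchy "
    "to France and was known as `The Sun King' because of the splendour "
    "of his reign?"
)
label, question = parse_annotated_question(sample)
```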
15. kNN and TF/IDF Classification w/ Lucene
https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/classifier/mlt
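The general idea of kNN classification over TF/IDF scores can be sketched without Lucene. This is an illustration under simplifying assumptions, not the linked implementation: documents are pre-tokenized lists, similarity is cosine over TF/IDF weights with a smoothed IDF of my choosing, and the label is a majority vote among the top k matches.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (list of weight dicts, idf dict)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / d) + 1.0 for t, d in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query_tokens, docs, labels, k=3):
    """Label the query by majority vote among its k most similar documents."""
    vectors, idf = tfidf_vectors(docs)
    # Terms unseen in the collection carry no weight.
    query = {t: tf * idf[t] for t, tf in Counter(query_tokens).items() if t in idf}
    ranked = sorted(
        range(len(docs)), key=lambda i: cosine(query, vectors[i]), reverse=True
    )[:k]
    votes = Counter(labels[i] for i in ranked)
    return votes.most_common(1)[0][0]


docs = [
    ["solr", "search", "index"],
    ["lucene", "index", "query"],
    ["soccer", "goal", "match"],
    ["tennis", "match", "serve"],
]
labels = ["tech", "tech", "sports", "sports"]
```

In a search engine the "rank by similarity" step is just a query, which is why an existing index can double as a classifier.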
16. Lucene Classification Module
• Builds classifier off of index information
• See the org.apache.lucene.classification package
• Naïve Bayes Classifier
• kNN Classifier
• Perceptron Classifier
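To make "builds classifier off of index information" concrete, here is a rough Naïve Bayes sketch that uses only the kind of counts an index already holds (documents per class, term counts per class). It does not use or mirror the Lucene module's API; the function names and the add-one smoothing are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def train_counts(docs, labels):
    """Collect per-class document counts, per-class term counts, and vocabulary."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)
    vocab = {t for doc in docs for t in doc}
    return class_docs, term_counts, vocab


def nb_classify(tokens, class_docs, term_counts, vocab):
    """Multinomial Naive Bayes with add-one smoothing, scored in log space."""
    total_docs = sum(class_docs.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_docs):
        score = math.log(class_docs[label] / total_docs)
        total_terms = sum(term_counts[label].values())
        for t in tokens:
            if t not in vocab:
                continue  # ignore terms never seen in the index
            score += math.log((term_counts[label][t] + 1) / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


docs = [
    ["solr", "search", "index"],
    ["lucene", "index", "query"],
    ["soccer", "goal", "match"],
    ["tennis", "match", "serve"],
]
labels = ["tech", "tech", "sports", "sports"]
model = train_counts(docs, labels)
```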
17. Recommenders
• Cross recommendation as search
– with search used to build cross recommendation!
• Recommend content to people who exhibit certain behaviors (clicks, query terms, other)
• (Ab)use of a search engine
– but not as a search engine for content
– more like a search engine for behavior
• See Ted Dunning’s talk from Berlin Buzzwords on Multi-modal Recommendation Algorithms
– http://berlinbuzzwords.com/sessions/multi-modal-recommendation-algorithms
• Go get Mahout/Myrrix or just do it in y(our) search engine
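"A search engine for behavior" can be sketched as follows: each item gets an indicator field listing the items that co-occur with it in user histories, and a user's own history is the query against that field. This is a toy illustration with hypothetical names; raw co-occurrence counts stand in for whatever significance test a real system would apply when choosing indicators.

```python
from collections import Counter, defaultdict


def build_indicators(histories):
    """histories: dict user -> set of item ids.
    Returns item -> Counter of co-occurring items (the indicator 'field')."""
    indicators = defaultdict(Counter)
    for items in histories.values():
        for item in items:
            for other in items:
                if other != item:
                    indicators[item][other] += 1
    return indicators


def recommend(user_items, indicators, top_n=3):
    """Use the user's history as a query against every item's indicators."""
    scores = {}
    for candidate, cooccurring in indicators.items():
        if candidate in user_items:
            continue  # do not recommend what the user already has
        score = sum(cooccurring[item] for item in user_items)
        if score > 0:
            scores[candidate] = score
    # Highest score first; ties broken by item id for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


histories = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b"},
    "u3": {"b", "d"},
}
indicators = build_indicators(histories)
```

The scoring loop is what the search engine would do for free once the indicator field is indexed: match query terms against a field and rank by score.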
22. Time Space Continuum
• Leverage Solr’s new spatial capabilities to index non-spatial data, such as time ranges
– Useful for Open Hours, Shifts, etc.
• Key: multi-valued range data
• Query using rectangle intersections
– q = shift:"Intersects(0 19 23 365)"
• Credits to David Smiley and Hoss…
– https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/
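One way to read the rectangle query is sketched below in plain Python. It assumes each range is stored as a 2D point (x = start, y = end), so that "ranges overlapping [19, 23] on a 0..365 scale" becomes the rectangle `(0, 19, 23, 365)`. That reading of the slide's query, along with all function names and the sample data, is an assumption rather than something the slide states.

```python
def to_point(start, end):
    """Encode a range as a 2D point: x = start, y = end."""
    return (start, end)


def overlap_rectangle(q_start, q_end, min_value, max_value):
    """Rectangle (min_x, min_y, max_x, max_y) containing every point whose
    range overlaps [q_start, q_end]: start <= q_end and end >= q_start."""
    return (min_value, q_start, q_end, max_value)


def intersects(point, rect):
    x, y = point
    min_x, min_y, max_x, max_y = rect
    return min_x <= x <= max_x and min_y <= y <= max_y


def matching_docs(docs, rect):
    """docs: dict id -> list of (start, end) ranges (multi-valued field).
    A document matches if ANY of its ranges falls inside the rectangle."""
    return sorted(
        doc_id
        for doc_id, ranges in docs.items()
        if any(intersects(to_point(s, e), rect) for s, e in ranges)
    )


# Hypothetical shift data on a 0..365 scale.
docs = {
    "alice": [(8, 12), (20, 30)],
    "bob": [(1, 18)],
    "carol": [(24, 40)],
}
rect = overlap_rectangle(19, 23, 0, 365)
```

Under this encoding an overlap test on ranges turns into a point-in-rectangle test, which is the kind of query a spatial index answers directly, and multi-valued fields let one document carry several ranges.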