2. About me
● Bryan Warner - Developer @Traackr
○ bwarner@traackr.com
● I've worked with ElasticSearch since early 2012 ...
before that, I worked with Lucene & Solr
● Primary background is in Java back-end development
● Shifted focus to Scala development over the past year
3. About Traackr
● Influencer search engine
● We track content daily & in real-time for our database of
influential people
● We leverage ElasticSearch parent/child (top-children)
queries to search content (i.e. the children) to surface
the influencers who've authored it (i.e. the parents)
● Some of our back-end stack includes: ElasticSearch,
MongoDB, Java/Spring, Scala/Akka, etc.
4. Overview
● Indexing / Querying strategies to support language-
targeted searches within ES
● ES Analyzers / TokenFilters for language analysis
● Custom Analyzers / TokenFilters for ES
● Look at some open-source projects that assist in
language detection & analysis
5. Use Case
● We have a database of articles written in many
languages
● We want our users to be able to search articles written
in a particular language
● We want that search to handle the nuances for that
particular language
7. Indexing Strategies
Separate indices per language
- OR -
Same index for all languages
8. Indexing Strategies
Separate Indices per language
PROS
■ Clean separation
■ Truer IDF values
○ IDF = log(numDocs/(docFreq+1)) + 1
CONS
■ Increased Overhead
■ Parent/Child queries -> parent document duplication
○ Same problem for Solr Joins
■ Maintain schema per index
9. Indexing Strategies
Same index for all languages
PROS
■ One index to maintain (and one schema)
■ Parent/Child queries are fine
CONS
■ Schema complexity grows
■ IDF values might be skewed
10. Indexing Strategies
Same index for all languages ... how?
1. Create different "mapping" types per language
a. At indexing time, we set the right mapping based on
the article's language
2. Create different fields per language-analyzed field
a. At indexing time, we populate the correct text field
based on the article's language
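A sketch of what strategy 2 might look like in the index mapping, using ES's built-in language analyzers (the field names text_en / text_fr / text_de match the query examples that follow):

```json
"mappings": {
  "article": {
    "properties": {
      "text_en": { "type": "string", "analyzer": "english" },
      "text_fr": { "type": "string", "analyzer": "french" },
      "text_de": { "type": "string", "analyzer": "german" }
    }
  }
}
```

At indexing time, only the field matching the detected language is populated; the others are simply absent from the document.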
13. Querying Strategies
How do we execute a language-targeted search?
... all based on our indexing strategy.
14. Querying Strategies
(1) Separate Indices per language
...
String targetIndex = getIndexForLanguage(languageParam);
SearchRequestBuilder request = client.prepareSearch(targetIndex)
.setTypes("article");
QueryStringQueryBuilder query = QueryBuilders.queryString(
"boston elasticsearch");
query.field("text");
query.analyzer("english"); // or "french", "german" - pick one
request.setQuery(query);
SearchResponse searchResponse = request.execute().actionGet();
...
15. Querying Strategies
(2a) Same index for all languages - Diff. mappings
...
String targetMapping = getMappingForLanguage(languageParam);
SearchRequestBuilder request = client.prepareSearch("your_index")
.setTypes(targetMapping);
QueryStringQueryBuilder query = QueryBuilders.queryString(
"boston elasticsearch");
query.field("text");
query.analyzer("english"); // or "french", "german" - pick one
request.setQuery(query);
SearchResponse searchResponse = request.execute().actionGet();
...
16. Querying Strategies
(2b) Same index for all languages - Diff. fields
...
SearchRequestBuilder request = client.prepareSearch("your_index")
.setTypes("article");
QueryStringQueryBuilder query = QueryBuilders.queryString(
"boston elasticsearch");
query.field("text_en"); // or "text_fr", "text_de" - pick one
query.analyzer("english"); // or "french", "german" - pick one
request.setQuery(query);
SearchResponse searchResponse = request.execute().actionGet();
...
17. Querying Strategies
● Will these strategies support a multi-language search?
○ E.g. Search by French and German
○ E.g. Search against all languages
● Yes! *
● In the same SearchRequest:
○ We can search against multiple indices
○ We can search against multiple "mapping" types
○ We can search against multiple fields
* Need to give thought to which query analyzer to use
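For instance, with strategy (2b) a single query_string search can span all the language fields at once (field names as in the earlier examples; the query analyzer question above still applies, since one analyzer is used for the whole query string):

```json
{
  "query": {
    "query_string": {
      "query": "boston elasticsearch",
      "fields": ["text_en", "text_fr", "text_de"]
    }
  }
}
```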
18. Language Analysis
● What do ElasticSearch and/or Lucene offer us for
analyzing various languages?
● Is there a one-size-fits-all solution?
○ e.g. StandardAnalyzer
● Or do we need custom analyzers for each language?
19. Language Analysis
StandardAnalyzer - The Good
● For many languages (French, Spanish), it will get you
95% of the way there
● Each language analyzer provides its own flavor to the
StandardAnalyzer
● FrenchAnalyzer
○ Adds an ElisionFilter (l'avion -> avion)
○ Adds French StopWords filter
○ FrenchLightStemFilter
20. Language Analysis
StandardAnalyzer - The Bad
● For some languages, it will get you 2/3 of the way there
● German has a heavy use of compound words
■ das Vaterland => The fatherland
■ Rechtsanwaltskanzleien => Law Firms
● For best search results, these compound words should
produce index terms for their individual parts
● GermanAnalyzer lacks a Word Compound Token Filter
21. Language Analysis
StandardAnalyzer - The Ugly
● For other languages (e.g. Asian languages), it will not
get you far
● Using a Standard Tokenizer to extract tokens from
Chinese text will not produce accurate terms
○ Some 3rd-party Chinese analyzers will extract
bigrams from Chinese text and index those as if they
were words
● Need to do your research
22. Language Analysis
You should also know about...
● ASCII Folding Token Filter
○ über => uber
● ICU Analysis Plugin
○ http://www.elasticsearch.org/guide/reference/index-modules/analysis/icu-plugin.html
○ Allows for unicode normalization, collation and
folding
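With the plugin installed, its icu_folding token filter can be dropped into a custom analyzer like any other filter. A sketch (the analyzer name folded_text is ours; icu_folding is the filter name the plugin provides):

```json
"analysis": {
  "analyzer": {
    "folded_text": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": ["icu_folding", "lowercase"]
    }
  }
}
```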
23. Custom Analyzer / Token Filter
● Let's create a custom analyzer definition for German
text (e.g. remove stemming)
● How do we go about doing this?
○ One way is to leverage ElasticSearch's flexible
schema definitions
25. Custom Analyzer / Token Filter
Create a custom German analyzer in our schema:
"settings" : {
....
"analysis":{
"analyzer":{
"custom_text_german":{
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "lowercase"], stop words, german normalization?
}
}
....
}
}
26. Custom Analyzer / Token Filter
1. Declare schema filter for german stop_words
2. We'll also need to create a custom TokenFilter class to wrap Lucene's
org.apache.lucene.analysis.de.GermanNormalizationFilter
a. It does not come as a pre-defined ES TokenFilter
b. German text needs certain characters normalized, e.g.
'ae' and 'oe' are replaced by 'a' and 'o', respectively
3. Declare schema filter for custom GermanNormalizationFilter
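Putting steps 1 and 3 together, the settings might look like this (the stop filter uses ES's built-in _german_ stop-word list; german_normalization is assumed here to be the name our custom TokenFilter registers under):

```json
"analysis": {
  "filter": {
    "german_stop": {
      "type": "stop",
      "stopwords": "_german_"
    },
    "german_normalization": {
      "type": "german_normalization"
    }
  },
  "analyzer": {
    "custom_text_german": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": ["standard", "lowercase", "german_stop", "german_normalization"]
    }
  }
}
```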
30. OS Projects
Language Detection
● https://code.google.com/p/language-detection/
○ Written in Java
○ Provides language profiles with unigram, bigram, and trigram
character frequencies
○ Detector provides accuracy % for each language detected
PROS
■ Very fast (~4k pieces of text per second)
■ Very reliable for text longer than 30-40 characters
CONS
■ Unreliable & inconsistent for small text samples (<30 characters), e.g.
short tweets
31. OS Projects
German Word Decompounder
● https://github.com/jprante/elasticsearch-analysis-decompound
● Lucene offers two compound word token filters: a dictionary-based
and a hyphenation-based variant
○ Not bundled with Lucene due to licensing issues
○ Both require loading a word list into memory before they run
● The decompounder instead uses prebuilt Compact Patricia Tries from the
ASV toolbox for efficient word segmentation
○ ASV Toolbox project - http://wortschatz.uni-leipzig.de/~cbiemann/software/toolbox/index.htm
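Once the plugin is installed, the decompounder is wired in like any other token filter. A sketch, assuming the filter type name decompound from the plugin's documentation (the filter and analyzer names german_decompound / german_compound_text are ours):

```json
"analysis": {
  "filter": {
    "german_decompound": {
      "type": "decompound"
    }
  },
  "analyzer": {
    "german_compound_text": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": ["lowercase", "german_decompound"]
    }
  }
}
```

With this in place, a compound like Rechtsanwaltskanzleien can also produce index terms for its parts, which is exactly what the GermanAnalyzer was missing.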