Language Search
ElasticSearch Boston Meetup - 3/27
       Bryan Warner - Traackr
About me
● Bryan Warner - Developer @Traackr
  ○ bwarner@traackr.com

● I've worked with ElasticSearch since early 2012 ...
  before that I had worked with Lucene & Solr

● Primary background is in Java back-end development

● Shifting focus to Scala development over the past year
About Traackr
● Influencer search engine

● We track content daily & in real-time for our database of
  influential people

● We leverage ElasticSearch parent/child (top-children)
  queries to search content (i.e. the children) to surface
  the influencers who've authored it (i.e. the parents)

● Some of our back-end stack includes: ElasticSearch,
  MongoDb, Java/Spring, Scala/Akka, etc.
Overview
● Indexing / Querying strategies to support language-
  targeted searches within ES

● ES Analyzers / TokenFilters for language analysis

● Custom Analyzers / TokenFilters for ES

● Look at some OS projects that assist in language
  detection & analysis
Use Case
● We have a database of articles written in many
  languages

● We want our users to be able to search articles written
  in a particular language

● We want that search to handle the nuances for that
  particular language
Reference Schema
{
    "settings" : {
      "index": {
        "number_of_shards" : 6, "number_of_replicas" : 1
      },
      "analysis":{
        "analyzer": {}, "tokenizer": {}, "filter":{}
      }
    },
    "mappings": {
      "article": {
        "text" : {"type" : "string", "analyzer":"standard", "store":true},
        "author:" {"type" : "string", "analyzer":"simple", "store": true},
        "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true}
      }
    }
}
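(Not part of the original deck.) As a minimal sketch in the same fragment style as the query examples later on, the reference schema can be applied when creating the index via the ES 0.90-era Java client. Here "client" is an org.elasticsearch.client.Client and "schemaJson" is assumed to hold the JSON above as a String; "your_index" is a placeholder name.
...
// Sketch: create the index from the reference schema above.
// CreateIndexResponse = org.elasticsearch.action.admin.indices.create.CreateIndexResponse
CreateIndexResponse createResponse = client.admin().indices()
      .prepareCreate("your_index")
      .setSource(schemaJson)        // source string carries both the "settings" and "mappings" sections
      .execute().actionGet();
// createResponse.isAcknowledged() indicates whether the request was accepted
...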
Indexing Strategies



      Separate indices per language
                  - OR -
       Same index for all languages
Indexing Strategies
Separate Indices per language
PROS
■ Clean separation
■ Truer IDF values
  ○ IDF = log(numDocs/(docFreq+1)) + 1

CONS
■ Increased Overhead
■ Parent/Child queries -> parent document duplication
   ○ Same problem for Solr Joins
■ Maintain schema per index
Indexing Strategies
Same index for all languages
PROS
■ One index to maintain (and one schema)
■ Parent/Child queries are fine

CONS
■ Schema complexity grows
■ IDF values might be skewed
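A small worked example (not in the original deck; the numbers are invented purely for illustration) of why IDF can skew when all languages share one index, using the IDF formula from the previous slide:

Separate French index:        numDocs = 100,000   docFreq("avion") = 5,000
    IDF = log(100000 / (5000 + 1)) + 1 ≈ 4.0

Combined index (FR+EN+DE):    numDocs = 300,000   docFreq("avion") = 5,000
    IDF = log(300000 / (5000 + 1)) + 1 ≈ 5.1

The French-only term looks "rarer" than it really is within French content, so
relevance scores shift compared to a dedicated French index.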
Indexing Strategies
Same index for all languages ... how?
1. Create different "mapping" types per language
   a. At indexing time, we set the right mapping based on
      the article's language

2. Create different fields per language-analyzed field
   a. At indexing time, we populate the correct text field
      based on the article's language
"mappings": {
  "article_en": {
    "text" : {"type" : "string", "analyzer":"english", "store":true},
    "author:" {"type" : "string", "analyzer":"simple", "store": true}
    "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true}
  },
  "article_fr": {
    "text" : {"type" : "string", "analyzer":"french", "store":true},
    "author:" {"type" : "string", "analyzer":"simple", "store": true}
    "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true}
  },
  "article_de": {
    "text" : {"type" : "string", "analyzer":"german", "store":true},
    "author:" {"type" : "string", "analyzer":"simple", "store": true}
    "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true}
  }
}
"mappings": {
  "article": {
    "text_en" : {"type" : "string", "analyzer":"english", "store":true},
    "text_fr" : {"type" : "string", "analyzer":"french", "store":true},
    "text_de" : {"type" : "string", "analyzer":"german", "store":true},
    "author:" {"type" : "string", "analyzer":"simple", "store": true}
    "date": {"type" : "date", "format" : "yyyy-MM-dd'T'HH:mm:ssZ", "store":true}
  }
}
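(Not in the original deck.) A minimal indexing-time sketch for strategy 2b, assuming the article's language has already been detected; "language", "articleText", "authorName" and "publishedDate" are placeholder variables, and "client" / "your_index" follow the conventions of the query examples later in the deck.
...
// Sketch: route the article text into the language-matched field (strategy 2b).
// jsonBuilder = org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder
String textField = "text_" + language;          // e.g. "text_en", "text_fr", "text_de"

XContentBuilder doc = jsonBuilder().startObject()
      .field(textField, articleText)            // analyzed by that field's language analyzer
      .field("author", authorName)
      .field("date", publishedDate)             // string formatted as yyyy-MM-dd'T'HH:mm:ssZ
      .endObject();

IndexResponse indexResponse = client.prepareIndex("your_index", "article")
      .setSource(doc)
      .execute().actionGet();
// (for strategy 2a you would instead index into the type "article_" + language)
...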
Querying Strategies
How do we execute a language-targeted search?

... it all depends on our indexing strategy.
Querying Strategies
(1) Separate Indices per language
...
String targetIndex = getIndexForLanguage(languageParam);
SearchRequestBuilder request = client.prepareSearch(targetIndex)
       .setTypes("article");

QueryStringQueryBuilder query = QueryBuilders.queryString(
      "boston elasticsearch");
query.field("text");
query.analyzer("english");   // pick the analyzer matching the target index's language

request.setQuery(query);
SearchResponse searchResponse = request.execute().actionGet();
...
Querying Strategies
(2a) Same index for all languages - Diff. mapping types
...
String targetMapping = getMappingForLanguage(languageParam);
SearchRequestBuilder request = client.prepareSearch("your_index")
       .setTypes(targetMapping);

QueryStringQueryBuilder query = QueryBuilders.queryString(
      "boston elasticsearch");
query.field("text");
query.analyzer("english");   // pick the analyzer matching the target mapping's language

request.setQuery(query);
SearchResponse searchResponse = request.execute().actionGet();
...
Querying Strategies
(2b) Same index for all languages - Diff. fields
...
SearchRequestBuilder request = client.prepareSearch("your_index")
     .setTypes("article");

QueryStringQueryBuilder query = QueryBuilders.queryString(
      "boston elasticsearch");
query.field("text_en");      // pick the field for the target language: text_en, text_fr, text_de
query.analyzer("english");   // pick the matching analyzer: english, french, german

request.setQuery(query);
SearchResponse searchResponse = request.execute().actionGet();
...
Querying Strategies
● Will these strategies support a multi-language search?
  ○ E.g. search across French and German
  ○ E.g. search across all languages

● Yes! *

● In the same SearchRequest:
   ○ We can search against multiple indices
   ○ We can search against multiple "mapping" types
   ○ We can search against multiple fields

* Need to give thought to which query analyzer to use
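(Not in the original deck.) A hedged sketch of a two-language search under strategy 2b, in the same fragment style as the earlier query examples; field and index names follow the earlier mapping, and the analyzer choice is exactly the caveat flagged in the footnote above.
...
// Sketch: search French and German articles in a single request.
SearchRequestBuilder request = client.prepareSearch("your_index")
      .setTypes("article");

QueryStringQueryBuilder query = QueryBuilders.queryString(
      "boston elasticsearch");
query.field("text_fr");
query.field("text_de");
// Only one query-time analyzer applies to the whole query string, so a
// neutral choice such as "standard" is a common compromise across languages.
query.analyzer("standard");

request.setQuery(query);
SearchResponse searchResponse = request.execute().actionGet();
...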
Language Analysis
● What do ElasticSearch and/or Lucene offer us for
  analyzing various languages?

● Is there a one-size-fits-all solution?
   ○ e.g. StandardAnalyzer

● Or do we need custom analyzers for each language?
Language Analysis
StandardAnalyzer - The Good
● For many languages (French, Spanish), it will get you
  95% of the way there

● Each language analyzer provides its own flavor to the
  StandardAnalyzer

● FrenchAnalyzer
  ○ Adds an ElisionFilter (l'avion -> avion)
  ○ Adds French StopWords filter
  ○ FrenchLightStemFilter
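(Not in the original deck.) One quick way to see this behavior is the Analyze API from the Java client; a minimal sketch, with the exact output terms depending on the stemmer and stop-word list in your ES/Lucene version.
...
// Sketch: inspect what the built-in "french" analyzer emits for a sample string.
// AnalyzeResponse = org.elasticsearch.action.admin.indices.analyze.AnalyzeResponse
AnalyzeResponse analysis = client.admin().indices()
      .prepareAnalyze("l'avion rouge")
      .setAnalyzer("french")
      .execute().actionGet();

for (AnalyzeResponse.AnalyzeToken token : analysis.getTokens()) {
      System.out.println(token.getTerm());   // "avion" appears with the elision stripped
}
...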
Language Analysis
StandardAnalyzer - The Bad
● For some languages, it will get you 2/3 of the way there

● German makes heavy use of compound words
     ■ das Vaterland => The fatherland
     ■ Rechtsanwaltskanzleien => Law Firms

● For best search results, these compound words should
  produce index terms for their individual parts

● GermanAnalyzer lacks a Word Compound Token Filter
Language Analysis
StandardAnalyzer - The Ugly
● For other languages (e.g. Asian languages), it will not
  get you far

● Using a Standard Tokenizer to extract tokens from
  Chinese text will not produce accurate terms
  ○ Some 3rd-party Chinese analyzers will extract
     bigrams from Chinese text and index those as if they
     were words

● Need to do your research
Language Analysis
You should also know about...
● ASCII Folding Token Filter
  ○ über => uber

● ICU Analysis Plugin
   ○ http://www.elasticsearch.org/guide/reference/index-modules/analysis/icu-plugin.html
   ○ Allows for unicode normalization, collation and
     folding
Custom Analyzer / Token Filter
● Let's create a custom analyzer definition for German
  text (e.g. one without stemming)

● How do we go about doing this?
   ○ One way is to leverage ElasticSearch's flexible
     schema definitions
Lucene 3.6 - org.apache.lucene.analysis.de.GermanAnalyzer
Custom Analyzer / Token Filter
Create a custom German analyzer in our schema:
"settings" : {
  ....
  "analysis":{
    "analyzer":{
       "custom_text_german":{
          "type": "custom",
           "tokenizer": "standard",
           "filter": ["standard", "lowercase"], stop words, german normalization?
       }
    }
    ....
  }
}
Custom Analyzer / Token Filter
1.   Declare a schema filter for German stop words
2.   We'll also need to create a custom TokenFilter factory class to wrap Lucene's
     org.apache.lucene.analysis.de.GermanNormalizationFilter
     a.   It does not come as a pre-defined ES TokenFilter
     b.   German text needs certain characters normalized ... e.g.
          'ae' and 'oe' are replaced by 'a' and 'o', respectively

3.   Declare a schema filter for the custom GermanNormalizationFilter
package org.elasticsearch.index.analysis;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.de.GermanNormalizationFilter;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.inject.assistedinject.Assisted;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.settings.IndexSettings;

// Wraps Lucene's GermanNormalizationFilter so it can be referenced as a
// custom token filter from the index settings (see next slide)
public class GermanNormalizationFilterFactory extends AbstractTokenFilterFactory {
  @Inject
  public GermanNormalizationFilterFactory(Index index, @IndexSettings Settings indexSettings,
           @Assisted String name, @Assisted Settings settings) {
     super(index, indexSettings, name, settings);
  }
  @Override
  public TokenStream create(TokenStream tokenStream) {
     return new GermanNormalizationFilter(tokenStream);
  }
}
Custom Analyzer / Token Filter
Define new token filters in our schema:
"settings" : {
  "analysis":{
     ....
     "filter":{
       "german_normalization":{
          "type":"org.elasticsearch.index.analysis.GermanNormalizationFilterFactory"
       },
       "german_stop":{
          "type":"stop",
          "stopwords":["_german_"],
          "enable_position_increments":"true"
       }
     }
....
Custom Analyzer / Token Filter
Create a custom German analyzer:
"settings" : {
  ....
  "analysis":{
    "analyzer":{
       "custom_text_german":{
          "type":"custom",
           "tokenizer": "standard",
           "filter": ["german_normalization", "standard", "lowercase", "german_stop"],
       }
    }
    ....
  }
}
OS Projects
Language Detection
●   https://code.google.com/p/language-detection/
     ○ Written in Java
     ○ Provides language profiles with unigram, bigram, and trigram
         character frequencies
     ○ Detector provides a probability for each language detected

PROS
 ■ Very fast (~4k pieces of text per second)
 ■ Very reliable for text greater than 30-40 characters

CONS
 ■ Unreliable & inconsistent for small text samples (<30 characters) ... i.e.
   short tweets
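(Not in the original deck.) A rough usage sketch of the library; the class names follow the project's documented API but should be verified against its docs, and the profile path is a placeholder. The detected language code can drive the index / mapping / field routing from the earlier slides.

// Sketch of the language-detection API (verify names against the project docs).
import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.LangDetectException;

public class LanguageGuesser {
    public static String guessLanguage(String articleText) throws LangDetectException {
        // In real code the profiles (n-gram frequency files) are loaded once at startup.
        DetectorFactory.loadProfile("/path/to/language-profiles");   // placeholder path
        Detector detector = DetectorFactory.create();
        detector.append(articleText);
        return detector.detect();   // e.g. "en", "fr", "de" -> picks text_en / text_fr / text_de
    }
}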
OS Projects
German Word Decompounder
●   https://github.com/jprante/elasticsearch-analysis-decompound

●   Lucene offers two compound word token filters: a dictionary-based and a
    hyphenation-based variant
     ○ Not bundled with Lucene due to licensing issues
     ○ They require loading a word list into memory before they are run

●   The decompounder uses prebuilt Compact Patricia Tries for efficient word
    segmentation provided by the ASV toolbox
     ○ ASV Toolbox project - http://wortschatz.uni-leipzig.de/~cbiemann/software/toolbox/index.htm
