SlideShare uma empresa Scribd logo
1 de 31
Baixar para ler offline
Embracing Diversity: Searching
over multiple languages
Tommaso Teofili
Suneel Marthi
June 12, 2017
Berlin Buzzwords, Berlin, Germany
1
$WhoAreWe
Tommaso Teofili
 @tteofili
Software Engineer, Adobe Systems
Member of Apache Software Foundation,
PMC Chair, Apache Lucene
Committer and PMC on Apache Joshua, Apache OpenNLP, Apache JackRabbit
Suneel Marthi
 @suneelmarthi
Principal Software Engineer, Office of Technology, Red Hat
Member of Apache Software Foundation
Committer and PMC on Apache Mahout, Apache OpenNLP, Apache Streams
2
Agenda
What is Multi-Lingual Search ?
Why Multi-Lingual Search ?
What is Statistical Machine Translation ?
Overview of Apache Joshua
Dataflow Pipeline
Demo
3
What is Multi-Lingual Search ?
4
Searching
over content written in different languages
with users speaking different languages
both
Parallel corpora
Translating queries
Translating documents
5
Why Multi-Lingual Search ?
6
Embracing diversity
Most online tech content is in English
Wikipedia dumps:
en: 62GB
de: 17GB
it:   10GB
Good number of non-English speaking users
A lot of search queries are composed in English
Preferable to retrieve search results in native
language
… or even to consolidate all results in one language
7
UC1 — tech domain, native first
8
UC2 — native only ?
9
What is Machine Translation ?
10
Generate Translations from Statistical Models trained
on Bilingual Corpora.
Translation happens per a probability distribution
p(e/f)
E = string in the target language (English)
F = string in the source language (Spanish)
e~ = argmax p(e/f) = argmax p(f/e) * p(e)
e~ = best translation, the one with highest probability
11
Word-based Translation
12
How to translate a word → lookup in dictionary
Gebäude — building, house, tower.
Multiple translations
some more frequent than others
for instance: house and building most common
13
Look at a parallel corpus
(German text along with English translation)
Translation of Gebäude Count Probability
house 5.28 billion 0.51
building 4.16 billion 0.402
tower 9.28 million 0.09
14
Alignment
In a parallel text (or when we translate), we align
words in one language with the word in the other
Das Gebäude ist hoch
↓ ↓ ↓ ↓
the building is high
Word positions are numbered 1—4
15
Alignment Function
Define the Alignment with an Alignment Function
Mapping an English target word at position i to a
German source word at position j with a function a :
i → j
Example
a : {1 → 1, 2 → 2, 3 → 3, 4 → 4}
16
One-to-Many Translation
A source word could translate into multiple target words
Das ist ein Hochhaus   
↓ ↓ ↓ ↙ ↓ ↘
This is a high    rise building
17
Phrase-based Translation
18
Alignment Function
Word-Based Models translate words as atomic units
Phrase-Based Models translate phrases as atomic
units
Advantages:
many-to-many translation can handle non-
compositional phrases
use of local context in translation
the more data, the longer phrases can be learned
“Standard Model”, used by Google Translate and
others
19
Phrase-Based Model
Berlin ist ein herausragendes Kunst- und Kulturzentrum .
↓ ↓ ↓ ↓ ↓ ↓
Berlin is an outstanding Art and cultural center .
Foreign input is segmented in phrases
Each phrase is translated into English
Phrases are reordered
20
Decoding
21
We have a mathematical model for translation
p(e|f)
Task of decoding: find the translation ebest with
highest probability
Two types of error
the most probable translation is bad → fix the
model
search does not find the most probable translation
→ fix the search
ebest = argmax p(e|f)
22
Translation Process
Translate this query from German into English
er trinkt ja noch nichts
er        
↓        
he        
Pick and input phrase, translate
23
Translation Process
Translate this query from German into English
er trinkt ja noch nichts
er     ja noch nichts
↓      
he   does not yet  
Pick and input phrase, translate
24
Translation Process
Translate this query from German into English
er trinkt ja noch nichts
er   trinkt   ja noch nichts
↓      
he   does not yet   drink
Pick and input phrase, translate
25
Apache Joshua
26
Statistical Machine Translation Decoder for phrase-based and hierarch
machine translation
Written in Java
Provide 64 language packs for machine translation
https://cwiki.apache.org/confluence/display/JOSHUA/Language+P
Project initiated by Johns Hopkins Univ. and University of Pennsylvania
Presently incubating at Apache Software Foundation
Used extensively by Amazon.com, NASA JPL
https://cwiki.apache.org/confluence/display/JOSHUA
 @ApacheJoshua
27
Flows
28
References
Apache Joshua — https://cwiki.apache.org/confluence/display/JOSHUA
Apache OpenNLP — https://opennlp.apache.org
GitHub — https://github.com/smarthi/BBuzz-multilang-search
Slides — https://smarthi.github.io/bbuzz17-embracing-diversity-searching-
over-multiple-languages/#/
29
Credits
Joern Kottmann — PMC Chair, Apache OpenNLP
Matt Post — PMC Chair, Apache Joshua
Bruno P. Kinoshita — Committer on Apache OpenNLP,
committer and PMC on Apache Commons and Apache
Jena
30
Questions ???
31

Mais conteúdo relacionado

Mais procurados

2015 bioinformatics python_introduction_wim_vancriekinge_vfinal
2015 bioinformatics python_introduction_wim_vancriekinge_vfinal2015 bioinformatics python_introduction_wim_vancriekinge_vfinal
2015 bioinformatics python_introduction_wim_vancriekinge_vfinalProf. Wim Van Criekinge
 
Natural Language Processing in Practice
Natural Language Processing in PracticeNatural Language Processing in Practice
Natural Language Processing in PracticeVsevolod Dyomkin
 
Programming languages vienna
Programming languages viennaProgramming languages vienna
Programming languages viennagreg_s
 
Parallel Corpora in (Machine) Translation: goals, issues and methodologies
Parallel Corpora in (Machine) Translation: goals, issues and methodologiesParallel Corpora in (Machine) Translation: goals, issues and methodologies
Parallel Corpora in (Machine) Translation: goals, issues and methodologiesAntonio Toral
 
ANTLR4 and its testing
ANTLR4 and its testingANTLR4 and its testing
ANTLR4 and its testingKnoldus Inc.
 
Introduction to python programming
Introduction to python programmingIntroduction to python programming
Introduction to python programmingSrinivas Narasegouda
 
Python and its Applications
Python and its ApplicationsPython and its Applications
Python and its ApplicationsAbhijeet Singh
 
Lambda The Extreme: Test-Driving a Functional Language
Lambda The Extreme: Test-Driving a Functional LanguageLambda The Extreme: Test-Driving a Functional Language
Lambda The Extreme: Test-Driving a Functional LanguageAccenture | SolutionsIQ
 
You too can nlp - PyBay 2018 lightning talk
You too can nlp - PyBay 2018 lightning talkYou too can nlp - PyBay 2018 lightning talk
You too can nlp - PyBay 2018 lightning talkJacob Perkins
 
Natural Language Processing using Text Mining
Natural Language Processing using Text MiningNatural Language Processing using Text Mining
Natural Language Processing using Text MiningSushanti Acharya
 
Chapter 1 - INTRODUCTION TO PYTHON -MAULIK BORSANIYA
Chapter 1 - INTRODUCTION TO PYTHON -MAULIK BORSANIYAChapter 1 - INTRODUCTION TO PYTHON -MAULIK BORSANIYA
Chapter 1 - INTRODUCTION TO PYTHON -MAULIK BORSANIYAMaulik Borsaniya
 
An Extensible Multilingual Open Source Lemmatizer
An Extensible Multilingual Open Source LemmatizerAn Extensible Multilingual Open Source Lemmatizer
An Extensible Multilingual Open Source LemmatizerCOMRADES project
 

Mais procurados (20)

2015 bioinformatics python_introduction_wim_vancriekinge_vfinal
2015 bioinformatics python_introduction_wim_vancriekinge_vfinal2015 bioinformatics python_introduction_wim_vancriekinge_vfinal
2015 bioinformatics python_introduction_wim_vancriekinge_vfinal
 
Natural Language Processing in Practice
Natural Language Processing in PracticeNatural Language Processing in Practice
Natural Language Processing in Practice
 
Programming languages vienna
Programming languages viennaProgramming languages vienna
Programming languages vienna
 
NLP Project Full Cycle
NLP Project Full CycleNLP Project Full Cycle
NLP Project Full Cycle
 
Parallel Corpora in (Machine) Translation: goals, issues and methodologies
Parallel Corpora in (Machine) Translation: goals, issues and methodologiesParallel Corpora in (Machine) Translation: goals, issues and methodologies
Parallel Corpora in (Machine) Translation: goals, issues and methodologies
 
Lets learn Python !
Lets learn Python !Lets learn Python !
Lets learn Python !
 
ANTLR4 and its testing
ANTLR4 and its testingANTLR4 and its testing
ANTLR4 and its testing
 
OpenNLP demo
OpenNLP demoOpenNLP demo
OpenNLP demo
 
Introduction to python programming
Introduction to python programmingIntroduction to python programming
Introduction to python programming
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Python and its Applications
Python and its ApplicationsPython and its Applications
Python and its Applications
 
Python Programming
Python ProgrammingPython Programming
Python Programming
 
Python
PythonPython
Python
 
Lambda The Extreme: Test-Driving a Functional Language
Lambda The Extreme: Test-Driving a Functional LanguageLambda The Extreme: Test-Driving a Functional Language
Lambda The Extreme: Test-Driving a Functional Language
 
Introduction to python
Introduction to pythonIntroduction to python
Introduction to python
 
Python - the basics
Python - the basicsPython - the basics
Python - the basics
 
You too can nlp - PyBay 2018 lightning talk
You too can nlp - PyBay 2018 lightning talkYou too can nlp - PyBay 2018 lightning talk
You too can nlp - PyBay 2018 lightning talk
 
Natural Language Processing using Text Mining
Natural Language Processing using Text MiningNatural Language Processing using Text Mining
Natural Language Processing using Text Mining
 
Chapter 1 - INTRODUCTION TO PYTHON -MAULIK BORSANIYA
Chapter 1 - INTRODUCTION TO PYTHON -MAULIK BORSANIYAChapter 1 - INTRODUCTION TO PYTHON -MAULIK BORSANIYA
Chapter 1 - INTRODUCTION TO PYTHON -MAULIK BORSANIYA
 
An Extensible Multilingual Open Source Lemmatizer
An Extensible Multilingual Open Source LemmatizerAn Extensible Multilingual Open Source Lemmatizer
An Extensible Multilingual Open Source Lemmatizer
 

Semelhante a Embracing diversity searching over multiple languages

How to expand your nlp solution to new languages using transfer learning
How to expand your nlp solution to new languages using transfer learningHow to expand your nlp solution to new languages using transfer learning
How to expand your nlp solution to new languages using transfer learningLena Shakurova
 
"Machine Translation 101" and the Challenge of Patents
"Machine Translation 101" and the Challenge of Patents"Machine Translation 101" and the Challenge of Patents
"Machine Translation 101" and the Challenge of PatentsIconic Translation Machines
 
Machine translation from English to Hindi
Machine translation from English to HindiMachine translation from English to Hindi
Machine translation from English to HindiRajat Jain
 
The Latest Advances in Patent Machine Translation
The Latest Advances in Patent Machine TranslationThe Latest Advances in Patent Machine Translation
The Latest Advances in Patent Machine TranslationIconic Translation Machines
 
Preparing an Open Source Documentation Repository for Translations
Preparing an Open Source Documentation Repository for TranslationsPreparing an Open Source Documentation Repository for Translations
Preparing an Open Source Documentation Repository for TranslationsHPCC Systems
 
Top 100 PHP Questions and Answers
Top 100 PHP Questions and AnswersTop 100 PHP Questions and Answers
Top 100 PHP Questions and Answersiimjobs and hirist
 
MT domain customization – conditions and benefits. Chris Wendt (Microsoft)
MT domain customization – conditions and benefits. Chris Wendt (Microsoft)MT domain customization – conditions and benefits. Chris Wendt (Microsoft)
MT domain customization – conditions and benefits. Chris Wendt (Microsoft)TAUS - The Language Data Network
 
Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...John Tinsley
 
Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...Iconic Translation Machines
 
MACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMassimo Schenone
 
Multilingualism ifla 2014 08
Multilingualism ifla 2014 08Multilingualism ifla 2014 08
Multilingualism ifla 2014 08Janifer Gatenby
 
Towards Universal Language Understanding (2020 version)
Towards Universal Language Understanding (2020 version)Towards Universal Language Understanding (2020 version)
Towards Universal Language Understanding (2020 version)Yunyao Li
 

Semelhante a Embracing diversity searching over multiple languages (20)

Nlp
NlpNlp
Nlp
 
How to expand your nlp solution to new languages using transfer learning
How to expand your nlp solution to new languages using transfer learningHow to expand your nlp solution to new languages using transfer learning
How to expand your nlp solution to new languages using transfer learning
 
Swift vs. Language X
Swift vs. Language XSwift vs. Language X
Swift vs. Language X
 
"Machine Translation 101" and the Challenge of Patents
"Machine Translation 101" and the Challenge of Patents"Machine Translation 101" and the Challenge of Patents
"Machine Translation 101" and the Challenge of Patents
 
Machine translation from English to Hindi
Machine translation from English to HindiMachine translation from English to Hindi
Machine translation from English to Hindi
 
The Latest Advances in Patent Machine Translation
The Latest Advances in Patent Machine TranslationThe Latest Advances in Patent Machine Translation
The Latest Advances in Patent Machine Translation
 
Preparing an Open Source Documentation Repository for Translations
Preparing an Open Source Documentation Repository for TranslationsPreparing an Open Source Documentation Repository for Translations
Preparing an Open Source Documentation Repository for Translations
 
Top 100 PHP Questions and Answers
Top 100 PHP Questions and AnswersTop 100 PHP Questions and Answers
Top 100 PHP Questions and Answers
 
MT domain customization – conditions and benefits. Chris Wendt (Microsoft)
MT domain customization – conditions and benefits. Chris Wendt (Microsoft)MT domain customization – conditions and benefits. Chris Wendt (Microsoft)
MT domain customization – conditions and benefits. Chris Wendt (Microsoft)
 
Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...
 
Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...
 
PHP TRAINING
PHP TRAININGPHP TRAINING
PHP TRAINING
 
MACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSIS
 
Moses
MosesMoses
Moses
 
Programing Language
Programing LanguagePrograming Language
Programing Language
 
Php packages
Php packagesPhp packages
Php packages
 
Python overview
Python overviewPython overview
Python overview
 
Multilingualism ifla 2014 08
Multilingualism ifla 2014 08Multilingualism ifla 2014 08
Multilingualism ifla 2014 08
 
“Neural Machine Translation for low resource languages: Use case anglais - wo...
“Neural Machine Translation for low resource languages: Use case anglais - wo...“Neural Machine Translation for low resource languages: Use case anglais - wo...
“Neural Machine Translation for low resource languages: Use case anglais - wo...
 
Towards Universal Language Understanding (2020 version)
Towards Universal Language Understanding (2020 version)Towards Universal Language Understanding (2020 version)
Towards Universal Language Understanding (2020 version)
 

Mais de Suneel Marthi

Measuring vegetation health to predict natural hazards
Measuring vegetation health to predict natural hazardsMeasuring vegetation health to predict natural hazards
Measuring vegetation health to predict natural hazardsSuneel Marthi
 
Large scale landuse classification of satellite imagery
Large scale landuse classification of satellite imageryLarge scale landuse classification of satellite imagery
Large scale landuse classification of satellite imagerySuneel Marthi
 
Streaming topic model training and inference
Streaming topic model training and inferenceStreaming topic model training and inference
Streaming topic model training and inferenceSuneel Marthi
 
Large scale landuse classification of satellite imagery
Large scale landuse classification of satellite imageryLarge scale landuse classification of satellite imagery
Large scale landuse classification of satellite imagerySuneel Marthi
 
Building streaming pipelines for neural machine translation
Building streaming pipelines for neural machine translationBuilding streaming pipelines for neural machine translation
Building streaming pipelines for neural machine translationSuneel Marthi
 
Moving beyond moving bytes
Moving beyond moving bytesMoving beyond moving bytes
Moving beyond moving bytesSuneel Marthi
 
Distributed Machine Learning with Apache Mahout
Distributed Machine Learning with Apache MahoutDistributed Machine Learning with Apache Mahout
Distributed Machine Learning with Apache MahoutSuneel Marthi
 
Apache Flink Stream Processing
Apache Flink Stream ProcessingApache Flink Stream Processing
Apache Flink Stream ProcessingSuneel Marthi
 

Mais de Suneel Marthi (8)

Measuring vegetation health to predict natural hazards
Measuring vegetation health to predict natural hazardsMeasuring vegetation health to predict natural hazards
Measuring vegetation health to predict natural hazards
 
Large scale landuse classification of satellite imagery
Large scale landuse classification of satellite imageryLarge scale landuse classification of satellite imagery
Large scale landuse classification of satellite imagery
 
Streaming topic model training and inference
Streaming topic model training and inferenceStreaming topic model training and inference
Streaming topic model training and inference
 
Large scale landuse classification of satellite imagery
Large scale landuse classification of satellite imageryLarge scale landuse classification of satellite imagery
Large scale landuse classification of satellite imagery
 
Building streaming pipelines for neural machine translation
Building streaming pipelines for neural machine translationBuilding streaming pipelines for neural machine translation
Building streaming pipelines for neural machine translation
 
Moving beyond moving bytes
Moving beyond moving bytesMoving beyond moving bytes
Moving beyond moving bytes
 
Distributed Machine Learning with Apache Mahout
Distributed Machine Learning with Apache MahoutDistributed Machine Learning with Apache Mahout
Distributed Machine Learning with Apache Mahout
 
Apache Flink Stream Processing
Apache Flink Stream ProcessingApache Flink Stream Processing
Apache Flink Stream Processing
 

Último

Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...gajnagarg
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...gajnagarg
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachBoston Institute of Analytics
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...gajnagarg
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 

Último (20)

Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 

Embracing diversity searching over multiple languages

  • 1. Embracing Diversity: Searching over multiple languages Tommaso Teofili Suneel Marthi June 12, 2017 Berlin Buzzwords, Berlin, Germany 1
  • 2. $WhoAreWe Tommaso Teofili  @tteofili Software Engineer, Adobe Systems Member of Apache Software Foundation, PMC Chair, Apache Lucene Committer and PMC on Apache Joshua, Apache OpenNLP, Apache JackRabbit Suneel Marthi  @suneelmarthi Principal Software Engineer, Office of Technology, Red Hat Member of Apache Software Foundation Committer and PMC on Apache Mahout, Apache OpenNLP, Apache Streams 2
  • 3. Agenda What is Multi-Lingual Search ? Why Multi-Lingual Search ? What is Statistical Machine Translation ? Overview of Apache Joshua Dataflow Pipeline Demo 3
  • 5. Searching over content written in different languages with users speaking different languages both Parallel corpora Translating queries Translating documents 5
  • 7. Embracing diversity Most online tech content is in English Wikipedia dumps: en: 62GB de: 17GB it:   10GB Good number of non-English speaking users A lot of search queries are composed in English Preferable to retrieve search results in native language … or even to consolidate all results in one language 7
  • 8. UC1 — tech domain, native first 8
  • 9. UC2 — native only ? 9
  • 10. What is Machine Translation ? 10
  • 11. Generate Translations from Statistical Models trained on Bilingual Corpora. Translation happens per a probability distribution p(e/f) E = string in the target language (English) F = string in the source language (Spanish) e~ = argmax p(e/f) = argmax p(f/e) * p(e) e~ = best translation, the one with highest probability 11
  • 13. How to translate a word → lookup in dictionary Gebäude — building, house, tower. Multiple translations some more frequent than others for instance: house and building most common 13
  • 14. Look at a parallel corpus (German text along with English translation) Translation of Gebäude Count Probability house 5.28 billion 0.51 building 4.16 billion 0.402 tower 9.28 million 0.09 14
  • 15. Alignment In a parallel text (or when we translate), we align words in one language with the word in the other Das Gebäude ist hoch ↓ ↓ ↓ ↓ the building is high Word positions are numbered 1—4 15
  • 16. Alignment Function Define the Alignment with an Alignment Function Mapping an English target word at position i to a German source word at position j with a function a : i → j Example a : {1 → 1, 2 → 2, 3 → 3, 4 → 4} 16
  • 17. One-to-Many Translation A source word could translate into multiple target words Das ist ein Hochhaus    ↓ ↓ ↓ ↙ ↓ ↘ This is a high    rise building 17
  • 19. Alignment Function Word-Based Models translate words as atomic units Phrase-Based Models translate phrases as atomic units Advantages: many-to-many translation can handle non- compositional phrases use of local context in translation the more data, the longer phrases can be learned “Standard Model”, used by Google Translate and others 19
  • 20. Phrase-Based Model Berlin ist ein herausragendes Kunst- und Kulturzentrum . ↓ ↓ ↓ ↓ ↓ ↓ Berlin is an outstanding Art and cultural center . Foreign input is segmented in phrases Each phrase is translated into English Phrases are reordered 20
  • 22. We have a mathematical model for translation p(e|f) Task of decoding: find the translation ebest with highest probability Two types of error the most probable translation is bad → fix the model search does not find the most probable translation → fix the search ebest = argmax p(e|f) 22
  • 23. Translation Process Translate this query from German into English er trinkt ja noch nichts er         ↓         he         Pick and input phrase, translate 23
  • 24. Translation Process Translate this query from German into English er trinkt ja noch nichts er     ja noch nichts ↓       he   does not yet   Pick and input phrase, translate 24
  • 25. Translation Process Translate this query from German into English er trinkt ja noch nichts er   trinkt   ja noch nichts ↓       he   does not yet   drink Pick and input phrase, translate 25
  • 27. Statistical Machine Translation Decoder for phrase-based and hierarch machine translation Written in Java Provide 64 language packs for machine translation https://cwiki.apache.org/confluence/display/JOSHUA/Language+P Project initiated by Johns Hopkins Univ. and University of Pennsylvania Presently incubating at Apache Software Foundation Used extensively by Amazon.com, NASA JPL https://cwiki.apache.org/confluence/display/JOSHUA  @ApacheJoshua 27
  • 29. References Apache Joshua — https://cwiki.apache.org/confluence/display/JOSHUA Apache OpenNLP — https://opennlp.apache.org GitHub — https://github.com/smarthi/BBuzz-multilang-search Slides — https://smarthi.github.io/bbuzz17-embracing-diversity-searching- over-multiple-languages/#/ 29
  • 30. Credits Joern Kottmann — PMC Chair, Apache OpenNLP Matt Post — PMC Chair, Apache Joshua Bruno P. Kinoshita — Committer on Apache OpenNLP, committer and PMC on Apache Commons and Apache Jena 30